The Payment Card Industry (PCI) Data Security Standard (DSS) defines a set of guidelines for securing credit card data in order to minimize theft, fraud, and misuse. Any cloud application that accepts, transmits, or stores cardholder data must comply with the PCI DSS. Both preparing for a first-time implementation and maintaining ongoing compliance are known to be very time consuming and exhaustive. It takes 3 to 6 months on average to build out the infrastructure. In terms of staffing, for a modest infrastructure of 50 VMs, we typically see one devops, one secops and one infosec engineer, each working full time.
But why is this so hard and time consuming? In this blog we break down the process and highlight some of the key challenges of having PCI DSS compliant controls in place.
There is a control matrix of 76 applicable controls that are required. We will use this as the basis of the time and effort estimates and discuss other factors that impact the process. There are 5 key challenges that we focus on in this blog.
Challenge 1: Infrastructure-as-code is not a Remedy
Infrastructure as code is the new trend of maintaining all infrastructure definitions in a set of files and keeping them in a version control system for review, change management and visibility. Although it helps in some ways, it also means that cloud operators now have to learn a new declarative language and express all of their intent correctly in that language. As infrastructure grows, it becomes more and more complex to guarantee that everything created is secure, compliant and follows best practices. The work includes writing Terraform templates, CloudFormation templates, Ansible scripts, and Python and Bash programs. The tools orchestrated range from AWS native services like VPC, Security Groups, EC2 and KMS, to ISV software like Alert Logic, to open source tools like ClamAV, Wazuh, Tripwire and so on. Given the diversity of tools and configuration, as the number of resources increases, it gets harder to write code, test it, review it and roll out changes to production. In fact, the 2020 Cloud Threat Report released by Palo Alto Networks identifies around 200,000 potential vulnerabilities in existing Infrastructure-as-code templates.
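To make the review burden concrete, here is a minimal sketch of the kind of check a team ends up writing and maintaining alongside its templates. It is an illustration only: the IngressRule structure, the group names and the set of publicly allowed ports are all assumptions made for this example, not part of any real tool or of the PCI DSS itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, simplified view of one security group ingress rule."""
    group: str  # illustrative security group name
    port: int
    cidr: str


# Assumption for this sketch: only HTTPS may be exposed to the whole internet.
PUBLIC_PORTS_ALLOWED = {443}


def find_violations(rules):
    """Return rules that open a non-allowed port to 0.0.0.0/0."""
    return [
        rule for rule in rules
        if rule.cidr == "0.0.0.0/0" and rule.port not in PUBLIC_PORTS_ALLOWED
    ]


rules = [
    IngressRule("web", 443, "0.0.0.0/0"),    # public HTTPS, allowed here
    IngressRule("web", 22, "0.0.0.0/0"),     # SSH open to the world, flagged
    IngressRule("db", 5432, "10.0.1.0/24"),  # database reachable only internally
]

violations = find_violations(rules)
for rule in violations:
    print(f"{rule.group}: port {rule.port} is open to {rule.cidr}")
```

Every new resource type, tool or template language needs its own version of a check like this, which is one reason the effort grows with the size of the infrastructure.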
Table 1 shows a snippet of the PCI controls with a sample implementation using the Infrastructure-as-code technique. We can complete this costing exercise for each of the 74 applicable PCI controls to arrive at an overall estimate.
If each control takes an average of 2 days to implement and test, then implementing the 74 operational controls for an infrastructure with 50 virtual machines and a single payment card application typically takes about 148 days, or about 6 months. And even after this automation, one has to keep the system updated with ongoing code changes as application needs evolve, which is common in fast growing SaaS companies.
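The estimate above is simple multiplication, and a short sketch shows how sensitive it is to the per-control average. The 74 controls and the 2-day average come from the text; the 1-day and 3-day rows are illustrative assumptions added only to show the spread.

```python
def implementation_days(controls: int, avg_days_per_control: float) -> float:
    """Total effort if every control takes the same average time."""
    return controls * avg_days_per_control


CONTROLS = 74  # operational controls counted in the text

# 2 days per control is the figure used in the text; 1 and 3 are illustrative.
for avg_days in (1, 2, 3):
    total = implementation_days(CONTROLS, avg_days)
    print(f"{avg_days} day(s) per control -> {total} days")
```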
The bigger the infrastructure, the more elaborate the needed controls become, resulting in larger implementations and code bases. As the code base grows, it becomes harder to make new changes because one has to understand more existing code and test for regressions, and code review cycles become longer. The devops team will grow too, further increasing operational expenditure. Moreover, different devops engineers prefer their own favorite languages. So whenever there is churn in an organization, the new team may choose its own favorite tool or language, leading to a dreadful mix of code floating around. Eventually the code becomes a burden and slows down the development process instead of making it faster.
Challenge 2: Compliance is an afterthought
Commonly, compliance is an afterthought, especially in fast growing companies with limited resources. The foundation for the infrastructure provisioning and automation architecture at the devops layer is in place before compliance requirements are considered, because product development and go-to-market are the first priorities.
The guidelines are strict, and many are ignored at the DevOps layer. “Production access should be strictly controlled, time bound and only need-basis” is easier said than done. Many changes depend on the initial setup and require reprovisioning of resources, such as separating VPCs between production and staging or moving unencrypted databases. Technologies like Kubernetes recommend wide open ports across all nodes in the cluster. Below is an official security group recommendation from AWS.
This one recommendation will blacklist EKS from an Infosec perspective. Imagine how DevOps will take that. Now we get into the complexities of separating worker nodes into separate security groups, which without complex automation will break KubeDNS, Ingress controllers and service-to-service communication.
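The breakage described above can be pictured as a reachability check: once worker nodes are split across security groups, every flow the cluster depends on has to be allowed explicitly. The sketch below is hypothetical throughout. The group names, the application port 8080 and the tuple format are assumptions made for illustration and do not reflect how AWS or Kubernetes represent rules.

```python
# Each flow is (source group, destination group, protocol, port).
REQUIRED_FLOWS = [
    # Pods resolving names via cluster DNS (UDP 53; TCP 53 omitted for brevity).
    ("workers-app", "workers-system", "udp", 53),
    # Ingress controller forwarding to application pods (port is illustrative).
    ("ingress", "workers-app", "tcp", 8080),
    # Service-to-service traffic between application pods.
    ("workers-app", "workers-app", "tcp", 8080),
]


def missing_flows(allowed_rules, required=REQUIRED_FLOWS):
    """Return the required flows that the proposed rules do not allow."""
    allowed = set(allowed_rules)
    return [flow for flow in required if flow not in allowed]


# A tightened rule set that forgot the DNS path.
proposed_rules = [
    ("ingress", "workers-app", "tcp", 8080),
    ("workers-app", "workers-app", "tcp", 8080),
]

gaps = missing_flows(proposed_rules)
for src, dst, proto, port in gaps:
    print(f"blocked: {src} -> {dst} {proto}/{port}")
```

In this example the tightened rules silently drop the DNS path, which is the kind of failure that only shows up once workloads stop resolving each other.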
Challenge 3: Lack of Cross-Discipline Expertise across DevOps and Infosec
In typical PCI and HIPAA standards, about 70% of the non-policy controls are at the resource provisioning or devops layer and the remaining 30% are at the secops layer. 100% of policy-based controls are under the purview of infosec. This makes it hard for one team to own all the controls, and there is often finger-pointing across teams when errors occur. This poses a great risk to a business, where the cost of non-compliance due to anyone’s mistake can be extremely high. In fact, it often leads to centralized controls in a company, which again slows down the overall development and deployment of applications.
Devops teams and cloud developers lack the skill set or the will to understand the nuances of the policies formulated by infosec, and similarly infosec lacks the will and skill to understand the nuances of devops configurations and their practical considerations.
Infosec engineers who cover the devops team's tracks by adding “Acceptable risk” notes to the organization’s infosec policy, rather than requiring any change to configuration or operational procedure, are the devops team's best friends.
Challenge 4: Need to Unlearn Legacy Enterprise Security Architecture
In the traditional on-premises infrastructure world, the IT discipline was split into storage, compute, network, OS and security admins. Invariably, much of the provisioning occurs in storage, compute and network. Security controls are implemented in centralized hardware like firewalls and IDS, which can be provisioned independently of, and non-intrusively to, the rest of the system. Conversely, in the public cloud, security controls need to be baked in at compute provisioning time in a unified devops workflow. This requires an automation skill set that is scarce in more traditional infrastructure security teams. The engineering teams want to adopt cloud at a rapid pace and are perceived as reckless by IT, while IT is viewed as a constant roadblock.
Challenge 5: Standards are a favorite Punching Bag
“Compliance standards are archaic!” This could not be farther from the truth. But this attitude among devops and engineering teams leads to bad decisions in the implementation phase that hurt enormously when an auditor or infosec rejects the implementation. Compliance controls are by and large solid best practices. Look at the security matrix in the AWS PCI DSS guide and try to find a single control that is not a best practice. A control being harder to implement does not make it archaic. In fact, most problems arise because existing teams need to learn new tricks and adapt to cloud-based best practices in order to meet the controls.
Implementing a highly secure and compliant cloud infrastructure is far from a solved problem, even with today’s automation tools and scripting languages. The management of resources, scripts and code gets worse over time and makes the process error prone and slow. Further, the skill-set needed now includes both operations expertise and writing good code, which means that finding and hiring cloud operators who have the required know-how remains a challenge. Sourcing infosec engineers with operational expertise is even harder.
So the real question is: Can we have a different approach, where machines take human intent and a high level declarative specification of product architecture, scale, security and compliance, and auto-generate all the code and resources needed for a fully compliant cloud infrastructure?
Can we go from Infrastructure as code to No-code based automation?
After all, inside AWS and Azure, a few hundred people run over 20 million workloads across the globe with 99.99% availability, infinite scale and compliance with all regulatory standards.
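As a rough illustration of what generating infrastructure from intent could look like, the sketch below expands a small declarative specification into concrete resource definitions. Everything in it is hypothetical: the intent fields, the expand_intent function and the output format are assumptions invented for this example and do not describe any existing product. The two rules it applies, one VPC per environment and encrypted, non-public databases, echo the reprovisioning pain points mentioned under Challenge 2.

```python
def expand_intent(intent):
    """Expand a high-level intent into a list of resource definitions."""
    app = intent["app"]
    pci = intent.get("compliance") == "pci"
    resources = []
    for env in intent["environments"]:
        vpc_name = f"{app}-{env}"
        # One VPC per environment keeps production and staging separated.
        resources.append({"type": "vpc", "name": vpc_name})
        resources.append({
            "type": "database",
            "name": f"{vpc_name}-db",
            "vpc": vpc_name,
            "encrypted": pci,   # encryption is derived from the compliance intent
            "public": False,    # never exposed outside its VPC in this sketch
        })
    return resources


intent = {
    "app": "payments",  # illustrative application name
    "compliance": "pci",
    "environments": ["production", "staging"],
}

resources = expand_intent(intent)
for resource in resources:
    print(resource)
```

The operator states what is needed, and the security-relevant settings are derived rather than hand-written for each resource.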