This article describes how Square issues identity to applications in AWS Lambda so they can make authenticated calls to applications running in our data center, and why we chose mutual TLS to do so. A previous post describes how we use AWS Lambda at Square at a high level; this post dives into the security aspects.
In short, we built a custom certificate issuance system working with short-lived certificates through AWS Private Certificate Authority (PCA) so we never expose the root key and have an audit trail of all issued certificates. Lambdas hold private keys and Certificate Signing Requests (CSRs) in Secrets Manager where they are protected with IAM (Identity and Access Management) policies and Service Control Policies (SCPs).
The certificates we issue are compatible with SPIFFE, an open standard for workload identity in zero trust networks. To keep cold start time low and keep Lambdas lightweight, certificates are issued ahead of execution time, making sure not to introduce a bottleneck or compromise the security of workloads. We currently use this system in production.
Square is in the process of moving from our data centers to the cloud. This process is in an early phase, with the majority of our processing still happening in the data center where we operate microservices. These microservices are authenticated via mutual TLS (mTLS) on our own Public Key Infrastructure (PKI) to enable secure communication between microservices. The identity of applications is encoded in short-lived x509 certificates. We have been using mTLS authentication for years, and have been gradually migrating services to SPIFFE within our infrastructure using SPIRE servers. In the data center traffic is routed via an Envoy service mesh. As part of the move to the cloud, we decided to support authenticated calls by AWS Lambda into our service mesh. This will enable developers to either build new standalone applications or to break parts of their existing applications out and operate them separately if they are a good fit for serverless.
Lambdas can be used for tasks that happen with unpredictable load spikes, as Lambda can scale up quickly. They are also great for infrequent tasks where developers don’t want to allocate servers permanently, as they would be mostly idle. However, AWS offers no SPIFFE-compatible identity for Lambdas that would allow us to authenticate to the data center. The main challenge we tackled was making authenticated calls into the data center possible, staying compatible with the SPIFFE standard to use authentication infrastructure we have in place in the data center. However, we also had to solve for usability constraints. For example, because the key properties that make Lambda appealing are fast startup time and great scalability, the system we built could not noticeably impact Lambda performance.
We ended up building a custom certificate issuance system that automatically creates certificates for Lambdas ahead of execution time, so certificates are available in Secrets Manager before a Lambda starts up. Lambdas access their private keys and certificates in Secrets Manager. The certificate issuance system is itself implemented in Lambda. This blog post provides details on what we built.
When collecting requirements for this system it soon became clear that using the same approach as in the data center would not be possible. We allocate a SPIRE agent per node and an Envoy sidecar with each microservice in the data center. We need to register workloads on deploy, and the delay through setup and resource use is acceptable for long-lived services. However, for Lambdas that operate with limited resources and need to start up fast, these dependencies are not feasible. Furthermore, Lambdas don’t support sidecars.
These are the goals that we wanted to achieve in this project; some were specific to Lambda, while other security goals are more broadly applicable to identity at Square.
- Identity has to be compatible with our SPIFFE-enabled data center applications, so that Lambdas can authenticate seamlessly. A Lambda calling a data center application should be no different from a call within the data center.
- Lambda startup time should be kept low to not cancel out the advantages of Lambda. We want to keep performance overhead low so the Lambda can do real work as fast as possible.
- Reduce exposure of private key material. Specifically, never expose the root key: a third party in possession of the root key would be able to issue valid identities, which is a major security risk.
- Make identity issuance auditable. We want to be able to tell at any time how many and what certificates have been issued.
- Availability. This is implicit in any security infrastructure project: secure systems that are not available are not helpful to developers.
Mutual TLS vs. Signed Requests
AWS uses signed requests for internal authentication, while Square uses mTLS. These are fundamentally different approaches, and we had to decide whether to build infrastructure to send signed requests to the data center and authenticate them there, or to extend Lambda to make it compatible with our data center infrastructure. We did not find an established practice that would allow us to use signed calls out of AWS to authenticate with our existing infrastructure. Since we consider Lambdas logically equivalent to our data center applications, developers expect Lambdas to work similarly as well. Ultimately, from Square’s perspective mTLS was the most sound approach, as it is well established within our infrastructure and only required the AWS side to change.
Identity of a serverless workload
In this project we aim to have a unified identity solution with the rest of Square that still plays to the specifics of Lambda. We have been gradually switching to use SPIFFE in both the data center and the cloud, where the identity of workloads is described by a URI SAN on x509 certificates. As an implementation of SPIFFE we use SPIRE servers that require workloads to be registered centrally on deploy to receive certificates.
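To make the URI SAN concrete, here is a minimal sketch of parsing and validating a SPIFFE ID of the form spiffe://trust-domain/path using only the Python standard library. The trust domain `example.test` and the path `/lambda/orders` are hypothetical placeholders, not Square's naming scheme, and the checks cover only the basic rules of the SPIFFE ID format.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlsplit

# Trust domains are lowercase; path segments allow letters, digits,
# dots, dashes, and underscores.
_TRUST_DOMAIN = re.compile(r"[a-z0-9._-]+")
_PATH_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


@dataclass(frozen=True)
class SpiffeId:
    trust_domain: str
    path: str

    def __str__(self) -> str:
        return f"spiffe://{self.trust_domain}{self.path}"


def parse_spiffe_id(uri: str) -> SpiffeId:
    """Parse and validate a workload SPIFFE ID taken from a URI SAN."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe":
        raise ValueError("scheme must be 'spiffe'")
    if parts.query or parts.fragment:
        raise ValueError("query and fragment are not allowed")
    if not _TRUST_DOMAIN.fullmatch(parts.netloc):
        raise ValueError("invalid trust domain")
    segments = parts.path.split("/")[1:]
    if not segments:
        raise ValueError("a workload SPIFFE ID needs a path")
    for segment in segments:
        if segment in (".", "..") or not _PATH_SEGMENT.fullmatch(segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    return SpiffeId(parts.netloc, parts.path)


# Hypothetical identity for a Lambda-based application.
workload = parse_spiffe_id("spiffe://example.test/lambda/orders")
print(workload.trust_domain, workload.path)
```

A receiving service would typically compare the parsed identity against its access control list rather than trusting the certificate's subject fields.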
While we wanted to use SPIFFE-compatible certificates here as well, there is currently no support for serverless workloads in SPIRE. We implemented a solution ourselves that was tailored to Lambdas: it requires no workload registration, and certificates are issued ahead of time so as not to delay startup. Staying compliant with the standard meant no additional features had to be added to existing data center applications across all supported languages to accept connections from Lambdas.
As to granularity - in the data center we issue identity per application. We decided that Lambdas can be applications on their own, but also extensions of existing applications running on different infrastructure. This allows developers to break out parts of an application more easily. Each application runs in a separate AWS account, and we group all Lambdas in one account as the same application sharing identity.
This project depends on Registry, an internal application that serves as the source of truth for identity of applications, including a mapping of application name to AWS Account ID to environment. We rely on Registry to verify accounts are authorized to request certificate signing, and are operating in the correct environment.
This section gives an overview of the architecture of the identity issuance system, which is split into two components both of which interface with PCA. One component is responsible for requesting certificate issuance, and the other one consumes events from PCA via EventBridge to store issued certificates. To perform calls into the data center other components are necessary such as Mesh Proxy, a modified version of Envoy used to route requests. Mesh Proxy is out of scope for this article and will be described in a separate post.
At a high level, identity issuance is invoked every 30 minutes. It issues certificates that are valid for 24 hours and re-issues certificates if they are past their half-life. The entire system is implemented as a Lambda application where issuance operates in one account that includes Secrets Manager. Customer Lambdas are executed in their own accounts, but are granted access to their secrets in the central Secrets Manager.
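The renewal rule can be sketched in a few lines. This is an illustration of the schedule described here (24-hour validity, renewal past half-life, a run every 30 minutes), not the production code; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

VALIDITY = timedelta(hours=24)          # certificate lifetime from the text
CHECK_INTERVAL = timedelta(minutes=30)  # how often issuance is invoked


def past_half_life(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """Return True once the certificate has used up half of its validity."""
    midpoint = not_before + (not_after - not_before) / 2
    return now >= midpoint


def first_renewal_run(not_before: datetime, first_run: datetime) -> datetime:
    """Walk the 30-minute schedule until a run finds the certificate due."""
    run = first_run
    while not past_half_life(not_before, not_before + VALIDITY, run):
        run += CHECK_INTERVAL
    return run


issued = datetime(2024, 1, 1, 0, 10, tzinfo=timezone.utc)
run = first_renewal_run(issued, datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc))
remaining = issued + VALIDITY - run
print(run.isoformat(), remaining)
```

Because the check only runs every 30 minutes, a certificate is renewed somewhere between 12 and 12.5 hours after issuance, so the old certificate still has at least 11.5 hours of validity left when its replacement is requested.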
The following figure demonstrates issuance of certificates and making calls into the data center end to end. Some steps such as caching of responses and retry logic are omitted for clarity. The steps are described in more detail below.
We execute the following on a cron schedule to make sure certificates are always valid.
- The Cert Issuance component requests from Registry a list of all Lambda enabled applications with their account ID and IAM roles. Registry is an internal application responsible for identity of applications.
- The Cert Issuance component iterates through all available applications and reads stored certificates from Secrets Manager to check their expiration date.
- For applications where the remaining validity is below a configured threshold, a new private key and CSR are generated.
- The private keys are stored in Secrets Manager in a central account - allowing read access only to the account of the target Lambda.
- The CSR is submitted to PCA. These requests are asynchronous - after submitting all CSRs the Cert Issuance component exits.
- The EventBridge listener component is subscribed to PCA issuance events. These include the ARN (Amazon Resource Name) of the issued certificate.
- The EventBridge listener component writes the certificate to Secrets Manager. This write is executed to Secrets Manager in all regions where the customer Lambda is running.
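The steps above can be sketched as two small functions, one per component. All collaborators (Registry lookup, Secrets Manager reads and writes, key generation, PCA submission) are passed in as hypothetical callables so the control flow is visible without any AWS dependency; the names and signatures are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class App:
    name: str
    account_id: str


def run_cert_issuance(
    apps: Iterable[App],
    read_expiry: Callable[[App], Optional[datetime]],
    make_key_and_csr: Callable[[App], Tuple[bytes, bytes]],
    store_private_key: Callable[[App, bytes], None],
    submit_csr: Callable[[App, bytes], None],
    now: datetime,
    renew_below: timedelta,
) -> List[str]:
    """One scheduled run: renew certificates that are missing or near expiry."""
    renewed = []
    for app in apps:
        expiry = read_expiry(app)
        if expiry is not None and expiry - now >= renew_below:
            continue  # enough validity left, nothing to do
        private_key, csr = make_key_and_csr(app)
        store_private_key(app, private_key)  # key is stored before the CSR is sent
        submit_csr(app, csr)  # asynchronous: the run does not wait for the result
        renewed.append(app.name)
    return renewed


def on_certificate_issued(
    certificate_arn: str,
    fetch_certificate: Callable[[str], str],
    write_certificate: Callable[[str, str], None],
    regions: Iterable[str],
) -> None:
    """Listener side: fetch the issued certificate and copy it to every region."""
    pem = fetch_certificate(certificate_arn)
    for region in regions:
        write_certificate(region, pem)
```

Splitting the work this way means the issuance run never blocks on PCA: it only submits CSRs, and the listener picks up each result whenever the corresponding event arrives.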
For a Lambda to call into the Square data center, it goes through the following steps. These happen independently of certificate issuance. Much of the logic for calls is happening in a provided Lambda layer that abstracts work with certificates and brokers connections transparently to developers. The layer was developed by our internal cloud foundations team (details here). Mesh Proxy, developed by our internal traffic team, is a modified version of Envoy that performs layer 4 routing of mTLS traffic without terminating encryption. It will be described in a future article. Note that the startup cost for a Lambda to call into the data center is mostly reading from Secrets Manager.
- Lambdas can read their key and certificate from Secrets Manager where they will be available before launch - this is handled transparently to the developer.
- The provided Lambda layer offers a reverse HTTP proxy bound to the Lambda’s localhost that developers can make requests over. Requests are routed via mTLS to Mesh Proxy.
- Mesh Proxy routes the Lambda’s S2S call into the data center based on SNI, without terminating TLS.
The main security goals for this project are reducing exposure of private key material, maintaining auditability, and reducing the value of unauthorized access. AWS offers services that can be used as components towards building such a system.
Protecting the Root
AWS Private CA is a Hardware Security Module (HSM) backed Certificate Authority service that we can use without maintaining HSMs ourselves, as we do in the data center. The most valuable asset in this system is the private key of the certificate authority. If an attacker were to retrieve it, they would be able to issue arbitrary certificates to impersonate services at Square. This requires strong security guarantees.
We interact with PCA through the “IssueCertificate” API call, providing CSRs to sign, and receive responses via EventBridge. It is noteworthy that certificates are signed directly by the root. This may appear odd at first glance, but issuing an intermediate certificate to sign with would mean we risk exposing its key material should a third party gain access to it. A certificate that allows issuing further certificates outside of PCA would not be auditable, and loss of such credentials could go undetected.
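As a rough sketch, submitting a CSR with boto3 could look like the following. The `issue_certificate` operation and its `CertificateAuthorityArn`, `Csr`, `SigningAlgorithm`, and `Validity` parameters are part of the ACM PCA API; the signing algorithm, the one-day validity expressed as `DAYS`, and the helper names are assumptions for illustration, since the article does not state which values are used in production.

```python
from typing import Any, Dict


def build_issue_certificate_request(
    ca_arn: str, csr_pem: bytes, validity_days: int = 1
) -> Dict[str, Any]:
    """Build the keyword arguments for the ACM PCA IssueCertificate call."""
    return {
        "CertificateAuthorityArn": ca_arn,
        "Csr": csr_pem,
        # Assumption: an RSA CA key; an ECDSA CA would need e.g. SHA256WITHECDSA.
        "SigningAlgorithm": "SHA256WITHRSA",
        # 24-hour certificates, expressed as one day.
        "Validity": {"Value": validity_days, "Type": "DAYS"},
    }


def submit_csr(acmpca_client: Any, ca_arn: str, csr_pem: bytes) -> str:
    """Submit a CSR and return the ARN of the certificate being issued.

    `acmpca_client` is expected to be boto3.client("acm-pca"). The call returns
    before the certificate exists; the result is picked up from the issuance
    event later.
    """
    request = build_issue_certificate_request(ca_arn, csr_pem)
    response = acmpca_client.issue_certificate(**request)
    return response["CertificateArn"]
```

Keeping the request construction in a pure function makes it easy to unit test without touching AWS.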
Auditing of certificate issuance can be enabled for PCA. The Audit Report feature writes to a specified S3 bucket and can be collected for security alerting. As these reports are written directly by PCA, even if an intruder were able to issue certificates, their actions would be visible by analyzing the logs.
An alternative to PCA would be storing the private key in Secrets Manager, however, once the key is copied out to sign certificates, tracking access is not possible. In a situation where unauthorized access to the key happens, we couldn’t tell what certificates have been signed.
Reducing the Blast Radius
By reducing the value an attacker would gain from a system, we make it a less interesting target and put an upper bound on the damage that can be done. Two of the most important features in this respect are that all issued certificates are short-lived and only allow for fine-grained actions associated with the specific application. Note: this requires Access Control Lists (ACLs) to be enforced throughout the service mesh - which is not a feature of Lambda identity, but rather a property of Square’s security design for microservices.
In this system, we removed unnecessary access wherever possible to limit potential abuse. Keys are generated in a central account and written so the customer Lambda can access them in Secrets Manager. However, the central account is not able to read private keys after they are stored. Furthermore, private keys are only visible when a Lambda is invoked with their associated execution role, limiting exposure of keys to developers. PCA’s audit reports also reduce the value of gaining access to the security account. An intruder would not be able to create certificates arbitrarily without getting noticed.
Something we considered implementing, but ultimately didn’t, was generating private keys in the customer account and never exposing them to the central account that interacts with PCA. From a security perspective this seemed ideal. However, managing the generation component from a centralized repository would have led to the same supply chain exposure as a central key generation account. Leaving the management of updates to developers would have burdened them with yet another component that could lead to misconfigurations. After weighing the pros and cons of this option, we decided the security benefits were not worth the additional burden we would put on developers. Although security is important, our solution primarily has to provide availability.
We operate this system in two regions for redundancy purposes, in case some component becomes unavailable. The system writes from where it is executed to all regions where a Lambda is present. For the identity issuance system we differentiate between a primary and secondary region, where the secondary region will only start issuing certificates if the age of stored certificates has gone over a threshold that indicates malfunction of the primary region.
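The primary/secondary split can be illustrated with a single decision function. The 12-hour renewal point follows from the half-life rule for 24-hour certificates; the 14-hour failover threshold is a hypothetical value chosen for this sketch, as the article does not state the real one.

```python
from datetime import timedelta

RENEW_AFTER = timedelta(hours=12)     # half-life of a 24-hour certificate
FAILOVER_AFTER = timedelta(hours=14)  # hypothetical: age that signals a primary outage


def should_issue(is_primary: bool, certificate_age: timedelta) -> bool:
    """Decide whether this region should issue a new certificate.

    The primary renews at half-life. The secondary stays idle unless the stored
    certificate is older than a healthy primary would ever allow.
    """
    threshold = RENEW_AFTER if is_primary else FAILOVER_AFTER
    return certificate_age >= threshold


# A 13-hour-old certificate is due in the primary region, but the secondary
# still gives the primary time to catch up.
print(should_issue(True, timedelta(hours=13)), should_issue(False, timedelta(hours=13)))
```

Any failover threshold has to sit above the primary's renewal point plus one scheduling interval, and well below the 24-hour expiry, so the secondary neither issues needlessly nor acts too late.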
However, the way the system is designed, downtime of the system will not lead to immediate degradation. Even when unavailable for several hours it does not affect S2S calls as we issue certificates ahead of time with a 24h validity and refresh at half-life. Connectivity issues only arise for downtime of over 12 hours for applications that previously received certificates.
No article covering AWS would be complete without a pricing section. Cost for this project is divided into several components, such as Secrets Manager, but the most interesting and expensive one is PCA. The price consists of a fixed monthly cost of $400 and a per-certificate cost that gets cheaper with increasing use. The per-certificate price drops after 1,000 and again after 10,000 issued certificates a month, from $0.75 to $0.35 to $0.001.
This is also relevant for redundancy purposes. When operating two PCAs, the cost can vary strongly depending on how the load is balanced. Distributing certificate signing evenly across both PCAs will be more expensive. When all certificates are signed in one region and the other PCA is used as a backup that only signs during outages, the backup region will only cost $400 plus fees for the certificates signed while the main PCA was not available.
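A short calculation shows the effect of load distribution. It uses the prices quoted above and assumes the tiers are marginal, meaning each certificate is billed at the rate of the tier it falls into. The monthly volume of 6,000 certificates is a hypothetical figure (for example, 100 applications renewed twice a day for 30 days).

```python
from decimal import Decimal

MONTHLY_FEE = Decimal("400")
# (upper bound of tier in certificates per month, price per certificate)
TIERS = [
    (1_000, Decimal("0.75")),
    (10_000, Decimal("0.35")),
    (None, Decimal("0.001")),  # everything above 10,000
]


def monthly_cost(certificates: int) -> Decimal:
    """Cost of one PCA for a month, assuming marginal tier pricing."""
    cost = MONTHLY_FEE
    remaining = certificates
    lower = 0
    for upper, price in TIERS:
        size = remaining if upper is None else min(remaining, upper - lower)
        cost += size * price
        remaining -= size
        if upper is not None:
            lower = upper
    return cost


def two_region_cost(primary_certs: int, secondary_certs: int) -> Decimal:
    """Total monthly cost of running one PCA in each of two regions."""
    return monthly_cost(primary_certs) + monthly_cost(secondary_certs)


print("primary/backup:", two_region_cost(6_000, 0))
print("even split:    ", two_region_cost(3_000, 3_000))
```

Under these assumptions the primary/backup layout costs $3,300 a month and the even split costs $3,700, because splitting the load keeps more certificates in the expensive first tier of each PCA.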
Square is in the process of moving from the data center to the cloud, and as part of this effort we're enabling developers to use AWS Lambda. In this article we discussed how we enabled authenticated calls from Lambda into the data center. We want developers to be able to call microservices the same way as within the data center. To establish compatibility we use SPIFFE identity. Because existing SPIRE tooling does not support serverless workloads, we used a combination of components that are made available by AWS and implemented the rest ourselves. The certificate root is backed by an HSM in AWS Private CA, and we reduce the blast radius of the central security account by not allowing it to read private keys for workloads. This service runs in production today.
This project was a wide-ranging effort from engineering teams across Square, especially Cloud Foundations, Traffic, and Infrastructure Security teams. Within the Security Infrastructure team Roy Xu contributed strongly.