Airbnb has been migrating its infrastructure to a Service Oriented Architecture (“SOA”). SOA offers many upsides, such as enabling developer specialization and the ability to iterate faster. However, it also poses challenges for billing and payments applications because it makes it more difficult to maintain data integrity. An API call to a service that makes further API calls to downstream services, where each service changes state and potentially has side effects, is equivalent to executing a complex distributed transaction.
To ensure consistency among all services, protocols such as two-phase commit might be used. Without such a protocol, distributed transactions present challenges to maintaining data integrity, allowing graceful degradation, and achieving consistency. Requests also inevitably fail within distributed systems — connections will drop and timeout at some point, especially for transactions that consist of multiple network requests.
There are three different common techniques used in distributed systems to achieve eventual consistency: read repair, write repair, and asynchronous repair. There are benefits and trade-offs to each approach. Our payments system uses all three in various functions.
Asynchronous repair involves the server being responsible for running data consistency checks, such as table scans, lambda functions, and cron jobs. Additionally, asynchronous notifications from the server to the client are widely used in the payments industry to force consistency on the client side. Asynchronous repair, along with notifications, can be used in-conjunction with read and write repair techniques, offering a second line of defense with trade-offs in solution complexity.
Our solution in this particular post utilizes write repair, where every write call from the client to the server attempts to repair an inconsistent, broken state. Write repair requires clients to be smarter (we’ll expand on this later), and allows them to repeatedly fire the same request and never have to maintain state (aside from retries). Clients can thus request eventual consistency on-demand, giving them control over the user experience. Idempotency is an extremely important property when implementing write repair.
For an API request to be idempotent, clients can make the same call repeatedly and the result will be the same. In other words, making multiple identical requests should have the same effect as making a single request.
This technique is commonly used in billing and payment systems involving money movement — it is crucial a payments request gets processed completely exactly once (also known as “exactly-once delivery”). Importantly, if a single operation to move money is called multiple times, the underlying system should move money at most once. This is critical for Airbnb Payments APIs in order to avoid multiple payouts to the host, and even worse, multiple charges to the guest.
By design, idempotency safely allows multiple identical calls from clients using an auto-retry mechanism for an API to achieve eventual consistency. This technique is common among client-server relationships with idempotency, and something that we use in our distributed systems today.
At a high level, the below diagram illustrates some simple, example scenarios with duplicate requests and ideal idempotent behavior. No matter how many charge requests are made, the guest will always be charged at most once.
Guaranteeing eventual consistency for our payments system is of the utmost importance. Idempotency is a desirable mechanism to achieve this in a distributed system. In an SOA world, we will inevitably run into problems. For example, how would clients recover if it failed to consume the response? What if the response was lost or the client timed out? What about race-conditions resulting in a user clicking “Book” twice? Our requirements included the following:
- Instead of implementing a single, custom solution specific for a given use case, we needed a generic yet configurable idempotency solution to be used across Airbnb’s various Payments SOA services.
- While SOA-based payment products were being iterated on, we couldn’t compromise on data consistency since this would impact our community directly.
- We needed ultra low latency, so building a separate, stand-alone idempotency service wouldn’t be sufficient. Most importantly, the service would suffer from the same problems it was originally intended to solve.
- As Airbnb is scaling its engineering organization using SOA, it would be highly inefficient to have every developer specialize on data integrity and eventual consistency challenges. We wanted to shield product developers from these nuisances to allow them to focus on product development and iterate faster.
Additionally, considerable trade-offs with code readability, testability and ability to troubleshoot were all considered non-starters.
We wanted to be able to identify each incoming request uniquely. Additionally, we needed to accurately track and manage where a specific request was in its lifecycle.
We implemented and utilized “Orpheus”, a general-purpose idempotency library, across multiple payments services. Orpheus is the legendary Greek mythological hero who was able to orchestrate and charm all living things.
We chose a library as a solution because it offers low latency while still providing clean separation between high-velocity product code and low-velocity system management code. At a high level, it consists of the following simple concepts:
- An idempotency key is passed into the framework, representing a single idempotent request
- Tables of idempotency information, always read and written from a sharded master database (for consistency)
- Database transactions are combined in different parts of the codebase to ensure atomicity, using Java lambdas
- Error responses are classified as “retryable” or “non-retryable”
We’ll detail how a complex, distributed system with idempotency guarantees can become self-healing and eventually consistent. We’ll also walk through some of the trade-offs and additional complexities from our solution that one should be mindful of.
One of the key requirements in an idempotent system is to produce only two outcomes, success or failure, with consistency. Otherwise, deviations in data can lead to hours of investigation and incorrect payments. Because databases offer ACID properties, database transactions can be effectively used to atomically write data while ensuring consistency. A database commit can be guaranteed to succeed or fail as a unit.
Orpheus is centered around the assumption that almost every standard API request can be separated into three distinct phases: Pre-RPC, RPC, and Post-RPC.
An “RPC”, or Remote Procedural Call(s), is when a client makes a request to a remote server and waits for that server to finish the requested procedure(s) before resuming its process. In the context of payments APIs, we refer to an RPC as a request to a downstream service over a network, which can include external payments processors and acquiring banks. In brief, here is what happens in each phase:
- Pre-RPC: Details of the payment request are recorded in the database.
- RPC: The request is made live to the external processor and the response is received. This is a place to do one or more idempotent computations or RPCs (for example, query service for the status of a transaction first if it’s a retry-attempt).
- Post-RPC: Details of the response from the external service are recorded in the database, including its successfulness and whether a bad request is retryable or not.
To maintain data integrity, we adhere to two simple ground rules:
- No service interaction over networks in Pre and Post-RPC phases
- No database interactions in the RPC phases
We essentially want to avoid mixing network communication with database work. We’ve learned the hard way that network calls (RPCs) during the Pre and Post-RPC phases are vulnerable and can result in bad things like rapid connection pool exhaustion and performance degradation. Simply put, network calls are inherently unreliable. Because of this, we wrapped Pre and Post-RPC phases in enclosing database transactions initiated by the library itself.
We also want to call out that a single API request may consist of multiple RPCs. Orpheus does support multi-RPC requests, but in this post we wanted to illustrate our thought process with only the simple single-RPC case.
As shown in the example diagram below, every database commit in each of the Pre-RPC and Post-RPC phases is combined into a single database transaction. This ensures atomicity — entire units of work (here the Pre-RPC and Post-RPC phases) can fail or succeed as a unit consistently. The motive is that the system should fail in a manner it could recover from. For example, if several API requests failed in the middle of a long sequence of database commits, it would be extremely difficult to systematically keep track of where each failure occurred. Note that the all network communication, the RPC, are explicitly separated from all database transactions.
A database commit here includes an idempotency library commit and application layer database commits, all combined in the same code block. Without being careful, this could actually begin to look really messy in real code (spaghetti, anyone?). We also felt that it shouldn’t be the responsibility of the product developer to call certain idempotency routines.
Thankfully, Java lambda expressions can be used to combine multiple sentences into a single database transaction seamlessly, with no impact to testability and code readability.
Here is an example, simplified usage of Orpheus, with Java lambdas in action:
At a deeper level, here is a simplified excerpt from the source code:
The separation of these concerns does offer some trade-offs. Developers must use forethought to ensure code readability and maintainability as other new ones constantly contribute. They also need to consistently evaluate that proper dependencies and data get passed along. API calls are now required to be refactored into three smaller chunks, which could arguably be restrictive on the way developers write code. It might actually be really difficult for some complex API calls to effectively be broken down into a three-step approach. One of our services has implemented a Finite State Machine with every transition as an idempotent step using StatefulJ, where you could safely multiplex idempotent calls in an API call.
With a framework like Orpheus, the server should know when a request is safe to retry and when it isn’t. For this to happen, exceptions should be handled with meticulous intention — they should be categorized as either “retryable” or “non-retryable”. This undoubtedly adds a layer of complexity for developers and could create bad side-effects if they are not judicious and prudent.
For example, suppose a downstream service was offline temporarily, but the exception raised was mistakenly labeled as “non-retryable” when it really should have been “retryable”. The request would be “failed” indefinitely, and subsequent retry requests would perpetually return the incorrect non-retryable error. Conversely, double payments could occur if an exception was labeled “retryable” when it actually should have been “non-retryable” and requires manual intervention.
In general, we believe unexpected runtime exceptions due to network and infrastructure issues (5XX HTTP statuses) are retryable. We expect these errors to be transient, and we expect that a later retry of the same request may eventually be successful.
We categorize validation errors, such as invalid input and states (for example, you can’t refund a refund), as non-retryable (4XX HTTP statuses) — we expect all subsequent retries of the same request to fail in the same manner. We created a custom, generic exception class that handled these cases, defaulting to “non-retryable”, and for certain other cases, categorized as “retryable”.
It is essential that request payloads for each request remain the same and are never mutated, otherwise it would break the definition of an idempotent request.
There are of course more vague edge cases that need to be handled with care, such as handling a
NullPointerException appropriately in different contexts. For example, a
null value returned from the database due to a connectivity blip is different from an erroneous
null field in a request from a client or from a third party response.
As alluded to at the beginning of this post, the client must be smarter in a write repair system. It must own several key responsibilities when interacting with a service that uses an idempotency library like Orpheus:
- Pass in a unique idempotency key for every new request; reuse the same idempotency key for retries.
- Persist these idempotency keys to the database before calling the service (to later use for retries).
- Properly consume successful responses and subsequently unassign (or nullify) idempotency keys.
- Ensure mutation of the request payload between retry attempts is not allowed.
- Carefully devise and configure auto-retry strategies based on business needs (using exponential backoff or randomized wait times (“jitter”) to avoid the thundering herd problem).
Choosing an idempotency key is crucial — the client can choose either to have request-level idempotency or entity-level idempotency based on what key to use. This decision to use one over the other would depend on different business use-cases, but request-level idempotency is the most straightforward and common.
For request-level idempotency, a random and unique key should be chosen from the client in order to ensure idempotency for the entire entity collection level. For example, if we wanted to allow multiple, different payments for a reservation booking (such as Pay Less Upfront), we just need to make sure the idempotency keys are different. UUID is a good example format to use for this.
Entity-level idempotency is far more stringent and restrictive than request-level idempotency. Say we want to ensure that a given $10 payment with ID
1234 would only be refunded $5 once, since we can technically make $5 refund requests twice. We would then want to use a deterministic idempotency key based on the entity model to ensure entity-level idempotency. An example format would be
“payment-1234-refund”. Every refund request for a unique payment would consequently be idempotent at the entity-level (
Multiple identical requests can be fired due to multiple user-clicks or if the client has an aggressive retry policy. This could potentially create race conditions on the server or double payments for our community. To avoid these, API calls, with the help of the framework, each need to acquire a database row-level lock on an idempotency key. This grants a lease, or a permission, for a given request to proceed further.
A lease comes with an expiration to cover the scenario where there are timeouts on the server side. If there is no response, then an API request can be retryable only after the current lease has expired. The application can configure the lease expiration and RPC timeouts according to their needs. A good rule of thumb is to have a higher lease expiration than the RPC timeout.
Orpheus additionally offers a maximum retryable window for an idempotency key to provide a safety net in order to avoid rogue retries from unexpected system behavior.
We also record responses, to maintain and monitor idempotent behavior. When a client makes the same request for a transaction that has reached a deterministic end-state, such as a non-retryable error (validation errors, for example) or a successful response, the response is recorded in the database.
Persisting responses does have a performance trade-off — clients are able to receive quick responses on subsequent retries, but this table will have growth proportional to the growth of the application’s throughput. This table can quickly become bloated the table if we’re not careful. One potential solution is to periodically remove rows older than a certain timeframe, but removing an idempotent response too early has negative implications, too. Developers should also be wary not to make backwards-incompatible changes to the response entities and structure.
When reading and writing idempotency information with Orpheus, we chose to do this directly from the master database. In a system of distributed databases, there is a tradeoff between consistency and latency. Since we couldn’t tolerate high latency or reading uncommitted data, using master for these tables mades the most sense for us. In doing so, there is no need for using a cache or a database replica. If a database system is not configured for strong read-consistency (our systems are backed by MySQL), using replicas for these operations could actually create adverse effects from an idempotency perspective.
For example, suppose a payments service stored its idempotency information on a replica database. A client submits a payment request to the service, which ends up being successful downstream, but the client doesn’t receive a response due to a network issue. The response, currently stored on the service’s master database, will eventually be written to the replica. However, in the event of replica lag, the client could correctly fire an idempotent retry to the service and the response would not be recorded to the replica yet. Because the response “does not exist” (on the replica), the service could mistakenly execute the payment again, resulting in duplicate payments. The below example illustrates how just a few seconds of replica lag could cause significant financial impact to Airbnb’s community.
When using a single master database for idempotency, it was pretty clear that scaling would undoubtedly and quickly becomes an issue. We alleviated this by sharding the database by idempotency key. The idempotency keys we use have high cardinality and even distribution, making them effective shard keys.
Achieving eventual consistency does not come without introducing some complexity. Clients need to store and handle idempotency keys and implement automated retry-mechanisms. Developers require additional context and must be surgically precise when implementing and troubleshooting Java lambdas. They must be deliberate when handling exceptions. Additionally, as the current version of Orpheus is battle-tested, we are continuously finding things to improve on: request-payload matching for retries, improved support for schema changes and nested migrations, actively restricting database access during RPC phases, and so on.
While these are considerations at top of mind, where has Orpheus gotten Airbnb Payments so far? Since the launch of the framework, we have achieved five nines in consistency for our payments, while our annual payment volume has simultaneously doubled (read this if you’d like to learn more on how we measure data integrity at scale).
If you’re interested in working on the intricacies of a distributed payments platform and helping travelers around the world belong anywhere, the Airbnb Payments team is hiring!
Shoutout to Michel Weksler and Derek Wang for their thought leadership and architectural philosophy on this project!