June 10, 2021
Amazon’s Aurora MySQL is Plaid’s primary relational database. It has proven to be a reliable and scalable storage layer that backs some of our most critical systems.
Back in 2019, however, we noticed severe performance degradation on the primary instance in one of our Aurora clusters. Our oncall runbooks suggested checking load on the primary instance using our Prometheus metrics and RDS performance insights, yet everything looked normal. We kept digging, and eventually found a clue: an errant long-running transaction had been open on one of our read replicas for weeks.
Surprisingly, killing the transaction mitigated the performance degradation—but we didn’t understand why. Soon after, we began to notice echoes of the same issue. Every time an Airflow job would ETL data from one of our read replicas to our data warehouse, we would see a small performance hit. This behavior seemed to violate one of our core assumptions about our system based on our knowledge of non-Aurora MySQL setups: it should not be possible, we thought, for read replica load to have any impact on writer performance.
It was clear that we needed to read the manual: we dug into the Aurora DB design paper, Aurora documentation, MySQL documentation, and several articles on databases. We were eventually able to identify the cause of the degradation and a handful of best practices to avoid it. But before we get to the solution, let's get back to databases 101.
Transaction isolation in vanilla MySQL
In production databases, there are often hundreds or thousands of operations being performed concurrently. If all of these operations worked naively with the same data, application developers would have to defend against an intractable number of potential race conditions, and working with these databases would be prohibitively difficult. Multiversion concurrency control, or MVCC, significantly simplifies this picture by allowing each operation to behave as if it is the only operation running on the database.
To achieve this end, MVCC introduces the concept of database transactions. When an application developer begins a new transaction, they can safely assume that queries within the transaction can only read changes from transactions that were already completed when the transaction in question began. Consider the following example:
Figure: a set of transactions and their operations over time.
The “transaction isolation level” of a database controls how aggressively it will protect against inconsistent state in this scenario. By default, Aurora replicas will use the REPEATABLE READ isolation level. In this example, REPEATABLE READ means that transaction 2 should not be able to read the state modified by operation 1, because transaction 2 starts before transaction 1 is committed. Transaction 3, on the other hand, should be able to read the state modified by operation 1, but should not be able to see the partial transaction state caused by operations 2, 3, or 4.
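The visibility rule above can be sketched with a toy model. This is an illustration only: logical timestamps stand in for InnoDB's real transaction IDs and read views, and the class names are ours, not MySQL's.

```python
# Toy model of REPEATABLE READ snapshot visibility.
class Database:
    def __init__(self):
        self.clock = 0    # logical timestamp
        self.commits = {} # txn id -> commit timestamp

    def tick(self):
        self.clock += 1
        return self.clock

    def begin(self):
        return Snapshot(self, self.tick())

    def commit(self, txn_id):
        self.commits[txn_id] = self.tick()

class Snapshot:
    def __init__(self, db, start_ts):
        self.db, self.start_ts = db, start_ts

    def can_see(self, txn_id):
        # A snapshot sees only transactions that committed before it began.
        commit_ts = self.db.commits.get(txn_id)
        return commit_ts is not None and commit_ts < self.start_ts

db = Database()
t1 = db.tick()            # transaction 1 begins and writes
snap2 = db.begin()        # transaction 2 begins before transaction 1 commits
db.commit(t1)             # transaction 1 commits
snap3 = db.begin()        # transaction 3 begins after transaction 1 commits

print(snap2.can_see(t1))  # False: committed after snapshot 2 started
print(snap3.can_see(t1))  # True: committed before snapshot 3 started
```

The key property is that visibility is fixed at the moment a snapshot is taken, which is exactly why the database must be able to reconstruct old row states on demand.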
In practice, these transactional guarantees are implemented using “undo logs”. When a transaction updates a row, an undo log entry is generated which can be used to recreate the prior state of the row by undoing that update. Returning to the example above: if transaction 3 attempts to read a row modified by operation 2, it will use the undo log to read the prior state of that row, from before transaction 2 started.
At any time, any transaction could read any row in the database. This requires that the database maintain undo logs at least as far back as the earliest time it might need to read data from: the start time of the oldest open transaction. One relatively well-understood consequence of this requirement is that long-running transactions can cause performance degradation for concurrent transactions on the same instance.
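That retention requirement can be made concrete with a small sketch. The purge logic here is a simplification we made up for illustration: real InnoDB purge operates on rollback segments and transaction IDs, not plain timestamps.

```python
# Sketch: the oldest open transaction bounds how far undo logs can be purged.
def purge_horizon(open_txn_starts):
    """Undo entries at or after this timestamp must be kept."""
    return min(open_txn_starts) if open_txn_starts else float("inf")

def purge(undo_log, open_txn_starts):
    """Drop undo entries that no open transaction could ever need."""
    horizon = purge_horizon(open_txn_starts)
    return [entry for entry in undo_log if entry["ts"] >= horizon]

undo_log = [{"ts": ts, "row": "r1"} for ts in range(1, 101)]
print(len(purge(undo_log, [95, 99])))  # 6: only recent history retained
print(len(purge(undo_log, [2, 99])))   # 99: one stale txn retains almost all
```

A single long-running transaction pins the horizon far in the past, and every other transaction pays the cost of the retained history.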
This impact can be especially significant on the performance of queries that are heavily dependent on indices. This is because, when a transaction deletes a row in a table, the row can’t be physically removed from the database until there are no remaining transactions that started when that row was extant. The same is true of the index entries associated with the row in question.
To better understand this impact, consider a transaction that performs a range query over a hot portion of an index which changes frequently. Rows which have been deleted but not yet physically purged still occupy space in the index: as such, the query must spend more time traversing the index (which is organized as a B-tree) to assemble its results, as it must ignore index entries which are “useless” (because they correspond to rows which are not extant from the perspective of the transaction in which the query is executed).
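A toy range scan shows the asymmetry. The flat list below stands in for a B-tree leaf level, and the delete-marking scheme is a simplification of InnoDB's actual behavior.

```python
# Toy range scan over an index containing delete-marked, unpurged entries.
def range_scan(index, lo, hi):
    examined, results = 0, []
    for key, deleted in index:
        if lo <= key <= hi:
            examined += 1
            if not deleted:  # skip "useless" delete-marked entries
                results.append(key)
    return examined, results

# 1000 keys in the hot range; 90% are deleted but not yet purged.
index = [(k, k % 10 != 0) for k in range(1000)]
examined, results = range_scan(index, 0, 999)
print(examined, len(results))  # 1000 entries examined for only 100 live rows
```

The query's cost scales with the entries it must examine, not the rows it returns, so unpurged garbage directly inflates query latency.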
Our use case was especially susceptible to this particular failure mode: our system depended on being able to execute performant queries over an indexed table in which rows were frequently being inserted, updated, and deleted. We believed (incorrectly!) that this performance concern could be mitigated by avoiding long-running queries on our primary instance. If our system were built on vanilla MySQL, this would have been the case, as the performance impacts of long-running queries are contained to the instance on which the transaction is operating.
In the next section, we show that this isn't always the case for Aurora MySQL.
Persistence and Replication in Vanilla vs. Aurora MySQL
While Aurora and vanilla MySQL agree on the requirements of MVCC, they differ significantly in their persistence and replication models, and as a result differ in their implementation of MVCC.
Compute, RAM, and storage are collocated on the same instance in vanilla MySQL. When a transaction updates a row, a few things happen: the row is updated in memory, an undo log entry is generated, and a “redo log” entry describing the change is persisted to disk. (Redo logs are not hugely important to our discussion of vanilla MySQL, but they play a much more prominent role in Aurora MySQL.)
MySQL uses “pages” as the unit of data that is read from and written to persistent storage. When a row in a page is accessed by a query, the database engine reads the whole page into memory, where it can be operated on performantly; if that page is updated, it must be written back to persistent storage at some point. The database engine periodically flushes in-memory pages to disk by way of a background CHECKPOINT operation, which is primarily how data is persisted. Redo logs are a backup mechanism, intended to minimize data loss during a crash scenario: the database can "redo" operations on top of the latest checkpoint to bring the database back to where it was before the crash.
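The checkpoint-plus-redo recovery idea reduces to a simple replay. This sketch uses a plain dict for page state and (key, value) pairs for redo entries, which is far simpler than InnoDB's actual page-oriented redo format.

```python
# Toy crash recovery: replay redo-log entries on top of the last checkpoint.
def recover(checkpoint, redo_log):
    """Rebuild post-crash state by re-applying logged changes in order."""
    state = dict(checkpoint)
    for key, value in redo_log:
        state[key] = value
    return state

checkpoint = {"row1": "a", "row2": "b"}     # pages flushed before the crash
redo_log = [("row2", "b2"), ("row3", "c")]  # changes logged but not flushed
print(recover(checkpoint, redo_log))  # {'row1': 'a', 'row2': 'b2', 'row3': 'c'}
```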
Replication in vanilla MySQL is accomplished using “binary logs” which describe modifications to data. These are sent from the primary instance to replica instances. The replicas use the binary log to duplicate changes made on the primary instance, and persist them to (replicated) durable storage. While it is possible for replicas to fail to keep pace with the binary log and fall behind the primary (a phenomenon known as “replication lag”), it is not possible for replica queries to materially impact the performance of the primary instance, because the binary log is a one-way communication channel from primary to replica.
Figure: the various elements of the vanilla MySQL persistence and replication scheme, shown here with two read replicas.
One of the chief selling points of Aurora MySQL is that compute (query execution) and storage (durably persisting data to disk) happen on different instances. Developers provision compute manually, while Aurora automatically provisions and scales its own storage. This decoupling makes it easier to tailor a deployment’s provisioning to a particular workload, but is a significant departure from vanilla MySQL, in which these responsibilities are collocated on individual instances.
This difference has significant implications for both persistence and replication: if an Aurora primary instance is not responsible for flushing modified pages to disk, how are changes persisted durably? Using redo logs. While redo logs are a relatively minor backup mechanism in vanilla MySQL, in Aurora, redo logs take a much more central role: as the Aurora designers put it, “the log is the database”.
In fact, both persistence and replication in Aurora are accomplished using the redo log.
Persistence: the primary instance sends redo logs to the storage layer, which consists of some number of storage nodes. These storage nodes are responsible for aggregating redo log entries, achieving quorum, and persisting the resulting changes to disk.
Replication: the primary instance also sends redo logs to any replicas. This allows replicas to guarantee that their state is up-to-date for pages which have been read into memory without needing to consult the storage layer. Note that redo logs will only affect in-memory pages – readers do not persist any pages or redo logs to disk.
Crucially, whenever the primary or any replica instance needs to retrieve a page that it doesn’t already have in memory, they must consult the same shared storage layer. The transactional guarantees of MVCC are satisfied using undo log entries, which reside on the storage layer. Put explicitly: in Aurora, readers and writers share the same set of extant undo log entries.
Figure: the various elements of the Aurora MySQL persistence and replication scheme, shown here with two read replicas.
Note: this omits some detail around the storage layer, which actually consists of storage compute nodes that translate redo logs into pages for writing to the volume.
Finally, we can return to our old friend, the undo log. Like other data, undo log entries must be paged in and out of memory as required by the transactions operating on the database.
This all leads us to a surprisingly simple conclusion: because Aurora MySQL primary and replica instances share a storage layer, they share a set of undo logs. This means that, for a REPEATABLE READ isolation level, the storage instance must maintain undo logs at least as far back as could be required to satisfy transactional guarantees for the primary or any read replica instance. Long-running replica transactions can negatively impact writer performance in Aurora MySQL—finally, an explanation for the incident that spawned this investigation.
The same scenario plays out differently in vanilla MySQL because of its different model for undo logs.
Vanilla MySQL: there are two undo logs – one on the writer, and one on the reader. The performance impact of an operation that prevents the garbage collection of undo log records will be isolated to either the writer or the reader.
Aurora MySQL: there is a single undo log that is shared between the writer and reader. The performance impact of an operation that prevents the garbage collection of undo log records will affect the entire cluster.
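The difference between the two models comes down to how the purge horizon is computed. This sketch reuses the timestamp simplification from earlier; the instance names and numbers are invented for illustration.

```python
# Each instance's oldest open transaction, as a logical timestamp.
oldest_open_txn = {"writer": 1000, "replica": 200}  # replica has a stale txn

# Vanilla MySQL: independent undo logs, so independent purge horizons.
vanilla_horizons = dict(oldest_open_txn)

# Aurora MySQL: one shared undo log, so one horizon for the whole cluster.
aurora_horizon = min(oldest_open_txn.values())

print(vanilla_horizons)  # writer purges up to 1000 despite the stale replica
print(aurora_horizon)    # 200: the replica's transaction holds back everyone
```

In the vanilla case the replica's stale transaction only bloats the replica's own undo log; in the Aurora case it pins undo history for the writer too.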
Figure: performance impacts of long-running read replica transactions in vanilla MySQL and Aurora MySQL. Degraded performance is indicated in red.
We had an enlightening discussion with our AWS support team and an Aurora engineer (thanks AWS!) in which we were able to confirm our understanding of the issue and get a full rundown of the suggested mitigation techniques:
Reduce the transaction isolation level for replica instances. Aurora allows read replicas to be configured with different isolation levels than primary instances; if your read workload allows it, reducing replica isolation levels could be one of the simplest ways to partially or entirely mitigate this problem. In fact, the Aurora documentation suggests using the READ COMMITTED isolation level to avoid this specific performance issue—but note that REPEATABLE READ isolation is still the default for replicas! (Unfortunately for us, the READ COMMITTED option was actually introduced a few months after we first encountered our performance problems, so identifying and mitigating the issue was a bit trickier at the time.)
Keep tabs on the “rollback segment history list length”. This length grows as undo log records accumulate because long-running transactions prevent them from being purged. This metric is emitted as “RollbackSegmentHistoryListLength” in CloudWatch.
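The same number is also visible from SQL: it appears as a “History list length” line in the output of SHOW ENGINE INNODB STATUS. A small parser for that line might look like the following; the sample text is an abridged stand-in for real status output.

```python
import re

def history_list_length(innodb_status: str) -> int:
    """Extract the history list length from SHOW ENGINE INNODB STATUS text."""
    match = re.search(r"History list length (\d+)", innodb_status)
    if match is None:
        raise ValueError("history list length not found in status output")
    return int(match.group(1))

# Abridged sample of the TRANSACTIONS section of the status output.
sample = """
TRANSACTIONS
------------
Trx id counter 123456
History list length 2371
"""
print(history_list_length(sample))  # 2371
```

Alerting when this value climbs steadily is a cheap early warning that something is holding back undo purge.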
Perform potentially long-running reader queries on binlog replicas. It is possible to configure binlog replication in Aurora. Just as in vanilla MySQL, queries on binlog Aurora replicas will not negatively impact primary performance.
Export snapshots to S3 as Apache Parquet files. Depending on the freshness requirements of your analytics use case, periodic DB snapshots may be acceptable. Aurora supports exporting snapshots as Parquet files which can be analyzed using Amazon Athena, EMR, or SageMaker.
Use a clone to handle long-running reads. This allows you to continue using a MySQL tool set without worrying about the performance impact of your read queries.
When something doesn't work the way we expect it to, Plaid engineers take pride in digging in deeper. Do you do the same? We're hiring!