Move fast and migrate things: how we automated migrations inĀ Postgres

By Vineet Gopal

At Benchling, we’re building a platform to help scientists do research. Hundreds of thousands of scientists across academia and enterprise clients use Benchling to store and analyze scientific data, assemble DNA sequences, and design experiments.

Over the last three years, we’ve built out several new product modules as we’ve grown our user base over 10×, so we’ve been constantly iterating on our product and data model. In that time, we’ve run over 800 different migrations.¹ We host a separate database for many of our larger customers, so these migrations have also been run across 100s of databases.

We’ve spent the last two years automating and improving our migration process to address key issues we were having — manual intervention, backwards compatibility, correctness, and performance. This post dives into the problems we ran into and highlights some learnings and tools we made along the way.

(See Lessons learned for the summary)

Our database is Postgres (9.6). We use the ORM SQLAlchemy and its companion migration tool Alembic.

We have two ways of defining the schemas in our database: declaratively (SQLAlchemy models) and imperatively (Alembic migrations). To change a schema, we must first change the model code, then write a migration to match those changes.

Example:

# SQLAlchemy model
class User(Model):
# User's research institution
institution = db.Column(db.String(1024))

# Alembic migration
op.add_column("users", Column("institution", String(1024)))

Alembic has a feature to auto-generate this migration based on model definitions. While it doesn’t support all types of migrations, such as some check constraints, this saved us a lot of time. So we started writing migrations by auto-generating them, completing the schema changes that Alembic didn’t support, and sending for code review.

Both migration author and reviewer checked the migration was correct and backward-compatible with the old model code. They also checked that it was safe: limited to only inexpensive operations so we wouldn’t need to schedule downtime with our customers. We used references such as Braintree’s post on how to perform schema-changing operations this way.

After passing code review, the migration was ready to merge and run.

We used to run migrations manually. But that didn’t scale with hundreds of databases, so we designed running migrations into our deployment process.

We decided to support automatically running migrations both before and after deploying the server code that included the models. Additive changes like adding a column were made in the pre-deploy migrations (before the model code that needed it was deployed). Destructive changes like removing a column were made in the post-deploy migrations (so code would stop using the column before it was removed from the database).

When deploying a new set of commits to production, we’d:

  1. Check production database version counter (via Alembic) to determine which pre-deploy migrations from the new commits are missing. Run each migration individually inside a transaction.
  2. Deploy code to product servers.
  3. Repeat the check for post-deploy migrations, and run them.

It worked well — until it didn’t, and we had to iterate.

Even when a migration was entirely correct and safe, it could still cause downtime. Investigating the running and waiting queries with pg_stat_activity revealed the reason to us: locking.

When we first wrote the migration to add the nullable institution column to the users table, we determined it was safe because adding a nullable column is backward-compatible and fast. However, when we ran the migration, user requests started failing. The investigation showed that it had interleaved with 2 transactions that were reading from the same table:

The first SELECT was in a long-running transaction in a cron job that ran for 5 minutes. The additive migration was waiting to acquire an ACCESS EXCLUSIVE lock on the same table, so it was blocked. The second SELECT in a user request was waiting to read from the table, so it was blocked by the migration.

Even though the add column operation is fast, Postgres was waiting for the exclusive lock. Normally the SELECTs between a cron job and user request wouldn't conflict, but in this case, all subsequent user requests were blocked for minutes until the long-running transaction and migration were done. Our “safe” migration still caused downtime, so we looked for a fix we could build into the deployment process.

Postgres has two configuration options that we saw as fail-safes for a runaway migration:

  1. lock_timeout: the maximum amount of time the transaction will wait while trying to acquire a lock before erroring and rolling back
  2. statement_timeout: the maximum amount of time any statement in the transaction can take before erroring and rolling back

We set a default lock_timeout of 4 seconds and statement_timeout of 5 seconds for migrations. This limited migrations from blocking user requests by waiting for too long or running expensive queries. Neither helped the migration succeed, but they helped ensure the migration failed gracefully without affecting our users.

Even though we had a deploy system that was running migrations automatically, we still found ourselves having to manually intervene fairly often to get them through. These manual interventions fell into two categories.

After adding migration timeouts, our migrations were safe to run and could fail gracefully without causing user issues — but failing meant an engineer had to deal with it. Migrations that were touching hot tables or running when cron jobs or lots of users were online were likely to run into lock timeout issues and fail. We usually investigated and attempted one of the following strategies to get each through:

  1. Check if the migration was safe to rerun, and if so, retry the migration manually to see if we just got “unlucky”
  2. Investigate what locks we were being blocked by, and possibly shut down a cron system for some period of time
  3. Wait till a better time (e.g. less usage) to run the migration

We found that #2 and #3 happened quite often, and put some time into making our P99 request time significantly lower. However, even with those improvements, we still found ourselves manually rerunning migrations on hot tables a few times before they succeeded.

So we built out the infrastructure to automatically retry “safe migrations”. At the surface, this was scary — having an automated system rerunning migrations that are touching the core data of our application was not a light decision. We added a few safeguards to help us be safe:

  • We only automatically retry migrations that have no intermediate commits (since the entire migration is in a transaction automatically, this means failed migrations are safe to retry).
  • We wait 2 minutes between retries to give any systems time to recover (and engineers a bit of time to respond in case anything goes horribly wrong).
  • We retry at most 10 times.

In 3 months of running this in production, we have seen all of our migrations go through successfully without any manual work.

While post-deploy migrations gave full flexibility to the developer (e.g. dropping a column in a single deploy), they also resulted in these problems:

  • Developers had to specify the right type for each migration, which was one more opportunity to make a mistake
  • We often had to reorder migrations when a deploy included multiple migrations, and a “post-deploy” migration appeared before “pre-deploy” migration³

We found that with just pre-deploy migrations, we were able to remove a lot of this complexity and room for error. We removed post-deploy migrations from our system.

Despite auto-generation and automatic retries, it was still entirely possible for a developer to (accidentally) write and run a backward-incompatible migration. Since the pre-deploy migration changed the schema before the code was deployed, user requests hit the incompatibility between the post-migrated database and old server code, and failed. We wanted to build a system that was more resilient and easier to write safe code.

The most common example was removing a column. After we implemented first-class support for teams to group users on Benchling, we wanted to remove the team_name column from the User model. Because all migrations were now pre-deploy, we needed to remove this in 2 deploy cycles:

  1. Remove all usages of the column in code
  2. Remove the column with a migration

(Otherwise the column would be removed too early, while existing code still depended on it.)

An engineer searched the codebase and removed all* usages of the column. We then ran the migration to drop the column in a separate deploy, and every query to the users table failed until the new code was deployed a few minutes later. The migration that we believed to be safe was actually backward-incompatible.

* We did have one remaining reference to the team_name column, on the User model itself. We have tests to ensure our database and SQLAlchemy models stay in sync, covered in a later section.²

However, because the team_name column was still on the User model, SQLAlchemy automatically used it in SELECTs and INSERTs of the model. When reading a user model, it tries to query the column. When creating a user, it tries to insert null for it. So, while the author thought they were safely removing the column with the migration, they actually weren't because its declaration in the SQLAlchemy model constituted a usage.

SQLAlchemy has two configuration options to truly remove its usage of the column. deferred tells it to stop querying the column. evaluates_none tells it to stop inserting null for it. But we didn't want the author or reviewer to have to remember these every time.

To make it easy for a developer to safely remove columns, we decided to write some abstractions on top of SQLAlchemy to help write backward-compatible migrations.

To remove columns, we made a simple column decorator deprecated_column that ensures the column is unused so it can be safely removed. It configures SQLAlchemy to ignore the column with deferred and evaluates_none. But more importantly, it checks every outgoing query and errors if it references the column. Thus tests for code that try to use the column fail. Next time, we simply decorated the column to remove with deprecated_column, removed usages it caught, and deployed that, then wrote a backward-compatible migration to safely remove it.

We also made it easier to rename columns. The renamed_to option we implemented automatically generates SQL triggers to copy values between the old and new columns on value change. This means renaming columns only requires 2 deploy cycles:

  1. Create the column with the new name with a migration, change all usages of the old column to new, decorate the old column with deprecated_column(renamed_to=new_column)
  2. Remove the old column with a migration

(Note: we automatically add triggers that copy from the old column to the new column, and from the new column to the old column. Maintaining equality between both columns is very important, since the deploy cycle may include some servers writing to the new column while others are reading from the old column.)

Our strategy was to extend the ORM to simplify writing backward-compatible migrations, and this worked well for us. Another strategy we plan to employ is automatically testing for backwards-compatibility. Since the post-migrated database and old code must always be compatible, we can test the same setup: running the full test suite of the pre-migration code against the post-migrated database. We confirmed with Quizlet that this strategy worked for them.

Auto-generation, migration timeouts, automatic retries, and compatibility checks ensured migrations were easy to write and run without affecting user requests. However, correctness still lay in the eyes of the author and reviewer. It was still possible to make schema changes that were wrong: the migration didn’t exactly match the schema, or the changes themselves didn’t follow our database best practices.

Same as our initial setup, we wrote migrations by auto-generating them and completing the unsupported changes. Save for manually checking, we didn’t know that the ORM schema matched the migrations, even though it was critical that the two are in sync.

We tested the same setup to check they matched. We initialized a database from SQLAlchemy models and a database by running all migrations from a base schema, and compared their schemas to verify no differences.

In addition to testing that the migrations are correct, we also want to avoid mistakes when making or changing models. We defined mistakes as a violations of invariants we hold true across our database. These are a set of living rules that include

  • every foreign key column must covered by an index
  • there are no redundant indexes
  • all joined-table-inheritance tables have a trigger to delete parent row when the child row is deleted

SQLAlchemy’s inspection API was powerful enough to automate these checks. We wrote tests to check that each invariant holds for each table in the database. These checks did not cover all possible mistakes for schema changes, but enabled us to declare the rules to follow in a programmatic way.

We have not automated every part of writing a migration. In particular, we still need code reviews for something as critical as a migration. Every migration must be reviewed by a normal reviewer and someone on a short list of approved migration reviewers. In practice, however, as a migration reviewer myself, I usually don’t have any comments — these tests took care of most of the comments I usually had.

  • Explicitly setting a lock_timeout and statement_timeout when running migrations prevents accidental downtime. (Consider doing this for all transactions.)
  • Automatically running migrations in the deploy process saves a lot of engineering hours. Automatically retrying these migrations on lock timeout increases our probability of success without hurting the system.
  • Automatically testing that migrations match the declarative data model prevents schema correctness issues when using SQLAlchemy and Alembic.
  • Automatically testing database invariants that we care about, like indexes on foreign keys, allows us to codify what would otherwise be on a reviewer checklist.
  • Automatically generating migrations saves developers a lot of time. Alembic comes with this out of the box.
  • Building tools to handle backward-incompatible changes, like deprecated_column and renamed_to, allows developers to make changes faster.

As a small engineering team, we’ve found that automating the majority of writing and running migrations has allowed us to iterate quickly; automated tests before deployment have prevented us from making mistakes; and automated safeguards during deployment have prevented us from causing downtime. While there are still manual parts like code reviews, we have found that the current balance of automation and manual work has allowed us to move fast without breaking things — while migrating things.

Taking a step back, iteration speed has been one of the most important success factors for a team like ours. The ability to get something out into the wild, receive feedback, and make improvements quickly is critical for us to make good product. While making changes to the user interface and internal API is often straightforward (not always!), changing fundamental data models was quite hard at one point. But with the right tools, we were able to improve our iteration speed for one part of this journey — database migrations.

If you’re interested in working on problems like this, we’re hiring!

Discuss on Hacker News

Thanks to Damon, Daniel, Ray, Scott, and Somak for reading drafts of this.

¹ This may sound like quite a lot of migrations (almost one per day) — it is! There are a lot of factors here that affected this:

  • We have a system that makes running migrations easier, so we often only “add functionality as we need it” vs. doing it once all upfront.
  • We generally store things in fully normalized formats (and have generally avoided e.g. JSON columns without strict schemas), so most data model changes result in a migration.
  • We have a very broad product that matches the complex and ever-changing nature of life sciences.

² We could remove it and allow ORM and migration schemas to drift out-of-sync, but this option doesn’t work as well to renaming columns.

³ Migrations must be performed in a particular order. If there are two migrations (one to drop a column, one to add a column) and the drop-column migration came first, there is no way to run that in a single deploy cycle without causing downtime. We could either split this into multiple deploys, or reorder the migrations.