Sooner or later every project will need to migrate production data, not only the schema. There are many different approaches in the wild – most of them overly complex. In this article we’ll tackle the problem using vanilla Rails.
Sane Data Migrations
In order to migrate data effectively we need to overcome several challanges:
- Accessing rows already present in the table.
- Avoiding breaking the migration with model and schema changes.
- Accessing rows from related tables.
- Updating columns.
All this is seemingly trivial but most projects I’ve seen got at least one of these wrong. The method we’ll describe is a good starting point even if your project is facing additional challanges.
The gist of the idea: create Active Record models inside the migration to access the required tables. Let’s illustrate this with an example.
Imagine we’re working on a multi-user blog system. Posts are currently accessed by ID. We’d like to start accessing them by a user-friendly slug based on the post title and author. An example title might look like
In order to migrate existing posts we need to do the following:
posts. We must allow
NULLvalues for now as existing posts lack slugs.
- Provide slugs for existing posts.
Let’s take a look at the migration first and discuss why things are the way they are afterwards:
class AddSlugsToPosts < ActiveRecord::Migration[5.2] # First, we define migration models to manipulate the data. class Post < ActiveRecord::Base belongs_to :user end class User < ActiveRecord::Base has_many :posts end def up # Second, add the column and make it NULL-able. add_column :posts, :slug, :string, null: true # Third, ensure the new column is picked up by the model. Post.reset_column_information! Post.includes(:user).find_each do |post| # Fourth, we set the slug and save the post. post.slug = post_slug(post) post.save! end # Fifth, make the column non-NULL. change_colum_null :posts, :slug, false end def down remove_column :posts, :slug end def post_slug(post) # IMPORTANT: Copy the title-to-slug algorithm from the production model # that was used AT THE TIME OF WRITING THE MIGRATION. end end
This migration may raise several questions:
- Why do we need to define models inside the migration?
- Isn’t copying the slug alorithm to the migration a violation of the DRY principle?
Let’s tackle these questions one-by-one.
First, we absolutely must not use the production
Post model as it would put us at risk of:
- Unknowingly executing callbacks that can trigger API calls, create other records, etc.
- Failing to save records due to validation errors.
- Breaking the migration after deploying changes to
Postthat are incompatible with the migration. For example, renaming the model would break the migration.
It’s also important that using production models in migrations may prevent us from restoring a backup. For instance, imagine we make a backup on Monday, run the migration on Tuesday, and delete the
Post model on Wednesday. Then on Thursday we need to restore Monday’s backup. We won’t be able to migrate to the most recent database schema as the Tuesday migration will fail because
Post is no longer present in the code base.
I recommend naming models after tables. For example, if the
Post model is backed up by the
articles table then the migration should define
Article instead of
Post. We should focus on the schema and the data while writing the migration and any non-obvious mapping from tables to models is a distraction.
After defining migration models, we can manipulate the schema. The rules of safe and efficient migrations are a topic for another article. In the example above, we need to make the column
nil-able before providing slugs for existing posts.
We also need to keep in mind that Active Record caches schema information so whenever we modify it we must force a refresh by calling
.reset_column_information!. It’s best to always call it after modifying a table even if theoretically our migration would work without it. The call is effectively free and there’s one less thing for us to think about.
After adding the column, we iterate over existing posts and set their slugs. What may seem controversial here is that we copied
post_slug to the migration. The reason for that is the same as the reason for having migration models – avoiding breaking migration with production code changes. As mentioned previously, this is especially important when restoring backups.
Last but not least, we are not violating DRY by replicating that code in the migration. It’s easy to see why if we spell out the definition of
post_slug in the migration and the production model:
Postis the algorithm we’re using now.
post_slugin the migration is the algorithm we were using at the time migration was written.
The fact that the two definitions coincide while we’re writing the migration doesn’t mean they’re identical. They are not and trying to DRY up code like that is a mistake.
What Not to Do
The method above is sufficient in most cases experienced in practice. Let’s discuss what not to do before closing the topic:
- One-off rake tasks or scripts as they make it impossible to automate deployments and make it difficult to restore backups as we’d need to go through them again and run them and migrations in the right order. It doesn’t sound like something to do when restoring a backup.
- Custom commands run from a Rails console as this is even worse than rake tasks. There’s no auditiability of the changes plus we won’t remember what we did 2 weeks ago when restoring a backup.
- Separate data migration tracks (e.g. as implemented by
data-migrate). They increase complexity without offering any benefits. In the end we need to run all the migrations in the right order and splitting them into multiple directories doesn’t help.
All these approaches require comparable effort, provide no benefit over our approach, and have serious downsides.
Migrating the data, as opposed to the schema, is a frequent problem that is often solved in overly complex ways. Using a regular migration with migration-specific model definitions allows us to use regular migrations, keep deployments automated and avoid complexity of third-party solutions.