At Shopify, we run a large fleet of MySQL servers, with numerous replica-sets (internally known as “shards”) spread across three Google Cloud Platform (GCP) regions. Given the petabyte-scale size and criticality of this data, we need a robust and efficient backup and restore solution. We drastically reduced our Recovery Time Objective (RTO) to under 30 minutes by redesigning our tooling around disk-based snapshots, and we want to share how it was done.

Challenges with Existing Tools

For several years, we backed up our MySQL data using Percona’s XtraBackup utility, stored its output in files, and archived them on Google Cloud Storage (GCS). While this approach was robust, it posed significant challenges for both backup and restore. Backing up a petabyte of data spread across multiple regions simply took too long, and that time was increasingly hard to improve. We perform backups in all availability regions to decrease the time it takes to restore data cross-region; even so, restoring a single shard took more than six hours, which forced us to accept a very high RTO.

While this lengthy restore time was painful when using backups for disaster recovery, we also rely on backups for day-to-day tasks, such as rebuilding replicas. Long restore times impaired our ability to scale replicas up and down within a cluster, for example when shifting read traffic onto replicas.

Overcoming Challenges

Since we run our MySQL servers on GCP’s Compute Engine VMs using Persistent Disk (PD) volumes for storage, we invested time in leveraging PD’s snapshot feature. Using snapshots was simple enough, conceptually. In terms of storage, each initial snapshot of a PD volume is a full copy of the data, whereas the subsequent ones are automatically incremental, storing only data that has changed.

In our benchmarks, an initial snapshot of a multi-terabyte PD volume took around 20 minutes, and each incremental snapshot typically took less than 10 minutes. The incremental nature of PD snapshots lets us snapshot disks very frequently, keeps a near-latest copy of the data on hand, and minimizes our Mean Time To Recovery.
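To make the mechanics concrete, here’s a minimal sketch of taking a PD snapshot through the google-cloud-compute Python client. This isn’t our production tooling, and the project, zone, and disk names are placeholders:

```python
# Minimal sketch: snapshot a PD volume via the google-cloud-compute client.
# All names here are placeholders, not real production values.
from google.cloud import compute_v1

def snapshot_disk(project: str, zone: str, disk: str, snapshot_name: str) -> None:
    client = compute_v1.DisksClient()
    op = client.create_snapshot(
        project=project,
        zone=zone,
        disk=disk,
        snapshot_resource=compute_v1.Snapshot(name=snapshot_name),
    )
    op.result()  # blocks until the long-running operation completes; raises on failure

snapshot_disk("example-project", "us-east1-b", "mysql-data-disk", "mysql-data-20240101")
```

Note that whether a given snapshot ends up full or incremental is entirely GCP’s concern: the API call is identical either way, which keeps the tooling simple.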

Modernizing our Backup Infrastructure

We built our new backup tooling around the GCP API for invoking PD snapshots. The tooling takes into account availability regions and zones, the role of each MySQL instance (replica or master), and other MySQL consistency variables. We deployed it in our Kubernetes infrastructure as CronJobs, which makes the jobs distributed rather than tied to individual MySQL VMs, so we don’t have to handle coordination when a host fails. The CronJob runs every 15 minutes across all the clusters in all of our available regions, which also helps us avoid the cost of transferring snapshots across regions.

Backup workflow selecting replica and calling disk API to snapshot, per cron schedule
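In spirit, each scheduled run does something like the sketch below: pick the least-lagged replica in the shard and snapshot its data disk. The Instance shape, lag field, and naming scheme are hypothetical stand-ins for our internal inventory, not actual Shopify code:

```python
# Hypothetical sketch of one scheduled backup run for a shard: pick the
# least-lagged replica (never the master) and snapshot its data disk.
from dataclasses import dataclass
from datetime import datetime, timezone

from google.cloud import compute_v1

@dataclass
class Instance:
    name: str          # Compute Engine VM name
    zone: str          # e.g. "us-east1-b"
    disk: str          # PD volume backing the MySQL datadir
    is_replica: bool   # False for the master
    lag_seconds: int   # current replication lag

def pick_replica(instances: list[Instance]) -> Instance:
    # Snapshot the freshest replica so the backup trails the master minimally.
    replicas = [i for i in instances if i.is_replica]
    return min(replicas, key=lambda i: i.lag_seconds)

def backup_shard(project: str, shard: str, instances: list[Instance]) -> str:
    target = pick_replica(instances)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    snapshot_name = f"{shard}-{stamp}"
    op = compute_v1.DisksClient().create_snapshot(
        project=project,
        zone=target.zone,
        disk=target.disk,
        snapshot_resource=compute_v1.Snapshot(name=snapshot_name),
    )
    op.result()
    return snapshot_name
```

Selecting a replica, as the workflow above does, keeps snapshot overhead off the master’s write path.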

The backup tooling creates snapshots of our MySQL instances nearly 100 times a day across all of our shards, totaling thousands of snapshots every day with virtually no failures.

Since we snapshot so frequently, snapshot storage could easily cost thousands of dollars every day if snapshots aren’t deleted correctly. To ensure we only keep (and pay for) what we actually need, we built a framework to establish a retention policy that meets our Disaster Recovery needs. The tooling enforcing our retention policy is deployed and managed using Kubernetes, similar to the snapshot CronJobs. We create thousands of snapshots every day, but we also delete thousands of them: per region, we keep only the latest two snapshots for each shard, plus dailies, weeklies, and so on, according to our retention policy.

Backup retention workflow, listing and deleting snapshots outside of retention policy
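A simplified version of that retention pass might look like the following. It keeps only the newest two snapshots per shard, elides the daily and weekly tiers, and assumes the hypothetical "<shard>-<YYYYmmdd>-<HHMMSS>" naming convention from the backup sketch above:

```python
# Illustrative retention pass: keep the newest two snapshots per shard and
# delete the rest. Daily/weekly tiers are elided, and snapshot names are
# assumed to follow the "<shard>-<YYYYmmdd>-<HHMMSS>" scheme sketched earlier.
from collections import defaultdict
from datetime import datetime

from google.cloud import compute_v1

def enforce_retention(project: str, keep_latest: int = 2) -> None:
    client = compute_v1.SnapshotsClient()
    by_shard = defaultdict(list)
    for snap in client.list(project=project):
        shard = snap.name.rsplit("-", 2)[0]  # strip the timestamp suffix
        by_shard[shard].append(snap)
    for snaps in by_shard.values():
        # creation_timestamp is an RFC 3339 string, e.g. "2024-01-01T00:00:00.000-08:00"
        snaps.sort(key=lambda s: datetime.fromisoformat(s.creation_timestamp),
                   reverse=True)
        for snap in snaps[keep_latest:]:
            client.delete(project=project, snapshot=snap.name).result()
```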

Performing a Restore

Always having a very recent snapshot at the ready means we can use these snapshots to clone replicas with the freshest data possible. Because restoring a snapshot is just a matter of exporting it to a new PD volume, which takes very little time, our RTO is now typically under 30 minutes, including recovery from replication lag.

Backup restore workflow, selecting a latest snapshot and exporting to disk and attaching to a VM

Additionally, restoring a backup is now quite simple: we create new PD volumes with the latest snapshot as their source and start MySQL on top of those disks. Since our snapshots are taken while MySQL is online, the restored instance must first go through InnoDB crash recovery, and within a few minutes it’s ready to serve production queries.
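In essence, the restore path reduces to two API calls, sketched below: create a disk from the snapshot, then attach it to the target VM. All names and the pd-ssd disk type are placeholders, not our actual configuration:

```python
# Rough sketch of the restore path: materialize a new PD volume from a
# snapshot, then attach it to the MySQL VM. Names and disk type are placeholders.
from google.cloud import compute_v1

def restore_from_snapshot(project: str, zone: str, snapshot: str,
                          new_disk: str, instance: str) -> None:
    disks = compute_v1.DisksClient()
    disks.insert(
        project=project,
        zone=zone,
        disk_resource=compute_v1.Disk(
            name=new_disk,
            source_snapshot=f"global/snapshots/{snapshot}",
            type_=f"zones/{zone}/diskTypes/pd-ssd",
        ),
    ).result()

    # Attach the restored volume; on startup, MySQL runs InnoDB crash
    # recovery against it before the instance can serve queries.
    compute_v1.InstancesClient().attach_disk(
        project=project,
        zone=zone,
        instance=instance,
        attached_disk_resource=compute_v1.AttachedDisk(
            source=f"projects/{project}/zones/{zone}/disks/{new_disk}",
            device_name=new_disk,
        ),
    ).result()
```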

Assuring Data Integrity and Reliability

While PD snapshot-based backups are fast and efficient, we also needed to ensure that they’re reliable. We run a backup verification process for all of the daily backups that we retain, which means verifying two daily snapshots per shard, per region.

In our backup verification tooling, we export each retained snapshot to a PD volume, attach it to a Kubernetes Job, and verify the following: