How we’re scaling our platform for spikes in customer demand

By Luke Demi

In 2017, the world saw an explosion in the popularity of cryptocurrency, and the ecosystem’s total market cap jumped from $20 billion to $600 billion. During this time, almost every component of Coinbase’s technology was battle-tested, and the experience showed our team that we needed to focus on the reliability and scalability of our platform just as we do on security. At MongoDB World 2018, Michael de Hoog, Jordan Sitkin, and I, all engineers at Coinbase, gave a talk about the lessons we learned in 2017 and how we’re scaling our platform now. You can watch the talk here or read our recap below.

Our traffic patterns in 2016, the year before the explosion in cryptocurrency popularity, had been remarkably consistent. Ahead of this boom, if we had drawn a red line where we expected our platform to experience issues, we would have put it somewhere around four or five times our typical daily maximum traffic of about 100,000 backend API requests per minute.

Here’s a quick look at backend requests per minute in 2016 before the price of ether skyrocketed.

However, in May and June of 2017, the price of ether skyrocketed and traffic blew past that red line. On several days during this period we sustained traffic in the red zone, and we suffered periods of downtime as a result.

During the early heavy traffic period in 2017, here’s what backend requests per minute looked like.

To solve these scalability issues fast, the Coinbase engineering team started by focusing on the low-hanging fruit in our environment. We worked around the clock on tasks like vertically scaling, upgrading database versions to take advantage of performance improvements, optimizing indexes, and splitting hotspot collections out into their own clusters. Each of these improvements bought us headroom, but the low-hanging fruit was running out and traffic continued to climb.

During each outage, the pattern was the same: our primary monitoring platform would show a 100x spike in latency, along with a strange 50/50 split between Ruby and MongoDB time. Since MongoDB was our primary datastore, it made sense that MongoDB time would spike during periods of heavy traffic, but the Ruby time wasn’t adding up.

In earlier monitoring systems, this is how “the Ghost” appeared.

We affectionately began to refer to this issue as “the Ghost,” as our existing monitoring tools were unable to provide clear answers to some of our most critical questions. Where were these queries coming from? What did they look like? Why was there a correlated spike in Ruby time? Could the issue be originating on the application side?

Simply put, our existing monitoring services could not make use of all the context available inside our environment. We needed a framework for exploring and visualizing the relationships between our environment’s components.

We began to further instrument database queries by modifying MongoDB’s database driver to log all queries above a certain response time threshold, along with important context like the request/response size, response time, source line of code, and query shape.

Here’s a glance at the important context to be logged on all slow MongoDB queries.

Our improved instrumentation gave us detailed data that let us quickly home in on some strange characteristics, present even outside of outages. The first major outlier we saw was an extremely large response object originating from a device find query. These queries generated heavy network load whenever users signed in to make purchases or view the dashboard.

The reason for these extremely large responses was that we had modeled the relationship between users and devices as many-to-many: a user might have multiple devices, and a device might have multiple users. A poor device-fingerprinting algorithm had bucketed a huge number of users into the same device, resulting in a single device object with a massive array of user_ids.

To solve the issue, we refactored this relationship to simply be a one-to-many relationship, where each device maps to just a single user. The performance impact was dramatic and gave Coinbase its single biggest performance boost in 2017.
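To make the modeling change concrete, here is a toy Ruby comparison of the two document shapes (the field names are hypothetical). The point is that the one-to-many layout bounds the size of any single document, so a pathological device fingerprint can no longer inflate a response.

```ruby
# Before (many-to-many): one device document accumulates every colliding user,
# so a bad fingerprinting algorithm piles users into a single huge array.
device_before = {
  "_id"      => "device-abc",
  "user_ids" => (1..50_000).to_a  # fingerprint collisions land here
}

# After (one-to-many): each device document maps to exactly one user, so any
# single document stays small and bounded regardless of collisions.
device_after = {
  "_id"         => "device-abc:42",
  "fingerprint" => "device-abc",
  "user_id"     => 42
}

# The response-size difference is what drove the performance win.
before_bytes = device_before.to_s.bytesize
after_bytes  = device_after.to_s.bytesize
```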

This finding illustrated the power of good monitoring. Before granularly instrumenting our database queries, this was a near-impossible issue to debug. With the new tools, it was now obvious.

Another issue we set out to solve was large read throughput on certain collections. We decided to add a query-caching layer that would cache query results in Memcached. Any single document queries on certain collections would first query the cache before querying the database, and any database writes would also invalidate the cache.
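The caching pattern can be sketched in a few lines of Ruby. Here a plain Hash stands in for Memcached, and the class and method names are illustrative rather than the actual ORM hooks; the essentials are the read-through `fetch` and the invalidation on write.

```ruby
# Minimal sketch of a read-through query cache with write invalidation.
# A Hash stands in for Memcached; names are illustrative only.
class QueryCache
  def initialize(store = {})
    @store = store  # swap in a Memcached client in production
  end

  # Read-through: serve single-document queries from the cache when possible,
  # otherwise run the query block and cache its result.
  def fetch(collection, id)
    key = cache_key(collection, id)
    return @store[key] if @store.key?(key)
    @store[key] = yield
  end

  # Write path: persist via the block, then invalidate the stale entry so the
  # next read repopulates from the database.
  def write(collection, id)
    result = yield
    @store.delete(cache_key(collection, id))
    result
  end

  private

  def cache_key(collection, id)
    "#{collection}:#{id}"
  end
end
```

Because writes delete the cached entry rather than updating it in place, the next read repopulates the cache from the database, which avoids serving stale results.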

We were able to roll out this change across a number of database clusters simultaneously. The query cache was written at the ORM and driver level, which allowed us to affect multiple problematic clusters at once.

As it turns out, the massive surge in traffic we experienced in May and June was nothing compared to the surge we experienced just a few months later in December and January. With the help of these fixes and others, we were able to withstand even larger surges in traffic.

The spike from early 2017 is just a blip compared to December and January.

Today we’re proactively working to make sure we’re prepared for the next surge in cryptocurrency interest. It was easy to find these improvements in the heat of a real firefight; we needed a way to keep improving performance during quieter periods of traffic as well. The obvious answer is to load test our environment by emulating traffic patterns at several times the levels we’ve seen in the past, to discover where our next weak point could originate.

Our chosen solution is to perform traffic capture and playback, specifically against our databases, to generate artificial “crypto mania” on demand. For us, this method is preferable to synthetic traffic generation, since it removes the need to keep synthetic scripts up to date. Every time we run the suite, we can be sure the queries match exactly the kind of traffic our application produces, because they come from captured data.

To do this, we built a tool called Capture, which wraps an existing tool called mongoreplay. After choosing a specific cluster in our environment, Capture simultaneously kicks off a disk snapshot and begins capturing the raw traffic our application servers direct at that cluster. It then saves encrypted versions of these captures to S3 for playback at a later time. When we’re ready to perform the playback, another tool called Cannon, also based on mongoreplay, plays the recorded traffic back against a freshly launched cluster built from the earlier snapshot.

One challenge we faced was how to capture all of the MongoDB traffic for a single cluster across several application servers simultaneously. Cannon solves this by opening a 10MB buffer per capture and merging and filtering the streams as it reads them.

The final result is one merged capture file which can then be targeted by Cannon toward a freshly launched MongoDB cluster. Cannon allows you to choose exactly the speed to replay the capture in order to simulate loads thousands of times larger than what we may be experiencing on any given day.
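As a simplified illustration of these two steps (the real tools work on mongoreplay’s binary capture format through bounded buffers), here is a Ruby sketch of a k-way merge that orders messages from several per-server capture streams into a single timeline, plus the speed multiplier applied at playback time. The data shapes and function names are hypothetical.

```ruby
# Merge per-server capture streams into one timeline ordered by timestamp.
# Each stream is assumed to be an array of { ts:, op: } hashes, already
# sorted by its own capture time (a stand-in for the binary capture files).
def merge_captures(streams)
  merged = []
  cursors = Array.new(streams.size, 0)  # next unconsumed message per stream
  loop do
    # Pick the stream whose next message has the earliest timestamp.
    next_idx = (0...streams.size)
      .select { |i| cursors[i] < streams[i].size }
      .min_by { |i| streams[i][cursors[i]][:ts] }
    break if next_idx.nil?
    merged << streams[next_idx][cursors[next_idx]]
    cursors[next_idx] += 1
  end
  merged
end

# Playback speed is then just a divisor on inter-message gaps: replaying the
# merged capture at speed 10 waits one tenth of each original gap.
def playback_delays(merged, speed)
  merged.each_cons(2).map { |a, b| (b[:ts] - a[:ts]) / speed.to_f }
end
```

Compressing the gaps this way is how a normal day’s capture can simulate a load many times larger than what the cluster saw when the traffic was recorded.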

We’re just getting started with Capture and Cannon and are excited to see what we discover as we run these load tests against all of our MongoDB clusters.

One major discovery from our work with Capture and Cannon came out of one of Cannon’s debug features: the ability to inspect a specific capture file and see the first 100 messages in it. Upon inspection, we noticed something interesting:

Notice the ping commands intermingled with the finds? It turns out that the MongoDB Ruby driver was not correctly following the MongoDB driver spec and was issuing a ping command (to check replica set state) alongside every query to the database. While this behavior was unlikely to have caused our downtime-related issues, it was almost certainly the source of the “ghost-like” behavior we observed in our monitoring.

After all the effort put into tackling these challenges, we’re proud of the current state of reliability at Coinbase. The events of 2017 reaffirmed that customers’ ability to access and view their funds is critical to our goal of being the most trusted place to buy, sell, and manage cryptocurrency. While security has always been our number one priority, we’re excited to make the reliability of our platform a top priority too!

We’ve formed three separate performance and reliability specific teams to prepare for future waves of cryptocurrency enthusiasm. If you’re interested in these types of challenges, please visit our careers page!

Except where noted herein, the images displayed are from Coinbase.