The Joy and Pain of using Google Bigtable

By Jono MacDougall

Last year, I wrote about Ravelin's use of, and displeasure with, DynamoDB. After some time battling that database we decided to put it aside and pick up a new battle, Google Bigtable. We have now had a year and a half of using Bigtable and have learned a lot along the way. We have been very impressed by Bigtable: it can absorb almost any load we throw at it, but it isn't without its own eccentricities and issues. Today I want to go through some of those lessons.

As previously discussed, Bigtable was able to absorb the exact same workload that DynamoDB had been failing to handle, while costing us less. Looking at the metrics, not only was it achieving that workload, it was faster and only using a small percentage of the CPU in our cluster. Things were looking very good.

Bigtable CPU usage under the initial workload. This is with the minimum set of nodes. Even with wild over-provisioning, DynamoDB would consistently throw throughput exceptions under this load.
A load test of Bigtable. After scaling our node count, performance was steady under heavy load.

We were impressed with this and, after some further load testing, decided to take the step of pushing more of our mission-critical, real-time workloads on to the database. These workloads had previously been served by a single Postgres instance (a very large Postgres instance). As we did not want to be stuck scaling an SQL database forever, we made the decision early on not to use certain features of Postgres. In particular, no joins, no transactions and an effort to keep indexes (and therefore queryable fields) to a minimum. If we had used all the bells and whistles Postgres provides, this switch would have been more challenging.

In fact, as a general rule, you should always keep things simple even if your tools allow for more complex patterns. These patterns often bite you later. Complexity means bugs, challenging maintenance, vendor lock-in, scaling issues or, perhaps most importantly, the cognitive overhead of getting yourself and new engineers to understand these complex patterns. Always lean into the few things your tool is advertised to do well. This applies equally to your code.

So we moved away from Postgres and split our workload between Bigtable and Elasticsearch. Elasticsearch is used to power our dashboard, which requires more complex query patterns, while Bigtable powers our realtime fraud API.

From the start, we were impressed with how much traffic Bigtable could handle. We lowered our monthly bill while increasing the amount of scaling headroom in our storage layer.

But it wasn’t all roses.

Our initial implementation relied on scanning Bigtable. This was a mistake.

First, a quick primer on Bigtable:
Bigtable is essentially a giant, sorted, three-dimensional map. The first dimension is the row key. One can look up any row given a row key very quickly. You can also scan rows in lexicographic key order quickly. You can start and end the scan at any given place. One caveat is you can only scan one way. Which is annoying. But ho hum.
The second dimension is the columns within a row. Columns are keyed by a string and can be segmented into column families. From what we have seen with our use of column families, they are not much more than a prefix on your column key. You can filter columns individually or via column families to return only those columns you are interested in.
Finally, every column contains a set of cells. Cells hold your values which are just bytes. This is the third dimension of our 3D map. Cells are indexed by a timestamp in milliseconds. You can filter your request to ask for the latest cell or a custom range of cells.
So any particular value in Bigtable is identified by its row key, column key (including column family) and cell timestamp.
Congratulations, you are now a Bigtable expert!
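To make that concrete, here is a minimal sketch of a single-row lookup using the official Go client (cloud.google.com/go/bigtable). The project, instance, table, family and row key names are made up for illustration; they are not our actual schema.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	// Project, instance, table and row key names here are illustrative only.
	client, err := bigtable.NewClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatalf("bigtable.NewClient: %v", err)
	}
	defer client.Close()

	tbl := client.Open("orders")

	// A value is addressed by row key + column family + column + cell timestamp.
	// Here we fetch one row and keep only the latest cell per column.
	row, err := tbl.ReadRow(ctx, "customer#1234#order#5678",
		bigtable.RowFilter(bigtable.LatestNFilter(1)))
	if err != nil {
		log.Fatalf("ReadRow: %v", err)
	}
	for family, items := range row {
		for _, item := range items {
			fmt.Printf("family=%s column=%s ts=%v bytes=%d\n",
				family, item.Column, item.Timestamp, len(item.Value))
		}
	}
}
```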

We had leaned on the fact that we can do fast, ordered scans across a range of row keys. We constructed a key scheme that meant all the data we would want for the feature extraction process in our fraud API would live under a particular key prefix. Then we would scan over that prefix at feature extraction time.
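A prefix scan of that sort looks roughly like the sketch below with the Go client. The key scheme ("customer#<id>#…") is purely illustrative, and tbl is a table opened from a client as in the earlier sketch.

```go
package main

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigtable"
)

// scanCustomer scans every row whose key starts with a customer prefix, in
// lexicographic order, keeping only the latest cell per column.
func scanCustomer(ctx context.Context, tbl *bigtable.Table, customerID string) error {
	prefix := "customer#" + customerID + "#"
	return tbl.ReadRows(ctx, bigtable.PrefixRange(prefix),
		func(row bigtable.Row) bool {
			// One callback per row; return false to stop the scan early.
			fmt.Println("row:", row.Key())
			return true
		},
		bigtable.RowFilter(bigtable.LatestNFilter(1)),
	)
}
```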

Scan performance was… variable. We found that it could be reasonably quick (sub-50ms), but our 99th percentile timings were not great. The latency was often proportional to the number of rows in the scan, but sometimes it wasn't: sometimes even small numbers of rows were slower than we would like, on the order of hundreds of milliseconds.

To solve this we decided to change the way we were storing our data. Instead of spreading it across multiple rows, we would blob the data together and store it in one row. Each entity type is stored under a single column, with each individual entity in its own cell, in time order. This improved our 95th percentile latency by about 3x. We had to use some tricks to ensure we didn't get overly large rows. Luckily, those sorts of rows were by far the exception rather than the rule.
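A rough sketch of that blobbed layout is below. The family and column names and the one-blob-per-cell serialization are assumptions for illustration; the real code also has to deal with row-size limits.

```go
package main

import (
	"context"

	"cloud.google.com/go/bigtable"
)

// appendEntity appends one serialized entity as a new cell under a single
// column, so all entities of that type live in one row and come back in
// timestamp order.
func appendEntity(ctx context.Context, tbl *bigtable.Table, rowKey string, blob []byte) error {
	mut := bigtable.NewMutation()
	mut.Set("entities", "order", bigtable.Now(), blob)
	return tbl.Apply(ctx, rowKey, mut)
}

// latestEntities reads back the newest n cells, i.e. the n most recent entities.
func latestEntities(ctx context.Context, tbl *bigtable.Table, rowKey string, n int) (bigtable.Row, error) {
	return tbl.ReadRow(ctx, rowKey, bigtable.RowFilter(bigtable.ChainFilters(
		bigtable.ColumnFilter("order"),
		bigtable.LatestNFilter(n),
	)))
}
```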

While this was better, it wasn’t the panacea we hoped for.

This is true not just of Bigtable but of any managed service: when they go wrong, they can be the devil to debug. You don't have access to the service directly — you are at the whims of whatever tooling your provider has given you. In a perfect world, this would be fine. But nothing is perfect and things do go wrong. It is a trade-off you have to make when using managed services. You will have problems that you are powerless to solve or even understand.

Tooling, monitoring and support are some of the most important features of any managed service. They must be a core part of your decision making process.

In our case, these problems manifested as periodic, short-lived unavailability.

Once or twice a week, for a period of less than a minute, often less than 10 seconds, a portion of requests (typically less than 1%) to Bigtable would fail or become extremely slow. During this time, retries would also fail, meaning we would end up sending a 500 back to our client. While 1% isn't a total outage, it does equate to a significant absolute number of requests when you are at high throughput.

A particularly bad case. The top graph shows latency against our Bigtable cluster while the bottom is 500s as reported from our load balancer. More typical cases would involve the loss of 20–60 requests but this happened to hit when we had higher throughput. It recovered within 1 minute.

So we dug in to investigate.

When we first started using Bigtable, the monitoring and tooling weren't great. We used the Datadog integration, which gave us a decent number of metrics, but it was still challenging to get some of the metrics we wanted. The Bigtable dashboard in the cloud console was fairly bare bones, giving you only a handful of metrics and missing latency metrics entirely. It was not up to the task of debugging this issue.

You do not get any logs at all out of Bigtable and no visibility into cluster operations. There is no way to see if an instance was lost. No indication of internal operations such as rebalancing. No visibility into error logs or logs of any kind from the cluster. This is a huge sore point. If we could see some of this then we could at the very least rule out what might be causing the problems.

We dug into the client. We forked the Golang library to add instrumentation everywhere we could find. While this did give us great insight into the Bigtable interface and into gRPC in general, it did not give us any indication as to why we saw these occasional issues.
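If you want similar timing data without forking the client, one option (not what we did, but roughly equivalent) is to hang a timing interceptor off the gRPC dial options. A minimal sketch, covering unary RPCs only; streaming calls such as ReadRows would need a stream interceptor as well:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigtable"
	"google.golang.org/api/option"
	"google.golang.org/grpc"
)

// timedClient builds a Bigtable client whose unary RPCs are timed and logged
// via a gRPC client interceptor.
func timedClient(ctx context.Context, project, instance string) (*bigtable.Client, error) {
	timing := func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		start := time.Now()
		err := invoker(ctx, method, req, reply, cc, opts...)
		log.Printf("bigtable rpc=%s took=%s err=%v", method, time.Since(start), err)
		return err
	}
	return bigtable.NewClient(ctx, project, instance,
		option.WithGRPCDialOption(grpc.WithUnaryInterceptor(timing)))
}
```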

Over the past year the tooling has gotten much better. The Key Visualizer is fantastic. Anyone who hasn’t looked at it and is using Bigtable should stop reading and check it out right away. It makes it very clear if you have a hot key or large rows. Unfortunately for us, we did not have something as easy to fix as a hot key or large row.

The console dashboard has improved as well, though it could still use work. Oddly, it still does not show response times; for us, this is the most important metric, and it is hidden away in Stackdriver.

After much investigation, we are still unsure of the root cause.

So, in our desperation, we just started making guesses.

  • The cluster is “rebalancing”. Sounds like a real thing right?
  • We lost a node. Maybe? No way to tell in the tooling they give us.
  • There was an on host migration. This happens to our compute instance nodes and we have seen it cause small periods of unavailability so could be?
  • Network. It is always networking. The ephemeral wisp that is a networking error. The ever present cry of the developer who has no idea what is going on… It’s gotta be the network — It’s almost never the network.
  • Wait, what about authentication! It has to renew the token at some point and there is a brief period where we fail? Like a race or something. Could be right? Probably not…
  • So it’s networking then.
  • Or no, rebalancing because it seems to happen slightly more often after we do a cluster resize
  • But we didn’t do a cluster resize last time the issue struck
  • Plus the resize before that was smooth!
  • We probably lost a node then
  • arrrrrgggg…………………….

At the end of the day, we had all developed a sort of Bigtable religion: some sort of pseudoscience about why these unavailability events would happen.

During the process, we engaged Google support. I have mixed feelings about their support. It can take quite some time to get past the first layer of support and through to someone who actually knows what they are talking about. There is an ever-present feeling that they care more about closing your ticket than they do about fixing your problem. But once you do get through, and once you make the scope of the problem clear, they can be useful. We have had multiple sit-downs with engineers working directly on Bigtable. We regularly get visits from the PM of the Bigtable project. We discovered that, yes, sometimes the issue was because we had lost an instance, but that couldn't explain everything, only a few occurrences. While the time they gave us was admirable and their intentions were good, they struggled to find us a root cause for the issue.

CAP theorem provides the only real answer we have to our problem, however unsatisfying that answer is. Bigtable is strongly consistent. This is a great feature, but Bigtable has paid dearly for it. It has to sacrifice availability at the altar of the gods of distribution (I told you we made a religion out of this!).

CAP theorem is a theorem about distributed systems. It states that you can only have two of the following three properties:
  • Consistency: the data is the same everywhere. If you do a write and then a read, that read will contain that write.
  • Availability: it stays up.
  • Partition Tolerance: if one instance can't talk to another, the cluster can survive.
You kind of need partition tolerance in most cases. If you don’t have it then you can get in a real mess when (I do mean when, not if) a partition eventually happens. But you can trade between the other two properties. Cassandra trades consistency for availability, for example. Many NoSQL solutions make that trade. Bigtable went for consistency.
Congratulations, you are now a distributed systems expert!

While this isn’t the most satisfying answer, it also gives us our solution. Last year Google released region replication for Bigtable. Region replication allows you to spin up another Bigtable cluster in a separate region with data replicated across both clusters. The replication guarantees eventual consistency. It can be used to increase availability.

Our plan was to spin up another cluster and send retries to that cluster. The interface actually allows this to happen automatically. Unfortunately, we do multiple operations on Bigtable in one request to our API and we rely on strong consistency between those operations, so the automated failover will not work for us. Instead, we retry the whole request against the other cluster. This includes all the many Bigtable operations that we do, ensuring we are strongly consistent within a single request while being eventually consistent across requests when in a failure state. We can tolerate this level of eventual consistency.
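Schematically, the failover looks something like the sketch below. The app profile IDs are hypothetical and the real routing and retry logic is more involved, but the shape is: pin each client to one cluster via a single-cluster-routing app profile and retry the whole request against the secondary if the primary fails.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
)

// failoverStore holds two clients pinned to different clusters via
// single-cluster-routing app profiles (the profile IDs are hypothetical).
type failoverStore struct {
	primary, secondary *bigtable.Client
}

func newFailoverStore(ctx context.Context, project, instance string) (*failoverStore, error) {
	p, err := bigtable.NewClientWithConfig(ctx, project, instance,
		bigtable.ClientConfig{AppProfile: "primary-cluster-only"})
	if err != nil {
		return nil, err
	}
	s, err := bigtable.NewClientWithConfig(ctx, project, instance,
		bigtable.ClientConfig{AppProfile: "secondary-cluster-only"})
	if err != nil {
		return nil, err
	}
	return &failoverStore{primary: p, secondary: s}, nil
}

// do runs the full set of Bigtable operations for one API request against the
// primary cluster, retrying the whole thing on the secondary if it fails.
// Each request therefore stays strongly consistent within a single cluster.
func (f *failoverStore) do(ctx context.Context, op func(context.Context, *bigtable.Client) error) error {
	if err := op(ctx, f.primary); err != nil {
		log.Printf("primary cluster failed, retrying whole request on secondary: %v", err)
		return op(ctx, f.secondary)
	}
	return nil
}
```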

This has largely solved our problem. It has also given us the ability to give better QoS for our most critical traffic. Tables that are less important to us are routed to our secondary cluster while our primary cluster only handles traffic that is of critical importance.

If you use Bigtable, you should be using region replication and failover. Otherwise, you will see failed requests.
A more recent case of Bigtable freaking out. The blue line is the latency of our primary cluster. The purple is our secondary cluster. Notice that the issue only strikes our primary cluster. The bottom graph is blessedly clear of 500s over that same period.

Bigtable is certainly not without its problems, but now that replication is available those problems are manageable. We continue to be impressed with its ability to absorb high-throughput traffic without breaking a sweat. Scaling is a simple push-button operation that requires no other management or configuration. We wish it had better tooling, particularly around logging, and availability is a problem, but overall it will handle what you throw at it.