Last year, I wrote about Ravelin’s use of, and displeasure with, DynamoDB. After some time battling that database we decided to put it aside and pick a new battle: Google Bigtable. We have now had a year and a half of using Bigtable and have learned a lot along the way. We have been very impressed by Bigtable: it can absorb almost any load we throw at it, but it isn’t without its own eccentricities and issues. Today I want to go through some of those lessons.
As previously discussed, Bigtable was able to absorb the exact same workload that DynamoDB was failing to handle, while costing us less. Looking at the metrics, not only was it achieving that workload, it was faster and used only a small percentage of the CPU in our cluster. Things were looking very good.
We were impressed with this and, after some further load testing, decided to push more of our mission-critical, real-time workloads onto the database. These workloads had previously been served by a single Postgres instance (a very large Postgres instance). As we did not want to be stuck scaling a SQL database forever, we made the decision early on not to use certain features of Postgres. In particular: no joins, no transactions, and an effort to keep indexes (and therefore queryable fields) to a minimum. If we had used all the bells and whistles Postgres provides, this switch would have been more challenging.
In fact, as a general rule, you should keep things simple even if your tools allow for more complex patterns. These patterns often bite you later. Complexity brings bugs, challenging maintenance, vendor lock-in, scaling issues and, perhaps most importantly, the cognitive overhead of having yourself and new engineers understand these complex patterns. Always lean on the few features your tool is advertised to do well. This applies equally to your code.
So we moved away from Postgres and split our workload between Bigtable and Elasticsearch. Elasticsearch powers our dashboard, which requires more complex query patterns, while Bigtable powers our real-time fraud API.
From the start, we were impressed with how much traffic Bigtable could handle. We lowered our monthly bill while increasing the amount of scaling headroom in our storage layer.
But it wasn’t all roses.
Our initial implementation relied on scanning Bigtable. This was a mistake.
First, a quick primer on Bigtable:
Bigtable is essentially a giant, sorted, three-dimensional map. The first dimension is the row key. You can look up any row by its row key very quickly. You can also scan rows in lexicographic order quickly, starting and ending the scan at any given place. One caveat: you can only scan in one direction. Which is annoying. But ho hum.
The second dimension is the columns within a row. Columns are keyed by a string and can be segmented into column families. From what we have seen in our use of column families, they are not much more than a prefix on your column key. You can filter by individual columns or by column families to return only the columns you are interested in.
Finally, every column contains a set of cells. Cells hold your values, which are just bytes. This is the third dimension of our 3D map. Cells are indexed by a timestamp in milliseconds. You can filter your request to ask for the latest cell or a custom range of cells.
So any particular value in Bigtable is identified by its row key, column key (including column family) and cell timestamp.
Congratulations, you are now a Bigtable expert!
We made use of the fact that you can do fast, ordered scans across a range of row keys. We constructed a key such that all the data we would want for the feature-extraction process in our fraud API lived under a particular key prefix. Then we would scan over that prefix at feature-extraction time.
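The semantics of that prefix scan can be sketched over a sorted slice of keys: seek to the first key at or after the prefix, then walk forward while keys still carry it. The key shapes below are illustrative, not our production schema; the real client does this server-side (e.g. via `bigtable.PrefixRange` in the Go library).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// prefixScan returns all keys sharing prefix, assuming sortedKeys is
// sorted ascending -- the same guarantee Bigtable gives on row keys.
func prefixScan(sortedKeys []string, prefix string) []string {
	// Binary-search to the first key >= prefix, then walk forward.
	start := sort.SearchStrings(sortedKeys, prefix)
	var out []string
	for i := start; i < len(sortedKeys); i++ {
		if !strings.HasPrefix(sortedKeys[i], prefix) {
			break // keys are sorted, so the prefix range has ended
		}
		out = append(out, sortedKeys[i])
	}
	return out
}

func main() {
	// Hypothetical row keys: all of one customer's entities share a prefix.
	keys := []string{
		"customer#41#order#0009",
		"customer#42#order#0001",
		"customer#42#order#0002",
		"customer#43#order#0001",
	}
	fmt.Println(prefixScan(keys, "customer#42#"))
}
```

Because the seek is cheap and the walk only touches matching rows, a prefix scan reads exactly one contiguous slice of the keyspace.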
Scan performance was… variable. It could be reasonably quick (sub-50ms), but our 99th-percentile timings were not great. The increase was often proportional to the number of rows in the scan, but sometimes it wasn’t: even smaller numbers of rows could be slower than we would like, on the order of hundreds of milliseconds.
To solve this, we changed the way we store our data. Instead of spreading it across multiple rows, we blob the data together and store it in one row. Each entity type is stored under a single column, with each individual entity in its own cell, in time order. This improved our 95th-percentile latency by about 3x. We had to use some tricks to ensure we didn’t end up with overly large rows. Luckily, those sorts of rows were by far the exception rather than the rule.
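A minimal sketch of that blobbed layout, under assumptions: entities are JSON-encoded (our actual serialization format isn't stated here), and the column name is made up. The point is the shape: one row per customer, one column per entity type, one cell per entity, so feature extraction becomes a single row read instead of a multi-row scan.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// cell mimics a Bigtable cell: a millisecond timestamp plus opaque bytes.
type cell struct {
	TSMillis int64
	Value    []byte
}

// blobbedRow maps a column (one per entity type) to its cells.
type blobbedRow map[string][]cell

// appendEntity encodes an entity into its own cell under the column
// for its type, keeping the cells sorted by timestamp.
func (r blobbedRow) appendEntity(col string, tsMillis int64, entity interface{}) error {
	b, err := json.Marshal(entity)
	if err != nil {
		return err
	}
	cells := append(r[col], cell{TSMillis: tsMillis, Value: b})
	sort.Slice(cells, func(i, j int) bool { return cells[i].TSMillis < cells[j].TSMillis })
	r[col] = cells
	return nil
}

type order struct {
	ID     string  `json:"id"`
	Amount float64 `json:"amount"`
}

func main() {
	r := blobbedRow{}
	// All of a customer's orders live in one row under one column.
	r.appendEntity("blobs:order", 2000, order{ID: "o2", Amount: 12.5})
	r.appendEntity("blobs:order", 1000, order{ID: "o1", Amount: 9.99})
	for _, c := range r["blobs:order"] {
		fmt.Println(c.TSMillis, string(c.Value))
	}
}
```

The trade-off is row size: every entity for a customer lands in one row, which is why we needed tricks to cap the outliers.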
While this was better, it wasn’t the panacea we hoped for.
This is true not just of Bigtable but of any managed service. When they go wrong, they can be the devil to debug. You don’t have access to the service directly; you are at the whims of whatever tooling your provider has given you. In a perfect world, this would be fine. But nothing is perfect and things do go wrong. It is a trade-off you have to make when using managed services. You will have problems that you are powerless to solve or even understand.
Tooling, monitoring and support are some of the most important features of any managed service. They must be a core part of your decision making process.
In our case, these problems manifested as periodic, short-lived unavailability.
Once or twice a week, for a period of less than a minute (often less than 10 seconds), a portion of requests to Bigtable (typically less than 1%) would fail or become extremely slow. During this time, retries would also fail, meaning we would end up returning a 500 to our client. While 1% isn’t a total outage, it does equate to a significant absolute number of requests when you are at high throughput.
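The back-of-envelope arithmetic makes the point. The 5,000 requests-per-second figure below is illustrative, not our actual traffic:

```go
package main

import "fmt"

// failedRequests estimates how many requests fail during a partial
// outage: throughput x error rate x duration of the blip.
func failedRequests(reqPerSec, errorRate, durationSec float64) float64 {
	return reqPerSec * errorRate * durationSec
}

func main() {
	// A 1% failure rate for just 10 seconds at 5,000 req/s:
	fmt.Println(failedRequests(5000, 0.01, 10)) // 500 failed requests
}
```

Five hundred hard failures in ten seconds is very visible to clients, even though the service is "99% available" for the duration.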
So we dug in to investigate.
When we first started using Bigtable, the monitoring and tooling weren’t great. We used the Datadog integration, which gave us a decent number of metrics, but it was still challenging to get some of the metrics we wanted. The Bigtable dashboard in the cloud console was fairly bare-bones, giving you only a small number of metrics and missing any latency metrics. It was not up to the task of debugging this issue.
You do not get any logs out of Bigtable, and no visibility into cluster operations. There is no way to see if an instance was lost, no indication of internal operations such as rebalancing, no error logs or logs of any kind from the cluster. This is a huge sore point. If we could see some of this, we could at the very least rule out possible causes.
We dug into the client. We forked the Go library to add instrumentation everywhere we could. While this gave us great insight into the Bigtable interface and into gRPC in general, it did not give us any indication as to why we saw these occasional issues.
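Conceptually, this kind of instrumentation wraps every call and records how long it took and whether it failed. A minimal stdlib sketch of that idea (the real work used gRPC interceptors threaded through the forked client, and the operation names below are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// timings collects per-operation durations; a simplified stand-in for
// the metrics pipeline we wired into the forked client.
var timings = map[string][]time.Duration{}

// timeCall wraps an operation, recording its duration and passing
// through any error, so slow or failing calls show up in one place.
func timeCall(name string, fn func() error) error {
	start := time.Now()
	err := fn()
	timings[name] = append(timings[name], time.Since(start))
	return err
}

func main() {
	_ = timeCall("ReadRows", func() error {
		time.Sleep(5 * time.Millisecond) // pretend to talk to Bigtable
		return nil
	})
	fmt.Println(len(timings["ReadRows"]), "sample(s) recorded")
}
```

Wrapping at this layer tells you *that* a call was slow, but nothing about *why*, which is exactly the wall we hit.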
Over the past year the tooling has gotten much better. The Key Visualizer is fantastic. Anyone who hasn’t looked at it and is using Bigtable should stop reading and check it out right away. It makes it very clear if you have a hot key or large rows. Unfortunately for us, we did not have something as easy to fix as a hot key or large row.
The console dashboard has improved as well though it could still use work. Oddly, it still does not show response times. For us, this is the most important metric and it is hidden away in Stackdriver.
After much investigation, we are still unsure of the root cause.
So, in our desperation, we just started to make guesses.
- The cluster is “rebalancing”. Sounds like a real thing right?
- We lost a node. Maybe? No way to tell in the tooling they give us.
- There was an on-host migration. This happens to our compute instances, and we have seen it cause small periods of unavailability, so… could be?
- Network. It is always networking. The ephemeral wisp that is a networking error. The ever-present cry of the developer who has no idea what is going on… It’s gotta be the network. (It’s almost never the network.)
- Wait, what about authentication! It has to renew the token at some point and there is a brief period where we fail? Like a race or something. Could be right? Probably not…
- So it’s networking then.
- Or no, rebalancing because it seems to happen slightly more often after we do a cluster resize
- But we didn’t do a cluster resize last time the issue struck
- Plus the resize before that was smooth!
- We probably lost a node then
At the end of the day, we had all developed a sort of Bigtable religion: a pseudoscience about why these unavailability events would happen.