Back in April of 2017 we kept hitting these delays with various parts of the pipeline, where customers would see their data being delayed for 20-30 minutes, while either our current queuing setup would block up with a single customer’s data, or particular destinations wouldn’t scale appropriately, as Alex was just talking about.
At that point, we said, okay, we need a bigger overhaul to the way that we actually deliver data outbound, one that rethinks a bunch of the primitives we’d built over the past two years around these individual queues per destination, and that should hopefully help us scale for the upcoming 3-5 years, as we 10x or 100x our volume.
And once we kind of acknowledged this was a problem, Rick Branson, who Alex has talked about a bunch, spearheaded the effort to actually architect the system that he called Centrifuge. Centrifuge effectively replaces the single queues that we have (one queue for Google Analytics, one queue for Mixpanel, one queue for Intercom) with what you can think of as almost being virtualized queues, or individual sets of queues on a per-customer, per-destination basis.
So we might have one queue for Google Analytics, which has all of Instacart’s data, but another one with all of New Relic’s data, and maybe another one with Fender’s data. This system - honestly, we hadn’t seen any really good prior art for.
[00:45:04.06] I think network flows are about the closest that you’d get to it, but those give you back pressure in terms of being able to say “Hey, there’s too much data here. Stop sending from the very TCP source that you have”, which is something that we can’t exactly enforce on our customers.
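To make the per-customer, per-destination queue idea concrete, here is a minimal in-memory sketch. It is not Centrifuge's actual implementation: the `VirtualQueues` class, its method names, and the round-robin draining policy are all assumptions made for illustration. The point it shows is that a backlog for one customer does not sit in front of another customer's data for the same destination.

```python
from collections import defaultdict, deque


class VirtualQueues:
    """Hypothetical sketch: one FIFO per (customer, destination) pair."""

    def __init__(self):
        self._queues = defaultdict(deque)
        # Keys that currently have pending messages, in service order.
        self._order = deque()

    def enqueue(self, customer, destination, message):
        key = (customer, destination)
        if not self._queues[key]:
            # First pending message for this pair: schedule the pair.
            self._order.append(key)
        self._queues[key].append(message)

    def next_message(self):
        """Return (key, message) from the next pair in round-robin order."""
        if not self._order:
            return None
        key = self._order.popleft()
        queue = self._queues[key]
        message = queue.popleft()
        if queue:
            # Still has work: go to the back of the line.
            self._order.append(key)
        else:
            del self._queues[key]
        return key, message


vq = VirtualQueues()
for i in range(3):
    vq.enqueue("instacart", "google-analytics", f"a{i}")
vq.enqueue("fender", "google-analytics", "b0")

# Fender's single message is served second, not after Instacart's backlog.
served = [vq.next_message() for _ in range(4)]
```

Round-robin is just the simplest fair policy to show here; a real system would also have to persist these queues and weigh retries, which this sketch ignores.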
So with this design in hand for Centrifuge, we started out on what actually turned into about a nine-month journey to roll Centrifuge out into production. Centrifuge was responsible for all of the message delivery, the retries, and the archiving of any data which wasn’t delivered. Then separately, Centrifuge would make requests into this new integrations monoservice, which you could think of as being this intelligent proxy which would take this raw data in and, depending on where it’s going, make multiple requests to a third-party endpoint.
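The delivery, retry, and archive responsibilities described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Centrifuge is written: the function name `deliver_with_retries`, the `send` and `archive` callables, the attempt limit, and the exponential backoff are all hypothetical.

```python
import time


def deliver_with_retries(message, send, archive, max_attempts=3,
                         base_delay=0.0, sleep=time.sleep):
    """Try to deliver a message; archive it if every attempt fails.

    Returns the attempt number that succeeded, or None if archived.
    All parameters are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return attempt
        except Exception:
            if attempt < max_attempts:
                # Exponential backoff between attempts (assumed policy).
                sleep(base_delay * 2 ** (attempt - 1))
    # Nothing got through: keep the data rather than dropping it.
    archive(message)
    return None


# A stand-in destination that fails twice, then accepts the message.
calls = {"count": 0}

def flaky_send(message):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("destination unavailable")

archived = []
ok_attempt = deliver_with_retries("event-1", flaky_send, archived.append)

def always_fail(message):
    raise ConnectionError("destination unavailable")

failed = deliver_with_retries("event-2", always_fail, archived.append)
```

Here `send` stands in for the request into the integrations monoservice, and `archive` stands in for whatever storage holds undelivered data.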
And for the rollout process there, like I said, we spent maybe a month or so designing it, then we began to actually consolidate the repo and move it in to be a single monoservice. We started building out the bones of Centrifuge for another three or four months or so, and we started cutting over our first traffic after about a five-month period.
Now, when we started cutting over traffic, we had this interesting problem: if we’re sending traffic via two pipelines, and we have to test it end-to-end in whatever destination tool, then if we just mirror traffic and let both go, we’ll end up with double counts in Google Analytics or double counts in Mixpanel. So we actually added a kind of serialization point in the middle that both the set of microservices and the monolith would talk to, and effectively we’d do kind of a first-write-wins type of scenario, where it creates some locks in Redis, and then only one copy of each message would succeed, through either pipeline.
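The first-write-wins idea can be sketched like this. To keep the example self-contained, an in-memory dictionary stands in for Redis; in Redis itself, the same atomic "claim it only if nobody has" step is what `SET` with the `NX` option provides. The class and function names, and the use of a message ID as the lock key, are assumptions for illustration rather than details from the actual rollout.

```python
import threading


class FirstWriteWins:
    """Hypothetical stand-in for the Redis-backed serialization point."""

    def __init__(self):
        self._claims = {}
        self._lock = threading.Lock()

    def try_claim(self, message_id, pipeline):
        """Atomically claim a message; only the first caller succeeds."""
        with self._lock:
            if message_id in self._claims:
                return False
            self._claims[message_id] = pipeline
            return True


def deliver(gate, message_id, pipeline, send):
    """Send the message only if this pipeline won the claim."""
    if gate.try_claim(message_id, pipeline):
        send(message_id)
        return True
    return False


gate = FirstWriteWins()
sent = []

# Both pipelines see the mirrored message; only the first one delivers it.
first = deliver(gate, "msg-1", "centrifuge", sent.append)
second = deliver(gate, "msg-1", "legacy", sent.append)
```

With this in place, the destination sees `msg-1` once no matter which pipeline gets there first, which is what avoids the double counts. A real version would also expire old claims so the lock set doesn't grow forever.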
We basically slowly ramped traffic in that manner, always checking the end-to-end metrics on it, always making sure that no matter which pipeline we were using, the delivery rates looked perfectly good. I’m sure actually Alex can talk to more of that rollout period, because it was definitely a little bit rocky in terms of how we rolled out the system… But about 2-3 months after that, we had fully tested all the scaling, cut over 100%, and we were feeling much better about the system’s stability. Looking at it today, it’s actually a very rock-solid and well-utilized piece of infrastructure.