Thursday, September 6, 2018

Every time something broke and people got together to talk about it in a previous life, inevitably a question would come up: could we have caught this with a test, or with testing, or some other before-it-goes-to-the-world step? Many times, the question was just rote, since the answer was frequently "yes" or similar.

These are the easy ones. These are problems like the one where someone sets a web UI element to disabled and nothing else comes along to enable it. There's no way to click on the greyed-out button, so the feature dies for everyone at once. That's obviously wrong, and a real-browser-ish test harness (or some poor human tester) that actually tried to click the button would have tripped on it.

So, okay, you say, ship it to a consistent 1% of the users, and see if their numbers change appreciably. This is pretty decent. If something's bad, sure, it'll break for them, but the rest of the world will never know. Then you can roll it back or work around the problem or *gulp* fix forward and keep on trucking.

This, too, is relatively easy. Once you know that you possess the power of consistently hashing users and selectively routing them to different versions of the site, you can exploit it to do your test bidding. You can also rig your logging to note which group a user is in when a feature or product gets triggered, and use that as a "group by" in your analysis queries.
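Here's a minimal sketch of that kind of consistent hashing, just to make it concrete. Everything in it is made up for illustration: the function names, the "feature-x" salt, and the choice of 100 buckets are assumptions, not anything from a real system.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "feature-x", buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets).

    The salt is a hypothetical per-experiment string, so that different
    experiments don't keep picking on the same unlucky users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % buckets


def in_test_group(user_id: str, percent: int = 1, salt: str = "feature-x") -> bool:
    """True if this user falls in the first `percent` of 100 buckets."""
    return rollout_bucket(user_id, salt) < percent


def log_fields(user_id: str, percent: int = 1) -> dict:
    """Fields to attach to a log line so analysis can group by rollout group."""
    group = "test" if in_test_group(user_id, percent) else "control"
    return {"user_id": user_id, "rollout_group": group}


print(log_fields("user-12345"))
```

The same user always lands in the same bucket, which is the whole point: the 1% stays the same 1% from request to request, and the "rollout_group" field gives you something to group by later.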

The really hard ones are the ones which come down to scaling issues, and which do not appear during a small-scale rollout. I'll try to describe some hypothetical situations which might work this way.

Let's say you have a tier of 100,000 web servers. Every one of them opens one connection to your database. That connection is shared with all of the concurrent hits/code executing on those web servers. It just gets multiplexed down the pipe and off it goes.

Then, one fine day, someone decides to write a change that makes the web servers open four connections to the database. They've invented this new "pooling" strategy, such that requests grab the least-busy connection instead of always sharing the single one. It's supposed to help latency by n% (and get them a promotion, but let's not get into that now).

It's rolled into the next build of the site, and the push starts. It goes to 1% of the web servers, and those 1,000 machines restart into the new code. They establish four connections each, so there are now 4,000 connections from them to the database. The other 99,000 machines are still using one connection each, so the total connection load on the database is approximately 103,000 connections, just 3,000 higher than before. This is about 3% higher, and it probably goes unnoticed. That much variance happens organically as web server machines come and go for ordinary repairs and maintenance.

Things look fine, so the push continues. This time, it's going everywhere else: the other 99% of web servers. It gets cranking, and things start getting interesting. Every web server that gets upgraded comes back up and opens four connections. As the push drags on, the connection load on that database server grows and grows.

By the time it's done, assuming it survives, the database server now has 400,000 connections to it, which is four times the prior load. It's quite likely that something has blown up well before this point, and the site is now completely dead, given that it has no database.
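The arithmetic in this story is simple enough to write down. This is only a sketch of the hypothetical scenario above, using the numbers from the example (100,000 hosts, one connection before, four after); the function name and the intermediate rollout fractions are my own.

```python
def total_connections(fleet_size: int, fraction_upgraded: float,
                      old_per_host: int = 1, new_per_host: int = 4) -> int:
    """Connections the database sees when part of the fleet runs the new code."""
    upgraded = round(fleet_size * fraction_upgraded)
    not_upgraded = fleet_size - upgraded
    return upgraded * new_per_host + not_upgraded * old_per_host


FLEET = 100_000
baseline = total_connections(FLEET, 0.0)

for fraction in (0.01, 0.25, 0.50, 1.00):
    total = total_connections(FLEET, fraction)
    increase = (total - baseline) / baseline
    print(f"{fraction:>4.0%} upgraded: {total:>7,} connections ({increase:+.0%})")
```

The 1% line is the 103,000 from earlier, and the 100% line is the 400,000. The growth is linear in the fraction upgraded, so nothing about the 1% stage looks alarming unless you multiply it out to the whole fleet.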

Sure, you roll it back, and eventually figure it out, but in the meantime, millions of cat pictures have gone unserved, and people are demanding to know what you will do about it next time.

This one is truly difficult. To have any hope of doing something about it, you have to be keenly aware of what the nominal resource demands of your web servers may be. Then you also have to somehow be able to measure it on the fly and detect anomalous changes as they happen.
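One way to approach it, sketched below under some big assumptions, is to compare per-host usage between the old and new code during the small rollout, and then project that per-host number out to the whole fleet. The function name, the 1.5x ratio threshold, and the 250,000 backend limit are all invented for illustration. You would need to actually know your own limits, which is a good part of the difficulty.

```python
def check_canary(baseline_per_host: float, canary_per_host: float,
                 fleet_size: int, backend_limit: float,
                 max_ratio: float = 1.5) -> list[str]:
    """Return a list of warnings for one resource; an empty list means no complaint.

    Assumes baseline_per_host is positive. backend_limit and max_ratio
    are hypothetical numbers you would have to supply from real knowledge
    of the backend.
    """
    warnings = []

    # Per-host comparison: a 4x jump is obvious here even when the
    # fleet-wide total has only moved by a few percent.
    ratio = canary_per_host / baseline_per_host
    if ratio > max_ratio:
        warnings.append(f"per-host usage is {ratio:.1f}x the baseline")

    # Projection: what happens if every host behaves like the canaries?
    projected = fleet_size * canary_per_host
    if projected > backend_limit:
        warnings.append(
            f"projected fleet-wide usage {projected:,.0f} exceeds limit {backend_limit:,.0f}"
        )

    return warnings


for line in check_canary(baseline_per_host=1, canary_per_host=4,
                         fleet_size=100_000, backend_limit=250_000):
    print("WARNING:", line)
```

Note that this only helps for resources you already thought to measure.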

The really hard part is getting past the point where you're always "fighting the last war". That is, this outage had something to do with database connection counts, so, okay, you'll watch for those going up too fast, right? But that's the last war. The next one might involve something completely different with some other consumable that you don't even know about.

Also, keep in mind the example given here is a very simple one: something goes from 1-per to 4-per. What if it's more insidious, and involves a range of values that suddenly starts trending higher? Maybe the machines usually used between 5 and 10 of something, and then started using between 7 and 20 of something. It's subtle at first, but when subject to multiplicative effects, life quickly gets interesting.
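To put rough numbers on that, here's a tiny sketch that multiplies those ranges across a fleet. The 100,000-host fleet is borrowed from the earlier example and is only an assumption here, as is the function name.

```python
def fleet_range(fleet_size: int, low: int, high: int) -> tuple[int, int]:
    """Best-case and worst-case fleet-wide usage if every host sits at one end of its range."""
    return fleet_size * low, fleet_size * high


FLEET = 100_000  # assumed fleet size, reused from the earlier example

before_low, before_high = fleet_range(FLEET, 5, 10)
after_low, after_high = fleet_range(FLEET, 7, 20)

print(f"before: {before_low:,} to {before_high:,}")
print(f"after:  {after_low:,} to {after_high:,}")
print(f"worst case grew by {after_high / before_high:.1f}x")
```

The low end only moves from 5 to 7 per host, which is easy to shrug off, but the worst case across the fleet doubles.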

Basically, given a large enough fleet of machines, you end up with your own private "botnet" that can be used to destroy arbitrary backend systems. There's frequently a "cliff", such that you can do whatever you want up to that point, but once you go too far, the whole thing falls over.

Finally, there's the non-technical angle to this, particularly when organizations grow to be large and there's no longer the "we're all on the same team" ethos. This is where someone on the receiving end of the onslaught notices the problem (increasing resource utilization) and calls it out, and the other side unapologetically keeps going. After all, they have a deadline to make (and a promotion to get), and you're "just being annoying".

In those cases, you can scream bloody murder that you're running out of resources due to their change, but unless you have some actual teeth and a way to enforce the agreed-upon limits, just what do you think is going to happen?

To further complicate matters, imagine that you don't notice it right away, but instead a week or two into their little "experiment", while doing something else entirely. What are the odds you'll be able to dial it back at that point?

And to think this all started from the realm of "can this be tested"...