Observability — A 3-Year Retrospective

A summary of the observability movement over the past three years.
Aug 6th, 2019 3:00am by Charity Majors
Feature image via Pixabay.
Observability, the development approach or, more to the point, the “movement,” is about three years old, and Charity Majors, one of the early pioneers in the field, has decided to take a step back, zoom out, and “observe” how far it has come. In this article, she takes a closer look at why it formed and why other approaches and methods fall short, citing important contributors along the way. She explains why adopting observability is critical for any engineering team building and maintaining complex, distributed systems.

Charity Majors
Charity Majors is the co-founder and CTO of Honeycomb.io, provider of tools for engineering and DevOps teams to debug production systems faster and smarter. Previously Charity ran infrastructure at Parse and was an engineering manager at Facebook, where she ran next-generation distributed systems at scale. After leaving Facebook, Charity and her co-founder started Honeycomb to help engineering teams ship code faster and more safely with observability tooling natively designed for this new era of chaotic, ephemeral, loosely coupled distributed systems. Charity is the co-author of Database Reliability Engineering (O'Reilly) and is devoted to a world where every engineer is on call and nobody thinks on call sucks.

Like so many other terms in software engineering, “observability” is a term borrowed from an older physical discipline: in this case, control systems engineering. Observability is the mathematical dual of controllability.

“Less formally, this means that one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output sensors.”

Before we can understand why observability is a meaningful technical term and not just a product marketing term, we need to understand some things about monitoring, metrics, events, and a brief history of how we have tried to make sense of our systems. We need to understand how the world has changed to understand why this, why now.

Ever since there have been terminals and high-level computer languages, there have been logs: strings output to a human-readable device to help humans understand what was happening. Next came the early text-based system utilities (sysstat, sar, iostat, netstat, mpstat, etc.), which are still the best way to understand single-node performance. Then in 1988 SNMPv1 was born, the first metrics-based telemetry system (that I know of).

Metrics do not equal observability.

Metrics have ruled the day since the ’80s: snmp, rrdtool, cacti, Ganglia, Etsy’s monumentally influential statsd. Their modern successors include SignalFX, DataDog, and Wavefront, among others. The user experience has come a long way, but all of these tools are built atop the metric as the unit of work: a single number, with tags appended so you can slice and dice, group, and locate various metrics and their sources.

A request making its way through your code might emit dozens or even hundreds of metrics before it’s finished: gauges, counters, and other numbers that represent details like CPU load, resident memory size, and I/O stats, tagged with information like build ID, AWS region, and so forth. Metrics typically get aggregated at write time and lose granularity as they age out, which makes them very efficient to collect and store. Metrics are still the dominant way to understand how your infrastructure as a whole is behaving, and probably always will be. But don’t miss that note about losing granularity; it’s important, and we’ll come back to it. Metrics do not equal observability.
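
To make “the metric as the unit of work” concrete, here is a minimal sketch of emitting a tagged counter and gauge in a statsd/DogStatsD-style wire format over UDP. The agent address, metric names, and tags are illustrative assumptions, not anything prescribed by the tools mentioned above.

```go
package main

import (
	"fmt"
	"net"
)

// emit sends one metric line to a statsd-compatible agent over UDP.
// DogStatsD-style format: name:value|type|#tag1:val1,tag2:val2
func emit(conn net.Conn, line string) error {
	_, err := fmt.Fprint(conn, line)
	return err
}

func main() {
	// Address of a local statsd/DogStatsD agent (illustrative).
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Each metric is just a number plus a handful of low-cardinality tags.
	emit(conn, "api.requests:1|c|#build_id:1234,region:us-east-1,endpoint:/login")
	emit(conn, "api.latency_ms:87|g|#build_id:1234,region:us-east-1,endpoint:/login")

	// The agent aggregates these at write time; the individual request
	// that produced them is not recoverable afterward.
}
```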

The ‘00s Spawned a New Vendor Breed: APM

About 10 years ago, a crop of new providers emerged under the APM (Application Performance Management) umbrella. NewRelic, AppDynamics, and others billed themselves as a better way to understand your application code. Instead of relying on an external agent, you installed their libraries in your code, which would then track requests and report on language internals as well as request specifics. They usefully generated lots of top-10 lists so you could understand where your performance problems were coming from: by endpoint, by query, and so forth.

These tools were a major step forward. They were still mostly metrics-based under the hood, but the shift in perspective from third-party observer to first-person observer allowed for far greater introspection of your software and its behavior.

Tools have come a long way. Yet just five short years ago, when I was at Parse (subsequently acquired by Facebook), grappling with a platform that was going down constantly and suffering from unpredictable co-tenancy problems, I tried all of these tools and more, and none of them helped us resolve our performance and reliability problems. Let me repeat: none of them did what they claimed to do. This isn’t because they lied or misrepresented themselves; it’s because the kinds of systems we were building were fundamentally different from the systems those tools were developed to understand. Parse was an early adopter of a lot of trends which are still relatively cutting edge, and more and more people are now experiencing the consternation and frustration that I did during that time. These older tools, once revolutionary, simply no longer work for our current systems.

Cardinality and Its Relation to Complex Distributed Systems

To fully grasp the “why,” first you need to understand how the systems we are building today are different (and why), and core to that is understanding something called cardinality.

Cardinality refers to the number of unique items in a set. A unique ID will always have the highest possible cardinality, and a single repeated value will always have the lowest. If you had a collection of a hundred million user records, you could guess confidently that Social Security numbers would have the highest possible cardinality; first name and last name would be high-cardinality, though lower (because some names repeat); gender would be fairly low-cardinality; and “Species: human” would, presumably, be the lowest possible cardinality, should you actually bother to record it.
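
As a rough sketch of what cardinality means in practice, the snippet below counts distinct values per column for a tiny, made-up user table; the field names and data are invented for illustration.

```go
package main

import "fmt"

// cardinality returns the number of distinct values in a column.
func cardinality(values []string) int {
	seen := make(map[string]struct{})
	for _, v := range values {
		seen[v] = struct{}{}
	}
	return len(seen)
}

func main() {
	// A tiny, made-up user table, flattened into columns.
	userIDs := []string{"u1", "u2", "u3", "u4", "u5"}           // unique per record
	firstNames := []string{"Ada", "Ada", "Grace", "Ada", "Mae"} // repeats, but still highish
	species := []string{"human", "human", "human", "human", "human"}

	fmt.Println("user_id:   ", cardinality(userIDs))    // 5 -> highest possible
	fmt.Println("first_name:", cardinality(firstNames)) // 3 -> high, but lower
	fmt.Println("species:   ", cardinality(species))    // 1 -> lowest possible
}
```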

Without Access to High Cardinality Data, Good Luck Debugging

Why does this matter? Because high-cardinality information is the most useful data for debugging or understanding a system: think user IDs, shopping cart IDs, request IDs … basically any ID, plus instance, container, build number, span ID, and so on. Unique IDs will always do the best job of identifying individual needles in a given haystack.

Yet metrics-based tools can only deal with low-cardinality dimensions at scale. If you have even a few hundred hosts, you can’t put the hostname in as an identifying tag, or you’ll blow out your cardinality keyspace. Likewise, for every question you want to ask of a metrics-based tool, you have to decide to ask it in advance so the answer can be recorded at write time. This means that a) if you want to ask any new question after the fact, you can’t, and b) cost goes up linearly with the number of questions you want to be able to ask.
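
A back-of-the-envelope sketch of why that keyspace blows out: the number of distinct time series a metrics system must store is roughly the product of each tag’s cardinality, so one high-cardinality tag multiplies the total. The counts below are made up for illustration.

```go
package main

import "fmt"

func main() {
	// Made-up cardinalities for each tag on a single metric.
	hosts := 300       // hostname
	endpoints := 50    // endpoint
	statusCodes := 10  // status_code
	userIDs := 1000000 // user_id -- the tag you actually want when debugging

	lowCardSeries := hosts * endpoints * statusCodes
	withUserID := lowCardSeries * userIDs

	fmt.Println("series without user_id:", lowCardSeries) // 150000: already a lot
	fmt.Println("series with user_id:   ", withUserID)    // 150000000000: not happening
}
```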

For a long time, this didn’t matter much, because high-cardinality dimensions were pretty rare. With a typical monolithic system, you had a single app tier and one database. All the interesting logic was hidden inside the application code, where you could step through it with a debugger like gdb. When faced with an issue to troubleshoot, you could look at your dashboards and intuit instantly which of the three, four, or five monolithic components was at fault.

This is increasingly untrue. Look at all the infra/architecture trends of the past five-plus years: containers, schedulers, microservices, polyglot persistence, mesh routing, ephemeral auto-scaling instances, serverless, lambda functions. A request may do 20-30 hops after your edge, or a multiple or two of that if you count database queries. Suddenly, the hardest thing about debugging is no longer understanding how the code runs but finding where in your system the problematic code lives. You usually can’t just look at a dashboard or service map and see which node or service or component is slow, because the system loops back into itself: when anything gets slow, EVERYTHING gets slow. Today’s cloud native environment is essentially a platform, meaning the code and queries running “inside” it may not even be under a single team’s control.

When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: it now hops the network.  Our tools are still coming to grips with this seismic shift.

Back at Parse, a user might complain: “Parse is down!” Our monitoring tools would clearly show that Parse was not down. So what was the user complaining about? Well, we would dispatch someone to investigate, but the cause was rarely clear. Developers were able to upload their own code and queries, and we had to make them work on hardware shared with hundreds of thousands of other neighboring apps. So the problem could be a) a code change or query of theirs, b) a code change or query of ours, c) some intersection of the two, d) some code change or query of a user sharing resources with the complaining user, e) some code change or query we made which affected a user sharing resources with them, or f) any intersection of the above. Oh, and we had over a million apps, we shipped code constantly which affected all of them, and each of them had its own entire ecosystem of users and shipped its own code which affected its own users constantly.

Good times.

Monitoring tools are effective for systems with a stable set of known-unknowns and relatively few unknown-unknowns. For a system with predominantly unknown-unknowns, monitoring tools were all but useless. We literally had to debug these problems by hand, painfully and slowly. It would often take a day or more to track down a user’s complaint, or to decide whether it was actually on “their” side. The solution that finally saved our asses was Facebook’s Scuba, which we started to use once Parse was acquired. We started feeding datasets into Scuba and were able to slice and dice data by ad hoc dimensions: by userID, then endpoint, then query, and so on. This dropped the time it took us to identify a problem from a day or more to … seconds, usually, or a small number of minutes.

This experience made a deep impression on me, though I had no words to describe it at the time. It wasn’t until I stumbled across the term “observability”, and looked up its origin, that I realized how much it had to teach us about building understandable software systems.

What Observability Looks Like in Practice

With that trip down memory lane, let’s revisit the definition:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

Less formally, this means that one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output sensors.

When you’re flipping through a bunch of dashboards trying to figure out what’s happening, you aren’t inspecting what’s happening and reasoning about it, or following a trail of meaningful breadcrumbs. You’re jumping straight to the end — a guess. It’s as though the entire system were still a big black box, and you had no information to reason about what happened in what sequence. Your mental process probably looks something like this:

I see a huge spike in errors at the edge at 2 p.m. today. That’s unexpected and is probably bad. It looks like it correlates with a spike in errors to this particular DB cluster. Last time this happened, it was because someone was running a bad long-running write query, which held a contentious lock and then replicated the same write to the followers. I am going to check for a long-running query and growing queue length around that time to validate my hypothesis. Cool, I found one — that confirms my theory.

Notice how much it relies on intuitive leaps and your mental library of past outages. Ideally, rather than jumping immediately to possible solutions, you would start out at the top with an open mind (“what happened?”) and systematically follow the data-driven breadcrumbs to a verifiable solution, whatever it might be. Like this:

I see a huge spike in errors at the edge at 2 p.m. today. Let’s explore. I will break down by replica set (or endpoint, or user, or literally anything, but let’s say replica set for now). Okay, there’s a huge spike in errors to replica set 1, and a much smaller spike in errors to two other replica sets, 4 and 5. Interesting. The queue lengths appear to steadily climb through the length of the spike on replica set 1, and bounce up and down through the duration on 4, and just a short spike on 5. There is a long-running query on rs1, and not the other two. I wonder if it’s a transient problem with EBS/IOPS — so I will break down by replica set node and availability zone, which shows me that they are on different AZs. Cool, ruled that one out. Is it a migration? No, the build_id didn’t change. Let me sum up the lock time being held and break down by user_id and sort to see who is holding the lock and for which query — AH! this is coming from a background expiration job from cron! It lasted longer on RS1 than RS4/5 because there was more older data there. Let’s rework it so it doesn’t have to contend for the lock in this way.

I’m describing the long, painful, manual way of narrowing down your hypotheses step by step. Debugging this way involves lots of small, verifiable hypotheses, one after another, like breadcrumbs. It works only because I can break down by every dimension, including ones with very high cardinality. It works only because I have done no pre-aggregation before writing records to disk; in fact, every query I issue aggregates at read time to answer my question. It works only because I have gathered the data at the right level of abstraction, oriented around the request and its units of work, because the request is what maps to the user’s real lived experience.
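
Here is a minimal sketch of what “break down by any dimension, aggregate at read time” looks like over raw, unaggregated events. The event fields and data are invented for illustration; this is not any particular vendor’s query engine.

```go
package main

import "fmt"

// Event is an arbitrarily wide, schema-less record of one unit of work.
type Event map[string]interface{}

// breakdownErrors groups raw events by any dimension, at read time,
// and counts errors per group. Nothing was pre-aggregated at write time.
func breakdownErrors(events []Event, dimension string) map[string]int {
	counts := make(map[string]int)
	for _, e := range events {
		if e["error"] == true {
			key, _ := e[dimension].(string)
			counts[key]++
		}
	}
	return counts
}

func main() {
	events := []Event{
		{"replica_set": "rs1", "user_id": "u42", "query": "expire_job", "error": true},
		{"replica_set": "rs1", "user_id": "u42", "query": "expire_job", "error": true},
		{"replica_set": "rs4", "user_id": "u7", "query": "find_user", "error": true},
		{"replica_set": "rs5", "user_id": "u9", "query": "find_user", "error": false},
	}

	// The same raw events answer every question; only the grouping changes.
	fmt.Println(breakdownErrors(events, "replica_set")) // map[rs1:2 rs4:1]
	fmt.Println(breakdownErrors(events, "user_id"))     // map[u42:2 u7:1]
	fmt.Println(breakdownErrors(events, "query"))       // map[expire_job:2 find_user:1]
}
```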

Cardinality limitations aren’t the only problem with metrics-based systems, either. A metric is a single number with tags appended to it. A request might fire off a dozen or a hundred metrics as it executes … but those metrics are all disconnected from each other and can never be reconnected again, unlike an arbitrarily wide event which knits together all the details and context for that request. A lot of debugging consists of looking at an anomalous spike or some other shape, then figuring out what characteristic(s) or outliers the errors all have in common. You cannot do this if you’ve already stripped away the connective tissue of the event; then all you can do is guess. That’s not debugging, that’s magic.
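
To illustrate the contrast, the sketch below reports the same hypothetical request twice: once as a handful of disconnected metrics, and once as a single wide event that keeps all of its context knitted together. The field names and values are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Option A: disconnected metrics. Once emitted, nothing ties these
	// numbers back to the same request; the connective tissue is gone.
	metrics := []string{
		"api.requests:1|c|#endpoint:/checkout",
		"api.latency_ms:412|g|#endpoint:/checkout",
		"db.lock_wait_ms:390|g|#replica_set:rs1",
	}
	fmt.Println(metrics)

	// Option B: one arbitrarily wide event for the same request. Every
	// detail and outlier characteristic stays attached to the same record.
	event := map[string]interface{}{
		"request_id":      "req-8c1f",
		"user_id":         "u42",
		"endpoint":        "/checkout",
		"build_id":        1234,
		"replica_set":     "rs1",
		"latency_ms":      412,
		"db_lock_wait_ms": 390,
		"error":           true,
	}
	out, _ := json.Marshal(event)
	fmt.Println(string(out))
}
```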

By using events and passing along the full context, conversely, I can ask any question of my system and inspect its internal state, and therefore I can understand any state the system has gotten itself into, even if I have never seen it before, never conceived of it before! I can understand anything that is happening inside my system, any state it may be in, without having to ship new code to handle that state. This is key. This is observability.

You can always understand the things you predicted and checked for. But if you checked for something, that implies you knew to expect it, which creates a Catch-22. The reason monitoring worked so well for so long was that we could predict most of the states our systems would get into. Connections would fill up, CPU would overload, you would need to add more app capacity or tune your database, and so on. You could predict most of them, and you’d find the rest out the hard way. Systems were relatively stable, and the only unpredictable problems were the ones triggered by your own team deploying code, which is why so many teams are so terrified of deployments.

A system is observable to the extent that you can understand new internal system states without having to guess, pattern-match, or ship new code to understand that state. This, to me, is the most useful way to extend the control theory concept to software engineering. The ratio of known-unknown system states to unknown-unknown system states is dropping like a rock for most teams. The unknown-unknowns now rapidly outpace monitoring dashboards’ ability to explain them to the humans responsible for continuous uptime, reliability, and acceptable performance.

This is a technical distinction worth preserving and distinguishing from mere telemetry, because the ability to understand unknown-unknown states is what so many teams currently lack, and that lack is actively hurting them every single day. With modern distributed systems and the platform-ification of services, unknown-unknowns are most of what you will have to deal with for the rest of your life. It’s worth getting good at explaining them. It’s worth preserving a technical vocabulary for these solutions.

Observability and Its Progenitors

Over the past three years, there is no doubt that “observability” has taken off quickly, in part because others have jumped on the “movement,” in spite of considerable attempts to dilute the term and confuse practitioners. Let’s take a step outside and see what happened across the market, following a chronological timeline.

I am not the first person to mention observability in the context of software. The first time I heard the term was when I came across Twitter’s monitoring team, then named “observability engineering.” They used observability as a straight-up synonym for telemetry, as far as I could tell.

The second time I heard the term used was in the article by Cory Watson, “Creating a culture of observability at Stripe.” Cory had been the manager of the Twitter team. That heritage is the only one I am aware of prior to Honeycomb adopting the term, promoting it, and building a software solution toward that end.

In parallel, tech titans including Google were developing their own observability platforms, but not calling them “observability” per se. Those tools, such as Dapper, Monarch, and Dremel, were built for Google’s internal use and weren’t externalized at all until Google Cloud Platform came along, at which point they needed an external, market-driven name for them, and, oh hey look, observability is a thing now? Sometimes it takes a few bigger players to make it a movement.

Honeycomb’s principal dev advocate Liz Fong-Jones, formerly with Google, shared: “We were calling what we were doing ‘monitoring,’ sometimes, except it had moved far beyond what people in the outside world thought was ‘monitoring’ (Nagios, Splunk, etc.), but we also had started using the words ‘observability,’ ‘observe,’ and ‘observation’ in parallel in the SRE book published March 2016.” By November 2017, Google engineer Jaana B. Dogan was using “observability” full-throatedly, but to refer to aggregated systems data, because that was what worked at that scale.

Honeycomb was founded on Jan. 1, 2016. We spent the first year building the storage engine and query planner, laying a strong foundation for a product that would scale without compromising on cardinality. Halfway through 2016, I was tied up in knots trying to figure out how to describe what we were doing and the impact it would have on software engineering teams living through the pain I had witnessed firsthand at Parse/Facebook. Early attempts like “business intelligence for systems” and “strace for distributed systems” didn’t quite have the impact I was looking for and seemed a little too narrow at the time. According to my chat logs, I looked up the definition of observability in July 2016, and began talking about it nonstop after that.

As a harbinger of the coming storm, Gregory Poirier sounded off at Monitorama in June 2016 about the struggles of folks trying to run production systems using the then-current state of the art, in a talk titled “Monitoring Is Dead.”

In May of 2017, I gave a talk at Monitorama called “Monitoring: A Post Mortem” where I talked about the cardinality limitations and other inherent problems with the monitoring model, and on the last slide welcomed everyone back next year to “observability-orama”. As you’d expect, it made some people grumpy on Twitter.

And in September ‘17, engineer Cindy Sridharan wrote an influential piece on observability, where she mostly adopted our frame and shone more light on the difference between known-unknowns and unknown-unknowns. This validated what we had begun to realize: there was a deep well of growing dissatisfaction with monitoring, APM, and logging tools; they simply no longer met people’s needs. Engineers were extremely receptive to the way we described the problems with monitoring software, and were ready, willing, and eager to hear about what was next and how it would solve the issues they faced every day, issues that were only getting worse.

In 2017, Peter Bourgon also published an article saying that observability has three pillars. Vendors eagerly latched on to this alternate definition; I’ll let Ben Sigelman of Lightstep refute it most thoroughly here.

In 2018, the QCon conference added an “observability” track. The Serverlessconf series was also an early and enthusiastic adopter of observability, which makes perfect sense because there are characteristics of serverless that align perfectly with the newer model: viewing the world purely through the lens of your instrumentation, not logging to disk, aggregating lots of information densely by request, and so on.

Microservices communities and Kubernetes adopters were also early and quick to embrace observability, because once you’ve blown up (or decomposed) the monolith, most of your “traditional” debugging tools no longer work. You have to return to first principles and make all of these decisions again, and aggregation on the request ID becomes paramount; the hardest part is figuring out where the problem is in your distributed system, not debugging the code itself.
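
As a small illustration of why the request ID matters, here is a sketch of a service handler that reads an incoming request ID, emits a wide event keyed by it, and passes the same ID along to a downstream call so every hop’s events can later be aggregated together. The header name, URL, and service names are assumptions for this example, and a real service would typically lean on a tracing library rather than hand-rolling this.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Hypothetical header used to carry the request ID across hops.
const requestIDHeader = "X-Request-ID"

func handler(w http.ResponseWriter, r *http.Request) {
	reqID := r.Header.Get(requestIDHeader)

	// Emit a wide event for this hop, keyed by the same request ID.
	fmt.Printf("event: service=checkout request_id=%s path=%s\n", reqID, r.URL.Path)

	// Pass the same ID along on any downstream call.
	downstream, _ := http.NewRequest("GET", "http://inventory.internal/reserve", nil)
	downstream.Header.Set(requestIDHeader, reqID)
	// (In a real service you would now execute the downstream request.)

	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/checkout", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```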

And between 2017 and 2018, literally every vendor in the monitoring, APM, and log management market segments added the term observability to their content, sites, and marketing language. Which brings us to the next section, on how many of them misuse observability and misguide others.

How Observability’s Purpose and Value Have Been Misrepresented

In early 2018 I noticed that vendors had latched on to “distributed tracing, metrics, and logs” as the “three pillars of observability.” Ben Sigelman neatly debunked this, saying it makes no sense because those are just three data types. You may achieve [observability] with all three, or none; what matters is what you do with the data, not the data itself.

If you attend an industry conference today, you’re likely to hear the speakers adhere to the accurate definition: that observability is how you explain unknown-unknowns, and that it’s about exploration and debugging rather than dashboards, pattern matching, or access to certain data types. I am impressed and delighted that practitioners have remained mostly impervious to the blanket marketing being done by so many tool vendors pushing the “three pillars” definition on users, yet I wonder how much longer they can hold out against the millions of dollars of ad spend deployed with the goal of shipping more software and increasing spend. I’m hopeful, though I’m not sure whether this is just the “elites” or whether the more technical definition is finally trickling down to the general population; I do see early signs of the latter over the past six months.

I am sometimes criticized for using observability as my own product marketing term and policing its definition. People accuse me of defining observability to mean “the set of features that make up Honeycomb.” It’s a fair criticism! But it gets the causation backward. Instead of defining observability as “what Honeycomb does,” observability came first. I spent years grappling with these problems, and months stewing over the definition of observability, its side effects, and its implications for an observability solution. For example, observability is impossible without:

  • raw events
  • high cardinality dimensions
  • no pre-aggregation, no pre-indexing (which lock you into asking predefined questions)
  • read time aggregation
  • arbitrarily wide events
  • schema-less-ness
  • structured data
  • oriented around the lifecycle of the request
  • batched up context
  • not metrics-based
  • static dashboards don’t work, it must be exploratory
  • etc.

And then we built Honeycomb precisely to that spec.

Hell yes, I will police how people use it, to some extent; I desperately want it to be a real technical term with real meaning. We need that specific technical language to grapple with the problems we face as software engineering teams. We do not need another synonym for telemetry; of those we already have plenty.

If we do not appropriate “observability” to denote the differences between known unknowns and unknown-unknowns, between passive monitoring and exploratory debugging, it is not clear what other terms are available to us (and unclear that the same fate will not befall them). I believe it will set the industry back by years if we cannot clearly articulate the (substantial) technical differences between monitoring and observability. But this will be up to the engineers in the field, the only people with the ability to hold vendors accountable for their language — or not.

The Future of Observability

Three short years into this ride, I ponder the question: What’s next, and where will this movement take us? I believe that in the next ~3 years, those three categories (APM, monitoring/metrics, and logs), and possibly others, are likely to cease to exist. There will be only one category: observability. And it will contain all the insights you need to understand any state your system can get itself into.

After all, metrics, logs, and traces can trivially be derived from arbitrarily wide structured events; the reverse is not true.

Users are going to start to figure out that they are paying multiple times to store a single data set they should only have to store once. There is no reason to invest budget with separate monitoring vendors, logging vendors, tracing vendors, and APM vendors. If you collect data in arbitrarily wide structured events, you can derive metrics from them, and if you automatically append some simple span identifiers, you can use those same events for tracing views. Not only can you cut spending by 3-4x, but it’s phenomenally more powerful to use a single tool and fluidly flip back and forth between the big picture (“there’s a spike”) and drilling down to the exact raw events with the errors. Then you can compute what outlier values they have in common, trace one of them, locate where in the trace the problem lives, and figure out who else is impacted by that specific outlier behavior. All conducted in one single solution, with all teams getting the same level of visibility.
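
A minimal sketch of that derivation, assuming events that already carry invented trace and span identifiers: the same wide events yield both a per-service metric rollup and a trace view, while the reverse direction is impossible once the data has been pre-aggregated.

```go
package main

import "fmt"

// Event is one arbitrarily wide, structured record per unit of work.
// The field names here are invented for illustration.
type Event struct {
	TraceID    string
	SpanID     string
	ParentID   string
	Service    string
	DurationMS float64
	Error      bool
}

func main() {
	events := []Event{
		{TraceID: "t1", SpanID: "a", ParentID: "", Service: "edge", DurationMS: 420, Error: true},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Service: "checkout", DurationMS: 390, Error: true},
		{TraceID: "t1", SpanID: "c", ParentID: "b", Service: "db", DurationMS: 370, Error: true},
	}

	// Derive a metric: request count and average duration per service.
	count := map[string]int{}
	total := map[string]float64{}
	for _, e := range events {
		count[e.Service]++
		total[e.Service] += e.DurationMS
	}
	for svc, n := range count {
		fmt.Printf("metric  %-8s count=%d avg_ms=%.0f\n", svc, n, total[svc]/float64(n))
	}

	// Derive a trace: the same events, stitched together by span identifiers.
	for _, e := range events {
		fmt.Printf("trace   %s span=%s parent=%s service=%s\n", e.TraceID, e.SpanID, e.ParentID, e.Service)
	}
	// Going the other direction -- recovering these events from
	// pre-aggregated metrics -- is not possible.
}
```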

Right now this is either a) impossible, or b) possible only because a human being copy-pastes an ID from one system to another to the next. This is wasteful, slow, and cumbersome, and extremely frustrating for the teams that have to do it when trying to solve a problem. Tools create silos, and siloed teams spend too much time arguing about the nature of reality instead of about the problem at hand.

Engage in Constant Conversation with Your Code

We are putting software engineers on call and empowering them to truly understand their own code in production. We enable engineers to test in prod and to experiment with chaos engineering, feature flags, and other modern practices.

Three years ago, this was an active argument in the industry.  These battles are over; now we know the only way to build quality services is to empower software engineers to own their code all the way into production. All that’s left is the implementation, which is in progress and will continue to take place over the next decade or so as our industry continues to drive providers to deliver services at scale.

For Engineers … but also Engineering-adjacent Teams

I also think that after mastering this for engineering that builds it, improves it, and maintains it — after getting that tight, virtuous feedback loop of “verifying that what I shipped is behaving the way I expected it to, nothing else looks weird,” and after successfully putting developers on call — the next frontier is exposing real production insights to engineering-adjacent teams. Support, customer service, product managers, and even business owners of those systems stand to gain from a deeper understanding of what’s happening with business-critical applications. Tools create silos — if your team uses one tool, and another team uses a different tool, you don’t share the same view of reality. You will spend a lot of time disagreeing over what you deem reality before even getting to resolving the real issue.

We can empower other teams to do vast amounts of debugging and problem-solving without even involving the engineers. Imagine a templated set of questions for a support team to plug a user-ID into and check to see if the complaint matches a known bug or has already been fixed, or if the complaint is even real, before opening a ticket and escalating to engineers. Imagine all the time you spend on-call digging around in prod to answer questions for other people. Now imagine you don’t have to do any of that.
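
For instance, a templated support question might look something like the sketch below, which filters raw events by a user ID supplied by the support team. The event fields and helper function are hypothetical, invented for illustration.

```go
package main

import "fmt"

// Event is one wide record per request; fields are invented for illustration.
type Event map[string]interface{}

// complaintsFor is a templated question a support team could run themselves:
// "show me this user's failing requests", no engineer required.
func complaintsFor(events []Event, userID string) []Event {
	var out []Event
	for _, e := range events {
		if e["user_id"] == userID && e["error"] == true {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []Event{
		{"user_id": "u42", "endpoint": "/sync", "error": true, "build_id": 1234},
		{"user_id": "u42", "endpoint": "/sync", "error": false, "build_id": 1235},
		{"user_id": "u7", "endpoint": "/login", "error": true, "build_id": 1235},
	}
	// Support plugs in the complaining user's ID and sees for themselves
	// whether the complaint is real before escalating to engineering.
	fmt.Println(complaintsFor(events, "u42"))
}
```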

Everyone wins the closer they get to understanding production.

TLDR … It’s Still about the People.

This battle will be won by whoever can deliver the best end-user experience. As Mike Julian said in his Monitoring and Observability 2019 Predictions, history, social sharing, and learning from each other across disparate teams are necessary to make distributed systems understandable and tractable.

AI and ML are powerful (and possibly even dangerous) tools, but too many organizations run the risk of putting the cart before the horse. Any machine can detect a spike, but only a human can tell you whether that spike was bad, good, desired, expected, or scary. Only a human can derive meaning from numbers.

We believe in Allspaw’s declaration that debugging must forever be a human-centered process. Our aim should be to make it as pleasant and collaborative a process as possible.
