Creating and running an on-call rotation is the very first step in building truly reliable infrastructure at any scale, but it’s far from the last. As a company grows and scales, the systems that it builds and runs grow in complexity and require more sophisticated on-call practices. While there is no universal approach, there are industry best practices for setting up on-call and incident response at any and every size.
In the sections that follow, we take a close look at how to make on-call work at any scale. We’ll examine how to design, support, and empower on-call and incident response for each size, starting with how tiny garage startups with a handful of engineers can run an on-call rotation and making our way up to best practices for companies the size of Amazon, Facebook, or Google.
Even the smallest of startups needs to ensure that its products are readily available to users, and building an on-call rotation that sets the new company up for future success can help it meet this goal. Having an on-call rotation at this scale is essential for understanding whether or not the product is actually working as expected and promised, and for fixing it when it breaks. It’s all well and good for the servers to go down when the only users are friends and family, but after a customer has paid for the service and when investors want to know about the service’s performance, letting the servers go down for two hours during someone’s date night is not going to cut it.
Startups bootstrap their on-call rotations out of three primitives. The first is a notification tool that will notify them when there is a problem, like PagerDuty. The second is a collaboration tool engineers can use for communication during outages, such as Slack, HipChat, or IRC. The final primitive is a monitoring system to know when there is a problem, which can be a bit harder to set up than the first two, so we’ll take a closer look at it.
Good, accurate monitoring is made up of three pieces: healthchecks, metrics, and dashboards. Healthchecks should be run against both services and servers, verifying they are reachable and responding. Many startups have unfortunately managed to create high-availability, responsive, and scalable ways to serve error pages, so true confidence that the business is operating normally requires collecting metrics. The next step is to create dashboards of these metrics so that engineers can visually verify the system is behaving normally (and spot correlated events, like a botched deploy, when it is not). These needs can be met by a variety of hosted and open-source options. On the hosted side, various platforms such as Datadog provide healthchecks, metrics, and dashboards. If a company decides to go the open-source route, a combination of tools such as Nagios and Graphite will do the trick.
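Because a healthcheck only proves a service is reachable, confidence that the business is operating normally comes from evaluating metrics against expected ranges. A minimal sketch of that kind of threshold check follows; the metric names and threshold values are hypothetical, and real systems would pull these from a monitoring platform rather than plain dictionaries.

```python
def evaluate_metrics(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics that are missing or outside their
    expected (low, high) range -- candidates for firing an alert."""
    breached = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            breached.append(name)
    return breached
```

A periodic job running a check like this against collected metrics, paired with a dashboard of the same series, is enough for a small team to know both that the service is up and that it is doing real work.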
There are significant advantages to being small. When something does go wrong, it tends to be quite clear who broke what, and the entire team is familiar enough with the system to be able to fix problems. The business impact of something breaking is also smaller: early adopters have a higher tolerance for outages, and the absolute number of users affected is much, much lower than when operating at scale. The culture and habits adopted when the company is established will grow along with the company as it scales, making this the best window of opportunity to foster and evolve a culture of reliability.
However, being scrappy is not without its disadvantages. Hardening alert configurations and expanding them to detect a wide array of potential outages is labor-intensive work, and it’s an ongoing challenge to prioritize better monitoring over feature development. The consequences of a small team extend beyond prioritization, and providing 24/7 coverage with only two or three engineers can be a punishing experience.
As a company grows, so do the demands of its infrastructure, its operational workload, and its on-call and incident response. For many companies, this is when dozens of servers grow to hundreds, and, in the interests of scalability, a monolithic service may be broken into multiple microservices.
The engineering team will be large enough at this point to expand the on-call rotation; a sustainable rotation might require, for example, eight engineers who run a two-layer rotation while restricting time on-call to one week per month. It is critical that incidents are routed to the rotation–and not consistently bypassed to be solved by a subset of experienced engineers–to avoid burning out key engineers, expand understanding of the systems, and train the seed members for expanding to multiple rotations. Moving from a shared on-call rotation to multiple rotations is a key scaling challenge, requiring active attention to ensure practices remain healthy and consistent.
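The arithmetic behind a sustainable rotation is worth sketching out. With eight engineers and two layers (a primary and a secondary), offsetting the secondary by half the roster means each engineer serves two weeks out of every eight, or roughly one week per month. This sketch is illustrative only; real scheduling tools handle swaps, holidays, and overrides.

```python
def build_rotation(engineers, weeks):
    """Build a two-layer weekly rotation: each week gets a primary and a
    secondary, with the secondary offset by half the roster so no one
    holds both roles at once."""
    n = len(engineers)
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + n // 2) % n]
        schedule.append((week, primary, secondary))
    return schedule
```

Over an eight-week cycle, every engineer appears exactly once as primary and once as secondary, keeping total on-call time to one week per month.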
The consequences of not having business continuity become frighteningly real, leading to an investment in disaster recovery to be able to recover from the dire and unexpected. More than a one-time implementation, recovery must be validated at least on an annual basis, initially as a best practice, and eventually for compliance purposes.
At this size, it’s no longer effective to inspect logs on individual servers, and it’s time to adopt a centralized log aggregator, such as ELK or Splunk. Similarly, an error aggregator (like Sentry) will make it easier to understand where errors are coming from. Deployment tooling and strategies should be expanded to support easy rollbacks, and broken deployments should be rolled back rather than repaired in the production environment.
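The “roll back rather than repair” policy can be expressed as a simple control flow around the deployment tooling. The sketch below assumes hypothetical `deploy`, `healthcheck`, and `rollback` callables standing in for whatever the deployment system actually provides.

```python
def deploy_with_rollback(deploy, healthcheck, rollback):
    """Run a deployment, then verify health; on failure, roll back
    immediately instead of debugging in the production environment."""
    deploy()
    if healthcheck():
        return True
    rollback()  # restore the last known-good version
    return False
```

Codifying this keeps the decision out of the hands of a stressed on-call engineer at 3 a.m.: if the healthcheck fails, the rollback happens, and the investigation moves to a non-production environment.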
The tooling adopted at this phase is the same tooling deployed at companies that have scaled to thousands of engineers and major revenue. Despite this fact, the infrastructure footprint is still small enough to allow for major shifts, such as moving from a physical datacenter to the cloud, or from “manual orchestration” to programmatic orchestration like Kubernetes or Mesos. Making large-scale changes to the infrastructure becomes more difficult when the company grows larger, so companies at this scale have a unique opportunity to quickly test out new infrastructure and set themselves up for future scalability and reliability—an opportunity that won’t be there for long.
The process surrounding on-call will start to evolve as well. Incident classification is the practice of classifying incidents according to their user impact, often labeling a minor incident as “Level 4” and a widespread outage as “Level 1.” Postmortems to discuss the timeline and impact of large incidents will create a feedback loop to guide learning about system reliability: Etsy’s Debriefing Facilitation Guide describes how to run healthy postmortems that create a permanent culture of learning from failures.
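Incident classification is usually just a mapping from measured user impact to a severity level. The thresholds below are hypothetical examples; each company calibrates its own levels against revenue and user impact.

```python
def classify_incident(affected_users: int, total_users: int) -> int:
    """Map user impact to a severity level, from Level 1 (widespread
    outage) down to Level 4 (minor incident). Thresholds are examples."""
    impact = affected_users / total_users
    if impact >= 0.5:
        return 1  # widespread outage
    if impact >= 0.1:
        return 2  # major degradation
    if impact >= 0.01:
        return 3  # limited impact
    return 4      # minor incident
```

Agreeing on these levels ahead of time makes triage faster and gives postmortems a shared vocabulary for discussing impact.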
Managing an incident will become complex enough to require two or more individuals collaborating, and clearly defined roles and responsibilities make collaboration during stressful times easier. In particular, defining the role of incident commander, who is not involved in debugging but instead coordinates communication, is valuable. One approach to incident commanding is covered in PagerDuty’s Incident Response Documentation.
Sustainable rotation comprised of at least eight engineers.
Multiple rotations make it likely the person paged can debug the impacted system.
Disaster recovery will make it possible to recover from the dire and unexpected.
Error aggregator to track the source of errors, like Sentry.
Rollback broken deployments, instead of repairing in production.
Incident classification will clarify the user impact of outages.
Postmortems after large incidents will create a feedback loop to guide learning about system reliability.
Incident commanders who handle coordination but don’t debug.
As a company grows and its business becomes increasingly valuable, the stress created by incidents increases in tandem. There will be engineers at the company who remember when two hours of downtime was smoothed over by sending the one affected customer a company T-shirt, while now downtime can cost tens of thousands of dollars per minute.
Fortunately, as a larger company there are more resources and tools to offset the stress of being on-call. The first–obvious, but greatly underutilized–is explicitly training engineers about on-call process and tooling. Beyond a presentation, the best training involves running “game days” with manually triggered failures or generated load to create realistic opportunities for debugging.
At this point, each team should be responsible for the software they write, with a backstop rotation of your most experienced engineers for exceptional circumstances. An effective combination of author ownership with a backstop is necessary, as increasingly no individual engineer has a complete mental map of how the entire system works. There are some exceptions—Airbnb, for example, relies on an elite volunteer rotation for their backstop, and Slack continues to rely on their operations team for primary on-call.
This is also when there are enough resources to selectively move beyond generic open-source solutions, starting to roll out specialized tooling. This includes request tracing tooling such as Zipkin or Lightstep, which will greatly simplify debugging, particularly in environments where a monolithic service has been unwound into a swarm of microservices. It’s also time to write an ownership metadata service that makes it possible to have a centralized source of truth for who owns what; this metadata will be the backbone of routing alerts, rolling out cost accounting, and frantic lookups during outages. Moving past a wiki documenting the steps to take during an outage, adopt an incident registration tool to coordinate the incident process from a single place: sending alerts and notifications in chat, and filing a ticket to create a postmortem. Most companies start making a major investment in runbooks, ideally including a link to a runbook in every alert that is sent.
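The core of an ownership metadata service is a lookup from service to owning team and rotation, used to route alerts. A minimal sketch follows; the service names, rotation names, and the `backstop-oncall` fallback are all hypothetical, and a real implementation would back this with a database and an API rather than an in-memory dictionary.

```python
# Hypothetical ownership records: service -> owning team metadata.
OWNERSHIP = {
    "payments-api": {"team": "payments", "rotation": "payments-oncall", "chat": "#payments"},
    "search": {"team": "discovery", "rotation": "discovery-oncall", "chat": "#discovery"},
}

def route_alert(service: str) -> str:
    """Return the on-call rotation that should be paged for a service,
    escalating unowned services to a backstop rotation."""
    entry = OWNERSHIP.get(service)
    if entry is None:
        return "backstop-oncall"  # no recorded owner: page the backstop
    return entry["rotation"]
```

The same records that route alerts can drive cost accounting and the “who owns this?” lookups that happen mid-outage, which is why centralizing them pays off quickly.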
In addition to tools for managing incidents, the disruption and financial impact will drive increasing focus on avoiding incidents to begin with. This includes more rigorous deployment strategies, such as deploying to a small number of canary hosts before deploying widely, and incremental deployments that abort automatically if error rates elevate unexpectedly. Software will be automatically tested in a quality assurance or staging environment, running integration tests against every deployment before it is promoted to your production environment.
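An incremental deployment that aborts on elevated error rates can be sketched as a loop over host batches, where the first batch acts as the canary. The `deploy` and `error_rate` callables here are hypothetical stand-ins for real deployment and monitoring integrations, and the 5% threshold is an illustrative default.

```python
def incremental_deploy(hosts, deploy, error_rate, threshold=0.05, batch_size=1):
    """Deploy in batches (the first batch is effectively the canary),
    aborting if the observed error rate exceeds the threshold.
    Returns the hosts deployed so far and whether the rollout completed."""
    deployed = []
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            deploy(host)
            deployed.append(host)
        if error_rate() > threshold:
            return deployed, False  # abort; remaining hosts keep the old version
    return deployed, True
```

In practice the abort would also trigger the rollback path, returning the affected hosts to the last known-good version.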
For outages that can’t be prevented entirely, there are mechanisms to limit the blast radius, such as rate limiters which shed excessive traffic and circuit breakers to prevent cascading failures from failing downstream services.
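A circuit breaker limits blast radius by failing fast once a dependency has proven unhealthy, serving a fallback instead of piling more load onto it. The sketch below is a minimal version with assumed defaults (three consecutive failures to open, thirty seconds before retrying); production implementations typically add half-open trial logic, per-endpoint state, and metrics.

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive failures, then fail fast
    until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the unhealthy dependency
            self.opened_at = None  # window elapsed: allow a trial request
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result
```

The fallback might serve cached data or a degraded response; either way, the failing downstream service gets breathing room to recover instead of being hammered into a cascading failure.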
Process should explicitly include steps for communicating externally, building trust with customers and users by giving timely visibility into issues. Some companies, like GitLab, take this to the extent of providing public postmortems. At this phase, it’s important to have clearly defined reliability metrics to measure and improve against, particularly for PaaS and SaaS companies.
Some challenges are exacerbated by size, and alert escalations across teams–particularly for legacy systems–will become increasingly frequent and a source of contention.
Training for engineers on on-call process and tooling.
“Game days” to create realistic opportunities for debugging.
Backstop rotation of your most experienced engineers for exceptional circumstances.
Ownership metadata service maintaining a centralized source of truth for who owns what.
Incident registration tool to coordinate the incident process from a single place.
Runbooks for common issues, linked to every delivered alert.
Canary hosts to validate software in production before deploying widely.
Incremental deployments that abort automatically if error rates elevate unexpectedly.
Staging environment that automatically tests software before it reaches production.
Rate limiters to shed excessive traffic.
Circuit breakers to prevent cascading failures from failing downstream services.
Communicating externally to build trust with customers and users.
Reliability metrics to measure and improve against.
At this scale a company has become truly massive, moving from a handful of on-call rotations to dozens of them. Outages are immensely expensive, and when one does occur, recovery is prioritized over understanding, as it’s too expensive to debug in production.
Servers or VMs are sufficiently numerous that even acknowledging alerts issued from individual servers is infeasible, requiring investment into alert deduplication, with each root cause triggering a single alert. The frequency of untuned alerts firing will overwhelm on-call engineers, leading to tracking alert metrics, which guide tuning and fixing alerts that fire frequently: the aim is for every alert to be an actionable alert.
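At its simplest, alert deduplication groups raw per-server alerts by a fingerprint approximating the root cause, emitting one alert per group. The grouping key below (check name plus cluster) is an assumed example; real systems derive fingerprints from richer signals.

```python
from collections import defaultdict

def deduplicate(alerts):
    """Collapse per-server alerts into one alert per fingerprint,
    where the fingerprint (check, cluster) approximates a root cause."""
    grouped = defaultdict(list)
    for alert in alerts:
        fingerprint = (alert["check"], alert["cluster"])
        grouped[fingerprint].append(alert["host"])
    return [
        {"check": check, "cluster": cluster, "hosts": hosts}
        for (check, cluster), hosts in grouped.items()
    ]
```

Instead of fifty pages for fifty full disks in one cluster, the on-call engineer gets a single alert listing the affected hosts, which is also the shape needed to track alert metrics per root cause.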
Failure isn’t unusual, it’s a constant. Four servers breaking will go from an exceptional day to a normal Tuesday; switches and entire datacenters will start failing with frequency. Don’t be content with entropy—start to actively inject failures into systems to ensure they fail properly using tools like Chaos Monkey. Because failures are too frequent to remediate manually, deprecate runbooks and move towards automated remediation as widely as possible. A multi-region or multi-datacenter strategy makes it possible to continue providing service, even under the most dire circumstances. As a consequence of this investment into tooling, humans can increasingly delegate on-call responsibilities to computers, only becoming involved under exceptional circumstances.
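Moving from runbooks to automated remediation amounts to turning each documented fix into code keyed by alert type, paging a human only when no automation applies. The check names and remediation actions below are hypothetical placeholders for real operational scripts.

```python
# Hypothetical remediations: alert check name -> automated fix.
REMEDIATIONS = {
    "disk_full": lambda host: f"pruned old logs on {host}",
    "process_down": lambda host: f"restarted service on {host}",
}

def remediate(alert):
    """Run the automated remediation for an alert if one exists;
    otherwise escalate to the on-call engineer."""
    action = REMEDIATIONS.get(alert["check"])
    if action is None:
        return ("page", alert)  # exceptional circumstance: involve a human
    return ("resolved", action(alert["host"]))
```

Each remediation added this way permanently removes a class of pages, which is how on-call responsibility gradually shifts from humans to computers.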
Things change beyond the technology as well. 24/7 coverage is replaced with a follow-the-sun rotation to reduce midnight calls, increasing quality of life, and also reducing time to remediation. An executive sponsor for reliability talks about why reliability matters at company all-hands meetings. Incidents become frequent enough to hire a technical program manager to coordinate the postmortem process.
Recovery over understanding, it’s too expensive to debug in production.
Alert deduplication so each root cause triggers a single alert.
Alert metrics to guide tuning and fixing alerts that fire frequently.
Actionable alerts are the only alerts.
Inject failures into systems to ensure they fail properly, using tools like Chaos Monkey.
Deprecate runbooks and move towards automated remediation.
Multi-region provides real-time business continuity.
Follow-the-sun rotation to reduce midnight calls.
Executive sponsor for reliability to champion efforts.
Technical program manager to coordinate the postmortem process.
Companies at this size operate in rarefied air, having enough engineers to build any tool imaginable, and to automate away as much of the operational workload as possible. The overall system’s complexity is beyond the comprehension of any individual, causing the benefits of having written part of the system to be surpassed by the benefits of experience with on-call tooling. Consequently, it’s time to move away from the previously effective “who wrote it, is on-call for it” model to a centralized on-call, operated by specialized engineers.
True automation of the operational workload for stable parts of the system.
Centralized on-call operated by specialized engineers.