I love big cities. I have lived mainly in big cities all my life, until recently. Some people even say I can tolerate large companies because I am a big city girl.
I am currently in New York, and this city makes me think about scale. Big city people learn at an early age how to deal with scale and the misery of being insignificant. We also learn to navigate our way around. If you live in NYC, you probably care about a few parts of the city, not the entire city. If you are visiting, you may care about walking from Central Park to Times Square. Your experience of the city is limited to what you choose to engage with.
Systems are like cities. Our users engage with a tiny part of the entire stack and that’s what matters from their perspective. That’s how you remember your entire stack or product.
Do you think that you are good at systems? Do you scale in terms of development, production, and maintenance? I have worked for many large and small organizations and never thought we did. At my current company, I think we are doing a better job, but there are still gaps here and there.
You know that the early days of a company or a project are the best. Things are simple. You often have a server and a few other components. Where is your data? You have a tiny Postgres cluster. You depend on vendor services for small concerns like email and SMS.
At this point, your architecture still fits in your brain.
You check your logs when there is an outage. When someone joins the team, you take them to the whiteboard to explain the architecture that fits in your brain.
The next step: growth. More engineers. Your company is growing because your business is legit. The number of common reusable components is increasing. Some teams want to deploy more, some don’t.
There are certain conflicts between teams. This is the moment where monoliths don’t scale. At this point, it is clear that you want to do something else. You begin to have lots of services and different storage systems.
One problem becomes many. You cannot depend on the earlier ways of doing things. For example, reading logs and debugging become harder. Things sometimes fail in isolation. It becomes harder to understand the root causes. Your engineers don’t know where things fail and who to ping when they do.
And it gets larger. Many nodes, a billion lines of code. Seriously, some companies, like mine, are that large. Who has time to understand anything in such a large company?
Maybe Jeff Dean. Maybe only people who have been around for a long time have the advantage of understanding things end to end. But is that really how we work? Are we depending on Jeff Dean?
At my company, things sometimes look like a huge mess. It looks big and complex. Who would want to work there in the first place?
At some point, you have to stop and reconsider what you have been doing.
OK, we have this huge complexity: so many lines of code owned by tens of teams. No one knows how this works end to end.
People who have been around for a long time are leaving; they are tired of being the only source of truth.
Don’t tell me about documentation. Docs lie; things always change faster than docs.
I will tell you one thing: even if engineers were paid based on how well they documented their work, they wouldn’t document.
When I joined Google, in my first hour, they showed me code search. It is basically a Google for Google source code. It is irreplaceable; it is amazing. But only if you have an entry point: the text, symbol, file or project name you are looking for. If not, good luck. I use code search every day at Google, but I have also learned over time that it is only part of the story.
The second thing I was able to see was the dependency graphs from Blaze, our internal build tool that inspired Bazel. We have a unified build tool, and we use it everywhere. You can click on a build target and see what it depends on and what depends on it. But again, this is only part of the story.
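To make the two directions of that lookup concrete, here is a minimal sketch in Python. The target names and the `DEPS` mapping are made up for illustration; this is not how Blaze or Bazel store or query their graphs.

```python
from collections import defaultdict

# Hypothetical build targets mapped to their direct dependencies.
DEPS = {
    "//timeline:server": ["//timeline:ranking", "//auth:client"],
    "//timeline:ranking": ["//storage:client"],
    "//auth:client": ["//rpc:core"],
    "//storage:client": ["//rpc:core"],
    "//rpc:core": [],
}


def reverse_graph(deps):
    """Invert the graph: map each target to the targets that depend on it directly."""
    rdeps = defaultdict(list)
    for target, direct in deps.items():
        for dep in direct:
            rdeps[dep].append(target)
    return rdeps


def transitive(graph, start):
    """Return every node reachable from start (assumes the graph has no cycles)."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


RDEPS = reverse_graph(DEPS)
depends_on = transitive(DEPS, "//timeline:server")   # what the server needs
depended_on_by = transitive(RDEPS, "//rpc:core")     # who breaks if rpc:core breaks
print(sorted(depends_on))
print(sorted(depended_on_by))
```

Both answers come from the code alone. Nothing here says which of these edges a real user request actually exercises in production.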
Static analysis tools can tell you about your code dependencies, not service dependencies. Your deployment manifests can tell more, but no one likes to analyze deployment configuration files.
None of these tools are good at pointing out which critical paths and dependencies require the MOST attention. This is what things look like in the wild. And the blue line is what a true critical path looks like. User requests come to the load balancer and go through several servers all the way down to low-level disks.
At my company, engineers can see critical paths from any other server. And this ability is the reason why it is possible to learn and debug our systems.
I have been thinking for a while that this way of thinking deserves to be its own paradigm. I want to call it critical path driven development, or CPDD, for now.
The availability of underlying processes or services is not the main goal. The availability and the experience on the critical paths are. Just like in the New York City example. Being able to think about and see your systems from the perspective of users is a very different approach, and a useful one.
Some of our engineering practices are based on:
- discovering the critical paths automatically,
- making them reliable and making them fast,
- making them debuggable in production.
If it is the middle of the night, if you are on call, I want you to see everything end-to-end even if you haven’t read the source code before.
There are two main emerging tools in the industry nowadays (finally). We hear about them in the context of observability: event collection and distributed tracing. We use distributed tracing at Google, but the two are similar tools.
Do you know the golden rule of exploring the cause-and-effect relationships?
It is the ability to ask why five times. Why? Why? Why? Why? Why?
Events and traces give you the ability to ask why, go deeper in the stack, and explore the root causes.
This is what a distributed trace looks like if you haven’t seen it before. It is a trace for an HTTP GET to /timeline.
It tells us the latency and the exact components along the way. We can look at these routes to learn about the life of a request. We can also use the data coming from production to debug the issues that are affecting our users.
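If you have never worked with trace data, a rough sketch may help. The span names and timings below are invented, and following the child that finishes last is a simplification of what real critical path analysis does; treat it as an illustration, not as how any particular tracing system works.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One timed unit of work in a trace; children are the calls it made."""
    name: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def critical_path(root):
    """Walk down from the root, following the child that finishes last.

    Simplifying assumption: the last child to finish is the one the
    parent was waiting on.
    """
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda s: s.end_ms)
        path.append(node)
    return path


# A hypothetical trace for an HTTP GET to /timeline.
trace = Span("GET /timeline", 0, 120, [
    Span("auth.Check", 2, 12),
    Span("timeline.Fetch", 12, 115, [
        Span("cache.Get", 13, 18),
        Span("storage.Read", 18, 110, [
            Span("disk.Read", 25, 105),
        ]),
    ]),
])

path = critical_path(trace)
for span in path:
    print(f"{span.name}: {span.duration_ms} ms")
```

Reading the output top to bottom is the "life of a request": each line is one more "why" deeper in the stack.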
This is a learning tool. You don’t need to understand the implementation details of everything. We might have more control over our own processes, but anything underneath them remains a black box for most of us. If infrastructure provided visibility from the lower layers, it would revolutionize our industry. Being an expert would become easier.
Imagine Kubernetes was able to tell us about significant events in the lifetime of a user request. We wouldn’t have to learn the Kubernetes internals. We would just look at our traces instead.
More than a learning tool, this is a cross-stack debugging tool. You can assign blame: is it your fault or your infrastructure’s fault?
Here, we spent most of the time in the cloud scheduler, which is provided by a cloud provider. It is their fault. You can escalate to their SRE with confidence.
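One way to picture that kind of blame assignment is to add up "self time" per owner: the time a span spent that is not covered by its children. The sketch below assumes made-up span names, owners and timings, unique span names, and children that do not overlap in time.

```python
from collections import defaultdict

# Hypothetical spans: (name, owner, parent name, start_ms, end_ms).
SPANS = [
    ("GET /timeline", "frontend-team", None, 0, 200),
    ("timeline.Fetch", "timeline-team", "GET /timeline", 5, 195),
    ("cloud.Schedule", "cloud-provider", "timeline.Fetch", 10, 160),
    ("storage.Read", "storage-team", "timeline.Fetch", 160, 190),
]


def self_time_by_owner(spans):
    """Sum, per owner, the time each span spent outside of its child spans."""
    child_time = defaultdict(int)
    for name, owner, parent, start, end in spans:
        if parent is not None:
            child_time[parent] += end - start

    totals = defaultdict(int)
    for name, owner, parent, start, end in spans:
        totals[owner] += (end - start) - child_time[name]
    return dict(totals)


totals = self_time_by_owner(SPANS)
slowest_owner = max(totals, key=totals.get)
print(totals)
print("escalate to:", slowest_owner)
```

With these numbers, most of the request is spent in the span owned by the provider, which is the evidence you would bring to an escalation.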
We put much emphasis on traces or events nowadays but they are not the end game, they are more like a beginning. The next step is “Can you tell what source code this span represents?”
Or who to call when this block fails or has unexpected latency?
Or give me the logs, runtime or kernel events, or CPU profiles for this block, whatever else you have, so I can dig in and see why I am seeing latency here.
Nothing comes free, and I want to explain some of the everyday challenges first. Never underestimate the level of investment required to roll out these technologies at your organization.
- An organizational problem: If you need critical path analysis, it becomes a cross-team problem. The entire organization needs to agree on the basics. In order to have end-to-end events or traces, teams need to agree on a format to propagate identifiers. The load balancers, proxies and other binaries should be able to honor this format. It is unfortunate that we don’t have a well-defined industry standard for this yet, but a draft is in the works.
- Don’t know where to begin: Most people don’t know where to begin. Start with your network stack, specifically HTTP and RPC. This is where things get easy if you have little fragmentation in terms of frameworks. You can simply instrument frameworks and gather data from there.
- Infra is a black box: Infrastructure and vendor services are designed without considering observability. We still expect people to learn about the underlying stack by reading the code or the manual. As providers, we should instead give them visibility along the path of the request.
- Instrumentation is expensive: High-traffic systems end up downsampling and cannot depend 100% on critical path analysis. We sometimes miss collecting data for interesting cases, such as the 99th percentile, because downsampling is usually probabilistic.
- Dynamic capabilities are undervalued: The other challenge is how static our view of instrumentation has been for a long time in this industry. Dynamic capabilities are important. Ideally, we want to tweak things and start collecting more data when things go wrong in prod. Being able to do this, and to do it safely in prod, is still not within everyone’s reach.
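The first challenge, agreeing on a format to propagate identifiers, is easier to reason about with a toy example. In the sketch below, the header name `x-example-trace-id` and the `handle` function are hypothetical stand-ins, not a standard or a real framework API. Real HTTP header names are also case-insensitive, which this plain dictionary ignores.

```python
import uuid

TRACE_HEADER = "x-example-trace-id"  # hypothetical header name, not a standard


def extract_or_start(headers):
    """Reuse the incoming trace ID, or start a new trace if there is none."""
    trace_id = headers.get(TRACE_HEADER)
    return trace_id if trace_id else uuid.uuid4().hex


def inject(headers, trace_id):
    """Return a copy of the outgoing headers that carries the trace ID."""
    outgoing = dict(headers)
    outgoing[TRACE_HEADER] = trace_id
    return outgoing


def handle(name, incoming_headers, log, downstream=None):
    """Stand-in for a load balancer, proxy or server that honors the format."""
    trace_id = extract_or_start(incoming_headers)
    log.append((name, trace_id))
    if downstream is not None:
        downstream(inject({"accept": "application/json"}, trace_id))
    return trace_id


events = []


def server(headers):
    return handle("timeline-server", headers, events)


def proxy(headers):
    return handle("proxy", headers, events, downstream=server)


# The request enters at the load balancer without any trace header.
handle("load-balancer", {}, events, downstream=proxy)
for name, trace_id in events:
    print(name, trace_id)
```

All three hops record the same identifier only because each one extracts it and passes it on. A single binary in the middle that drops the header splits the trace in two, which is why this is an organizational problem more than a technical one.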
We are still in the dark ages when it comes to maintaining systems. When I talk about these concepts, I still feel like a snake oil salesman even though I am an engineer benefiting from these concepts almost every day.
The ability to have this level of visibility is a true differentiator. It makes your engineers happy and helps them not burn out. This is how you scale in development and in production.
CPDD is a tool that closes knowledge gaps more than any other practice in our field, yet we don’t talk about it. If you are building or providing infrastructure, we should collaborate. Let’s change the status quo, let’s close the knowledge gaps.