by Nic Munroe
Monolithic service architectures for large backend applications are becoming increasingly rare. The monoliths are being replaced with distributed system architectures, where the backend application is spread out (distributed) in an ecosystem of small and narrowly-focused services. These distributed services communicate with each other over the network to process requests.
There are many benefits to distributed system architectures, but we’re not trying to sell you on them in this blog post. A quick microservice vs. monolith Google search can do that. Instead, we’re going to focus on the difficulty of tracking and analyzing requests in a distributed environment; pinpointing problems can be very frustrating when requests might touch dozens of services during processing.
This is where Distributed Tracing (DT) comes in. DT tackles these problems in a very practical way and can make debugging certain thorny issues relatively easy and straightforward, even with highly-distributed systems.
But this post isn’t a deep-dive into using DT to debug issues either. Distributed tracing is great, but it loses a lot of its value if it’s not implemented properly in your services. DT adoption in the industry is spotty. It’s easy to do things slightly wrong and not realize it. The biggest challenge, however, is often in understanding core DT concepts. Teams quickly become self-sufficient and able to take advantage of DT’s benefits once that understanding is reached. Further exacerbating this challenge is the initial appearance of DT, which can seem highly technical and complex at first glance. DT core concepts are not highly technical or complex in reality, but their appearance can lead to a mental block when learning about them.
That is what this post is about: an attempt to boil down and demystify distributed tracing so its core concepts can be quickly absorbed by anybody, regardless of skill, experience, or interest level in DT. Because let’s face it: we’re all incredibly busy, with wide-ranging skillsets and more work than we have time to do. Something like distributed tracing can be a hard sell if you have to read a ton of documentation to grasp what it does and how it might benefit you (let alone what it would take to implement it in your services).
Also, since distributed tracing tries to be unobtrusive by design and not break your services if something is wrong, it can be difficult to know if you’ve completely and correctly implemented DT unless you have a solid foundation in the core distributed tracing concepts. The desire and capability to implement DT, and fix it when Murphy’s law inevitably strikes, should hopefully come naturally once you have a firm handle on the core concepts.
Distributed Tracing is the process of tracking and analyzing what happens to a request (transaction) across all services it touches.
That’s the essence of DT when you boil it down. There’s a lot of handwaving-the-details-away in that definition, but this post is about DT core concepts, and conceptually, that’s what DT is. DT is fundamentally about tracking and analyzing requests as they bounce around distributed architectures as a whole. This is in contrast to traditional monitoring that focuses on each service as an individual in isolation and where specific request details are lost in favor of aggregate metrics.
What do “tracking” and “analyzing” mean?
- “Tracking” means generating the raw data in each service that says, “I did some processing for a request with Trace ID abc123. Here’s what I did, what other services I talked to, and how long each chunk of work took.”
- “Analyzing” means using any of the various searching, aggregation, visualization, and other analysis tools that help you make sense of the raw tracing data.
Don’t get too bogged down in the details, however. This definition of distributed tracing is intentionally light on the details because there are many different DT implementations and a wide array of libraries and tools to help you track and analyze DT data. We don’t want to get distracted by the nuts and bolts yet.
For now, just keep in mind that distributed tracing is conceptually pretty straightforward: it’s about understanding how a request is processed as it hops from service to service.
Here are a few of the critical questions that DT can answer quickly and easily that might otherwise be a nightmare to answer in a distributed system architecture:
- What services did a request pass through? Both for individual requests and for the distributed architecture as a whole (service maps).
- Where are the bottlenecks? How long did each hop take? Again, DT answers this for individual requests and helps point out general patterns and intermittent anomalies between services in aggregate.
- How much time is lost due to network lag during communication between services (as opposed to in-service work)?
- What occurred in each service for a given request? At Nike, we tag our service log messages with the given request’s Trace ID. This allows us to easily find all log messages associated with a particular request across all services it went through (when combined with a log aggregation and search tool).
Side note: It’s debatable whether tagging logs with Trace IDs should be considered a canonical part of DT, but it’s an easy next step once your services are instrumented for DT. We’ve found it so critically useful that it’s hard to think about DT without including that tagging as part of it.
Distributed tracing is especially helpful on difficult-to-reproduce or intermittent problems. Without DT, the info might be lost forever or be so difficult to unearth that you’d never find it. When it’s 3:00 a.m. and an alert is going off, and you’re on-call, being able to use DT to quickly point to the not-my-service culprit is invaluable. Distributed tracing can often let you go back to bed in a few short minutes or at least point you in the right direction for why the problem is in your service. This saves you hours of guesswork, following red herrings, and painful debugging.
Distributed tracing is extremely useful even for a single service where upstream and downstream haven’t implemented DT. You’ll still be able to answer the question of how much time was spent in your service vs. waiting for outbound calls to complete. If an outbound call is especially slow, then you’ll know where to point the finger when someone asks why their request was laggy. These benefits apply even if nobody else is doing DT. So don’t wait for everyone around you to implement distributed tracing — do it for your services now, and reap significant benefits immediately.
In order to discuss the core concepts for how distributed tracing works, we first need to define some common nomenclature and explain the anatomy of a trace. We use Dapper-style tracers at Nike, based on the Google Dapper paper, so the main entities are Trace and Span. Note that distributed tracing has been around for a long time, so if you research DT you might find other tools and schemes that use different names. The concepts, however, are usually very similar:
- A Trace covers the entire request across all services it touches. It consists of all the Spans for the request.
- A Span is a logical chunk of work in a given Trace.
- Spans have parent-child relationships. This is a very important concept that is easy to miss if you’re new to DT.
The parent-child relationship is important enough to highlight a few details:
- A “child span” has one span that is its “parent.”
- A “parent span” can have multiple “child spans.”
- This parent-child relationship allows you to create a “trace tree” out of all the spans for a request.
- The trace tree always has one span that does not have a parent — this is the “root span.”
When a Trace is started, a Trace ID is generated that follows the request wherever it goes. A new Span is generated for each logical chunk of work in the request, which contains the same Trace ID, a new Span ID, and the Parent Span ID (which points to the span ID of the new span's logical parent). This Parent Span ID is what creates the parent-child relationship between spans. Spans also contain start timestamps and durations, so they can be placed on a timeline. This is all standard procedure for any Dapper-style distributed tracing scheme and is shown visually in the images above.
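To make the anatomy concrete, here is a minimal sketch of a span as a plain data structure. The class, field, and function names are illustrative assumptions rather than the API of any particular tracing library, and the 16-hex-character IDs are just one common convention.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass
class Span:
    # All names here are hypothetical, chosen only for this illustration.
    trace_id: str                  # shared by every span in the same trace
    span_id: str                   # unique to this span
    parent_span_id: Optional[str]  # None only for the root span
    name: str
    start_epoch_micros: int = field(default_factory=lambda: time.time_ns() // 1000)
    duration_micros: Optional[int] = None  # filled in when the span completes


def start_root_span(name: str) -> Span:
    """Start a new trace: new trace ID, new span ID, no parent."""
    return Span(trace_id=new_id(), span_id=new_id(), parent_span_id=None, name=name)


def start_child_span(parent: Span, name: str) -> Span:
    """Continue a trace: same trace ID, new span ID, parent points at `parent`."""
    return Span(
        trace_id=parent.trace_id,
        span_id=new_id(),
        parent_span_id=parent.span_id,
        name=name,
    )


root = start_root_span("GET /checkout")
db_call = start_child_span(root, "db-query")
payment_call = start_child_span(root, "POST payment-service")
```

Both children share the root's trace ID and point back at the root's span ID, which is all a visualization tool needs to rebuild the trace tree.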
For a given microservice and a given request, this usually manifests as one span for the overall request and one child span for each outbound call made to another service, database, etc., as part of that request.
In a distributed tracing scheme, there needs to be some kind of “span collector” that gathers the span data from the various services in a distributed systems architecture. This is necessary because spans in the same trace come from all over your network, and the primary thing you want to do during DT analysis is group spans by Trace ID so that you can see the whole request/operation/transaction at once. The span collector’s primary job is to provide these grouping and searching features, allowing you to discover interesting traces and search for all the span data related to a given trace in order to make sense of it.
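The core grouping job of a collector can be sketched in a few lines. This is a toy, in-memory illustration under the assumption that spans arrive as simple dictionaries; the field names and sample values are made up, and a real collector would add storage, indexing, and search on top.

```python
from collections import defaultdict

# Hypothetical span records as a collector might receive them, in arrival order.
incoming_spans = [
    {"trace_id": "abc123", "span_id": "s1", "parent_span_id": None, "service": "gateway"},
    {"trace_id": "def456", "span_id": "s7", "parent_span_id": None, "service": "gateway"},
    {"trace_id": "abc123", "span_id": "s2", "parent_span_id": "s1", "service": "orders"},
    {"trace_id": "abc123", "span_id": "s3", "parent_span_id": "s2", "service": "inventory"},
]


def group_by_trace(spans):
    """Bucket spans by trace ID so a whole request can be viewed at once."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def services_in_trace(spans, trace_id):
    """List the services that reported spans for one trace, in arrival order."""
    return [s["service"] for s in group_by_trace(spans).get(trace_id, [])]


traces = group_by_trace(incoming_spans)
```

Calling `services_in_trace(incoming_spans, "abc123")` then answers the "what services did this request pass through?" question for that one request.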
At Nike, we output the span data to log messages in a specific predictable format, where it will be picked up by our log aggregator for later searching and analysis. The log aggregator acts as our span collector. This is a bit unusual — many schemes send span data to a purpose-built distributed tracing span collector and visualization UI, such as the server component of the open source Zipkin system or the proprietary collector and visualization UI provided by a DT vendor.
Using logs as your span collector has some benefits and drawbacks. Logs are easy to employ, and you don’t need to operate and maintain another system for the span collection. If you have a good log aggregator, this can be very powerful when searching for interesting traces by just about any criteria. On the other hand, making sense of a complex trace using raw logs can be painful and time-consuming. Visualization is remarkably helpful, which is why we created an internal tool to export span data from our log aggregator to an ephemeral Zipkin server for ad-hoc, on-demand visualization (we hope to open source that log scraping → visualization tool at some point).
You should give some thought to how you’ll accomplish visualization, especially if you decide to go with logs as your span collector. For some debugging scenarios, there is no substitute for good visualization. Visualization is also valuable for helping those not steeped in distributed tracing make sense of the data their services are producing.
As mentioned earlier, at Nike we go one step further with our logs: every log message generated as part of processing a request is tagged with the request’s trace ID — even raw logs not related to DT. Our log aggregator automatically picks up the logs from all services and makes them searchable. Finding all logs associated with a specific request, across all services it touched, is then a simple matter of copy/pasting a trace ID into a search box.
This is incredibly powerful when debugging issues for a given request, and I cannot personally recommend it highly enough. In my opinion, having raw log messages tagged with trace IDs is as important from a practical day-to-day standpoint as the latency analysis DT gives you. I believe this should go hand-in-hand with any DT solution you pick up, even if it means a little extra work to make it happen beyond what the DT solution provides.
You don’t need to use logs as your span collector to benefit from this — the real win is tagging raw service log messages with trace IDs.
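As a rough illustration of log tagging, the sketch below uses Python's standard `logging` and `contextvars` modules to stamp every log line with the current request's trace ID. It is not how Nike's services do it; the logger name, the log format, and the `"-"` placeholder for "no active trace" are all assumptions made for the example.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being processed ("-" means none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


# Log to an in-memory buffer so the example is self-contained.
log_output = io.StringIO()
handler = logging.StreamHandler(log_output)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("orders-service")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(trace_id: str) -> None:
    """Pretend request handler: sets the trace ID for the duration of the request."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("looking up order")
    finally:
        current_trace_id.reset(token)


handle_request("abc123")
logger.info("background housekeeping")
```

The first line is logged with `trace_id=abc123` and the second with `trace_id=-`, so a log search for `abc123` returns only the lines that belong to that request.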
In order to understand how distributed tracing is implemented at a service level, it’s helpful to imagine a three-legged stool. Each leg is required for a three-legged stool to do its job. Distributed tracing works the same way: there are three distinct “Legs” that must be implemented in order for DT to work. If any one of those Legs is missing or broken, then distributed tracing can't do its job.
An important side note before we get to the three-legged stool analogy: libraries already exist for instrumenting your services for distributed tracing. You generally do not need to implement DT yourself, even if it appears fairly straightforward. This section is here to help you understand the concepts, which will make instrumentation and validation of DT in your services significantly easier. You’ll have a good understanding of what’s happening behind the scenes and what to look for, rather than blindly copy/pasting some incantation you find on Stack Overflow and hoping it works. This three-legged stool analogy should also give you a fighting chance when something goes wrong with DT and you need to debug and fix it. See the resources section at the end of this post for pointers in the right direction once you’re ready to get started on finding and implementing a DT solution.
When your service first starts processing a request, it should inspect the request for tracing information (tracing-related headers on an HTTP request, for example). If no tracing information is present, then your service should start a new trace with a new root span. Otherwise, when tracing info is present, your service should create a child span that continues the incoming trace.
- For a new trace, the root span would have a new trace ID, new span ID, and no parent span ID.
- For a child span continuing a trace, it would have the same trace ID as the incoming request, a new span ID, and a parent span ID that points to the incoming request’s span ID.
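The decision described above can be sketched as a single function. This is an illustration only: the header names follow the Zipkin B3 convention, while `SpanContext`, `new_id`, and `start_request_span` are hypothetical names invented for the example. In a real service, your tracing library would do this for you.

```python
import secrets
from typing import Mapping, NamedTuple, Optional


class SpanContext(NamedTuple):
    # Hypothetical minimal span representation for this example.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]


def new_id() -> str:
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def start_request_span(headers: Mapping[str, str]) -> SpanContext:
    """Start the overall request span, continuing an incoming trace if one exists."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    normalized = {key.lower(): value for key, value in headers.items()}
    incoming_trace_id = normalized.get("x-b3-traceid")
    incoming_span_id = normalized.get("x-b3-spanid")

    if incoming_trace_id and incoming_span_id:
        # Tracing info present: same trace ID, new span ID, parent is the caller's span.
        return SpanContext(incoming_trace_id, new_id(), incoming_span_id)

    # No tracing info: start a brand-new trace with a root span (no parent).
    return SpanContext(new_id(), new_id(), None)


root = start_request_span({})
child = start_request_span(
    {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312"}
)
```

Here `root` has no parent span ID, while `child` keeps the incoming trace ID and records the caller's span ID as its parent.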
For HTTP requests, this behavior is usually implemented as a request/response filter so that it happens automatically, and you don’t have to remember to do it for each service endpoint. A Servlet filter would be a common solution in many Java HTTP server frameworks, for example. Most server frameworks have a similar request/response filter mechanism, regardless of language or stack.
You should also strongly consider returning the Trace ID as a response header so that callers can inspect the response and copy/paste the Trace ID for log searching or trace visualization. This is another of those things that isn’t necessarily canonical DT, but we’ve found it so useful that it’s become an integral part of how Nike does DT.
When your service makes an outbound call to a different service, you should first surround that outbound call with a child span. Then, ensure the child span’s tracing information is propagated with the call so that the receiving service can continue the trace as described in Leg 1.
Tracing info propagation is usually done via HTTP headers in HTTP calls, but other types of calls may use other mechanisms (message attributes for AWS SNS/SQS queue messages, for example). Use request/message metadata when you can to avoid tightly coupling your business payloads to a specific tracing implementation. If no such metadata exists, then you can embed tracing info directly in the data payload as a last resort, as long as the receiver knows how to extract it.
Different DT schemes may have different specifications for how they recommend you propagate tracing info on outbound calls (depending on the protocol), so you’ll need to follow the advice of whatever DT tooling you happen to use. The important thing is that the sender and receiver are using the same rules and names for the tracing info so the receiver can successfully find the propagated tracing info and continue the trace.
For a concrete example, we use the Zipkin/B3 spec at Nike when propagating tracing info over HTTP calls. This means we would pass X-B3-TraceId, X-B3-SpanId, and X-B3-ParentSpanId request headers on outbound HTTP calls (as specified by the B3 rules), and the receiving HTTP server would extract them and continue the trace appropriately.
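Here is a minimal sketch of the outbound side: wrap the call in a child span, then send that child span's IDs as B3 headers. The `X-B3-*` header names come from the B3 spec; everything else (`SpanContext`, `child_span`, `b3_headers`, the fake HTTP client, and the example URL) is a hypothetical stand-in so the example can run on its own.

```python
import secrets
import time
from contextlib import contextmanager
from typing import Dict, NamedTuple, Optional


class SpanContext(NamedTuple):
    # Hypothetical minimal span representation for this example.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]


completed_spans = []  # stand-in for reporting spans to a collector


@contextmanager
def child_span(parent: SpanContext, name: str):
    """Surround a chunk of work with a child span and record how long it took."""
    span = SpanContext(parent.trace_id, secrets.token_hex(8), parent.span_id)
    start = time.perf_counter()
    try:
        yield span
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        completed_spans.append((name, span, duration_ms))


def b3_headers(span: SpanContext) -> Dict[str, str]:
    """Build the B3 propagation headers for an outbound HTTP call."""
    headers = {"X-B3-TraceId": span.trace_id, "X-B3-SpanId": span.span_id}
    if span.parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = span.parent_span_id
    return headers


def fake_http_get(url: str, headers: Dict[str, str]) -> dict:
    """Hypothetical stand-in for a real HTTP client call."""
    return {"url": url, "headers_sent": headers}


request_span = SpanContext("463ac35c9f6413ad", "a2fb4a1d1a96d312", None)
with child_span(request_span, "GET inventory-service") as span:
    response = fake_http_get("http://inventory.example/items/42", b3_headers(span))
```

The receiving service sees the child span's ID in `X-B3-SpanId`, so the span it creates becomes a child of the outbound-call span rather than of the overall request span.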
Note: There are cases where you might make an outbound call to a service or data store that doesn’t support DT, and you don’t need to (or can’t) supply propagation info. You should still surround that outbound call with a child span in your service because it allows you to see how much time was spent waiting for the call to return when you’re visualizing that trace.
Leg 1 involves creating the appropriate span for the request in your service. Leg 2 involves surrounding outbound calls with a child span and passing that child span info to the service you're calling. Leg 3 is the process of getting the span from Leg 1 → Leg 2.
Conceptually, this is simple. In practice, it varies widely and depends on language, stack, and request-processing patterns. Leg 3 is, therefore, sometimes a difficult Leg to get right.
For example, in Java it’s common to embed the current span for a request in a ThreadLocal so that the span doesn't need to be passed around explicitly, but it can still be accessed at any time by request-processing logic running on the same thread. This lets DT be an auto-magic feature that "just works" in thread-per-request frameworks. It also lets us automatically tag log messages with the relevant trace ID (using the SLF4J MDC, for those curious about exactly how that's accomplished). For many frameworks and usage scenarios, this allows DT to be an invisible feature to most devs, where they get all the benefit without any direct DT interaction once things are set up properly. This works great until your request-processing needs to hop threads, at which point Leg 3 breaks and DT fails. In reactive non-blocking frameworks and libraries, this thread-hopping is the norm, not the exception. This means that supporting Leg 3 in Java can get ugly, depending on what you're doing.
So in Java, Leg 3 usually boils down to making sure the current span hops threads when request-processing is asynchronous. In some cases, this can be done automatically when a given async library or framework exposes the necessary hooks for thread-hopping. In other cases, where the hooks don't exist (or are too restrictive), then you'll need to manually make the span hop threads yourself. There are usually helpers you can use to make the manual process much easier and less error-prone. You can see some examples of these manual thread-hopping helpers in the readme for the Java 8 module of Nike's Wingtips libraries.
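The same failure mode is easy to reproduce outside Java. The sketch below uses Python's `threading.local` as a rough analog of a Java ThreadLocal, and shows both the break and a manual thread-hopping helper. The helper name `with_current_span` and the string used as a "span" are assumptions for illustration; this is not the Wingtips API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_span_holder = threading.local()  # rough Python analog of a Java ThreadLocal


def set_current_span(span):
    _span_holder.span = span


def get_current_span():
    # Returns None when nothing has been set on the calling thread.
    return getattr(_span_holder, "span", None)


def with_current_span(fn):
    """Capture the caller's span now and reinstall it on whichever thread runs fn."""
    captured = get_current_span()

    def wrapper(*args, **kwargs):
        previous = get_current_span()
        set_current_span(captured)
        try:
            return fn(*args, **kwargs)
        finally:
            # Restore the worker's old state so the span doesn't leak into
            # unrelated work later scheduled on the same pooled thread.
            set_current_span(previous)

    return wrapper


set_current_span("span-for-request-abc123")  # a plain string stands in for a span

with ThreadPoolExecutor(max_workers=1) as pool:
    seen_without_helper = pool.submit(get_current_span).result()
    seen_with_helper = pool.submit(with_current_span(get_current_span)).result()
    seen_afterwards = pool.submit(get_current_span).result()
```

Without the helper, the worker thread sees no span at all, which is exactly how Leg 3 silently breaks. With the helper, the span hops threads for that one task and is cleaned up afterwards.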
Different distributed tracing libraries will have different recommended ways of tackling Leg 3. Ultimately, any solution is acceptable, as long as you can successfully get the span information from the start of request-processing (Leg 1) to the outbound calls (Leg 2).
Again, this is a straightforward concept, but can be difficult to accomplish in practice for a variety of reasons. When you run into problems with this Leg, keep your eye on the simplicity of the concept; it can help guide you as you debug the issue.
Ideally, distributed tracing instrumentation for a service is something done at the framework and library configuration level and should be invisible to developers on a day-to-day basis. Leg 1 and Leg 2 are often fully automatic and, depending on how restrictive a developer's environment is, sometimes DT can be fully hidden in a service.
Leg 3, in particular, exposes one of the dangers of trying to completely hide distributed tracing from service developers as a feature they "shouldn't have to worry about." While it would be nice if you could guarantee DT would always work, in reality it tends to break as soon as developers do something outside the walled garden where DT has been configured to automatically work. In practice, this can happen on a regular basis as devs pull in new libraries or frameworks that haven't been configured for DT. Teams must be armed with some basic knowledge of how DT works and where the limitations are for auto-magic instrumentation in the languages, frameworks, and libraries they're using. Otherwise, DT tends to get broken by accident, and nobody notices until it's already been pushed to production.
One of the main purposes of this blog post is to provide that basic DT knowledge so devs can (1) avoid breaking DT in the first place, and (2) know where to look to debug and fix DT quickly, if and when something does go wrong.
Once you have the three legs implemented in your service, it’s time to verify that your service is handling distributed tracing correctly and completely.
Trace visualization is highly recommended for DT verification. Problems usually jump out immediately.
DT correctness verification starts with some quick local debugging to ensure all parts of the three-legged stool are working correctly in a given service. Next, verify that cross-service trace propagation is working. This can be done by inspecting raw span data or consulting your DT visualization for a given trace. You should be able to see that the same trace ID is used across upstream and downstream services for the same request, and you should be able to follow the parent → child chain using span ID and parent span ID values.
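If you are inspecting raw span data, the correctness checks above can be automated with a small script. The sketch below assumes you have already collected every span for one request as dictionaries with `trace_id`, `span_id`, and `parent_span_id` keys; those field names and the sample data are illustrative, not a standard format.

```python
def find_trace_problems(spans):
    """Return human-readable problems found in the spans of a single request."""
    problems = []

    # Every span for the same request should carry the same trace ID.
    trace_ids = {s["trace_id"] for s in spans}
    if len(trace_ids) > 1:
        problems.append(f"multiple trace IDs: {sorted(trace_ids)}")

    # A complete trace tree has exactly one span without a parent: the root span.
    roots = [s for s in spans if s["parent_span_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected exactly 1 root span, found {len(roots)}")

    # Every parent span ID should point at a span we actually have.
    span_ids = {s["span_id"] for s in spans}
    for s in spans:
        parent = s["parent_span_id"]
        if parent is not None and parent not in span_ids:
            problems.append(f"span {s['span_id']} has unknown parent {parent}")

    return problems


good_trace = [
    {"trace_id": "abc123", "span_id": "s1", "parent_span_id": None},
    {"trace_id": "abc123", "span_id": "s2", "parent_span_id": "s1"},
    {"trace_id": "abc123", "span_id": "s3", "parent_span_id": "s2"},
]

# Downstream service ignored the incoming headers and started its own trace.
broken_trace = [
    {"trace_id": "abc123", "span_id": "s1", "parent_span_id": None},
    {"trace_id": "zzz999", "span_id": "s2", "parent_span_id": None},
]
```

`find_trace_problems(good_trace)` returns an empty list, while the broken trace is flagged for both the mismatched trace IDs and the extra root span. Both are typical symptoms of a service that failed to continue the incoming trace.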
DT completeness verification revolves around making sure that a given service’s overall request span durations are completely accounted for. Many microservices don’t do much time-consuming work themselves. Usually, the entire duration of the request in a service is spent waiting for outbound calls to complete. If your service falls into this category, then DT completeness verification is fairly straightforward. As mentioned earlier, the general pattern for DT in a service is to have one overall request span as well as individual child spans around each outbound call. Since each span contains a start timestamp and duration info, you should be able to line up the child spans and see no “gaps” in the overall request span. If you have large gaps where the overall request span isn’t being covered by a child span, then you’re either doing some serious number crunching in your service, or (more likely) you have an outbound call that is not being surrounded by a child span.
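The gap check can be sketched as a simple interval sweep. The sketch assumes each span is a dictionary with `start_ms` and `duration_ms` fields; the field names, the sample numbers, and the 10 ms threshold are assumptions for illustration.

```python
def uncovered_gaps(request_span, child_spans, min_gap_ms=0):
    """Return (start, end) intervals of the request span not covered by any child span."""
    request_start = request_span["start_ms"]
    request_end = request_start + request_span["duration_ms"]

    gaps = []
    cursor = request_start  # everything before the cursor is already accounted for
    for child in sorted(child_spans, key=lambda s: s["start_ms"]):
        child_start = child["start_ms"]
        child_end = child_start + child["duration_ms"]
        if child_start - cursor > min_gap_ms:
            gaps.append((cursor, child_start))
        cursor = max(cursor, child_end)  # handles overlapping or parallel children

    if request_end - cursor > min_gap_ms:
        gaps.append((cursor, request_end))
    return gaps


# Hypothetical spans for one request in one service.
request = {"start_ms": 0, "duration_ms": 500}
children = [
    {"name": "auth-service", "start_ms": 5, "duration_ms": 95},     # covers 5..100
    {"name": "orders-db", "start_ms": 100, "duration_ms": 150},     # covers 100..250
    {"name": "email-service", "start_ms": 420, "duration_ms": 75},  # covers 420..495
]

gaps = uncovered_gaps(request, children, min_gap_ms=10)
```

With this sample data, the only gap larger than 10 ms runs from 250 ms to 420 ms. That 170 ms stretch is where to go looking for an outbound call that is missing its child span.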
If distributed tracing isn’t working in a given service, then one of the Legs of the three-legged DT stool is broken in that service. Which leg is broken is often evident from looking at logs or visualizations, but you should be able to verify with some simple debugging. Debugging would involve making sure that the overall request span is started (Leg 1), reaches the spots where outbound calls happen (Leg 3), and is propagated downstream on those outbound calls after surrounding with a child span (Leg 2).
Search around once you’re ready to implement distributed tracing. You’ll find libraries and tools for DT in almost any language and stack/framework you want. Here are a few links to get you started:
- Zipkin is always a good first stop, as it has support for a wide variety of languages and frameworks and a thriving community.
- Jaeger is another distributed tracing solution worth checking out. It is inspired by Dapper and Zipkin and has support for several languages, backends, and use cases.
- OpenCensus is a distributed tracing and metrics library that implements the Three Legs for a variety of use cases and supports Zipkin's de facto standard B3 wire format by default. Plugins are used to allow for swappable data formats and backends.
- OpenTracing promises vendor-neutral APIs for instrumentation.
- I’ll also take this opportunity for a shameless self-promotional plug of Nike’s open source Wingtips libraries, which are a set of Zipkin-compatible libraries for the JVM with support for some common frameworks and clients.
Do some research, and find the solution that’s right for your needs. Keep in mind what you’ll need to do per service for each of the three Legs discussed earlier as well as how you want to collect and visualize span data.
Distributed tracing is critical to operating, maintaining, and debugging services in a distributed systems architecture. DT directly translates into providing the best possible experience for consumers and developers alike. It provides deep insight into individual services as well as into how they communicate and work together as a whole. Distributed tracing makes the lives of everyone — from product owners, to prod support, to service developers — much easier and less frustrating.
Hopefully this post has given you some practical insight into distributed tracing core concepts and made you eager to implement DT in your services. Or, if you have DT but it’s incomplete or not working quite right, then hopefully you now feel armed with the necessary knowledge to verify or fix/improve DT in your services.
Leave some comments if you have questions — otherwise, happy tracing!