Distributed tracing: A complete guide
Track requests across services and understand why systems break
Table of Contents
- What is distributed tracing?
- Anatomy of a trace
- Why distributed tracing?
- Distributed tracing considerations
- Data vs insight
- Proactive solutions with distributed tracing
- Planning optimizations: How do you know where to begin?
- Distributed tracing tools
- Distributed tracing vs logging
- Continuing to pioneer distributed tracing
Engineering organizations building microservices or serverless at scale have come to recognize distributed tracing as a baseline necessity for software development and operations.
Why? Because distributed tracing surfaces what happens across service boundaries: what’s slow, what’s broken, and which specific logs and metrics can help resolve the incident at hand. Tracing tells the story of an end-to-end request, covering everything from mobile performance to database health. Before we dive any deeper, let’s start with the basics.
Distributed tracing refers to methods of observing requests as they propagate through distributed systems. It’s a diagnostic technique that reveals how a set of services coordinate to handle individual user requests. A single trace typically shows the activity for an individual transaction or request within the application being monitored, from the browser or mobile device down through to the database and back. In aggregate, a collection of traces can show which backend service or database is having the biggest impact on performance as it affects your users’ experiences.
In distributed tracing, a single trace contains a series of tagged time intervals called spans. A span can be thought of as a single unit of work. Spans have a start and end time, and optionally may include other metadata like logs or tags that can help classify “what happened.” Spans have relationships between one another, including parent-child relationships, which are used to show the specific path a particular transaction takes through the numerous services or components that make up the application.
- Trace: represents an end-to-end request; made up of one or more spans
- Span: represents work done by a single service, with time intervals and associated metadata; the building blocks of a trace
- Tags: metadata that helps contextualize a span
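The relationships above can be sketched with a minimal, hypothetical data model. (Real instrumentation would come from a library such as OpenTelemetry; the field names here are illustrative, not any particular wire format.)

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A single unit of work: a tagged time interval within a trace."""
    trace_id: str              # shared by every span in one end-to-end request
    span_id: str
    parent_id: Optional[str]   # None for the root span
    operation: str
    start_ms: float
    end_ms: float
    tags: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

# A trace is simply the set of spans sharing a trace_id; the parent-child
# links reconstruct the path the request took through the system.
root = Span("t1", "s1", None, "GET /checkout", 0.0, 120.0,
            {"http.status_code": 200})
child = Span("t1", "s2", "s1", "SELECT orders", 15.0, 95.0,
             {"db.system": "postgres"})

assert child.parent_id == root.span_id   # the database call is a child of the request
```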
The point of traces is to provide a request-centric view. So, while microservices enable teams and services to work independently, distributed tracing provides a central resource that enables all teams to understand issues from the user’s perspective.
Modern software architectures built on microservices and serverless introduce advantages to application development, but there’s also the cost of reduced visibility. Teams can manage, monitor, and operate their individual services more easily, but they can easily lose sight of the global system behavior. During an incident, a customer may report an issue with a transaction that is distributed across several microservices, serverless functions, and teams. It becomes nearly impossible to differentiate the service that is responsible for the issue from those that are affected by it.
Distributed tracing provides end-to-end visibility and reveals service dependencies – showing how the services respond to each other. By being able to visualize transactions in their entirety, you can compare anomalous traces against performant ones to see the differences in behavior, structure, and timing. This information allows you to better understand the culprit behind the observed symptoms and jump straight to the performance bottlenecks in your systems.
Observing microservices and serverless applications becomes very difficult at scale: the volume of raw telemetry data can increase exponentially with the number of deployed services. Traditional log aggregation becomes costly, time-series metrics can reveal a swarm of symptoms but not the interactions that caused them (due to cardinality limitations), and naively tracing every transaction can introduce both application overhead as well as prohibitive cost in data centralization and storage. A strategic approach to observability data ingestion is required. Lightstep was designed to handle the requirements of distributed systems at scale: for example, Lightstep handles 100 billion microservices calls per day on Lyft’s Envoy-based service architecture.
Conventionally, distributed tracing solutions have addressed the volume of trace data generated via upfront (or “head-based”) sampling: some fixed fraction of traces is “thrown away” at the start of the request to reduce application and monitoring-system overhead. The drawback is that it’s statistically likely that the most important outliers will be discarded. When anomalous, performance-impacting transactions are discarded and never considered, the aggregate latency statistics will be inaccurate and valuable traces will be unavailable for debugging critical issues. Tail-based sampling, where the sampling decision is deferred until the moment individual transactions have completed, can be an improvement. The downside, particularly for agent-based solutions, is increased memory load on the hosts, because all of the span data must be held for transactions that are still in progress.
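The difference between the two strategies can be sketched as follows. This is a simplified illustration under assumed thresholds, not any vendor’s actual sampler: the head-based decision is made before the outcome of the request is known, while the tail-based decision can condition on latency and errors.

```python
import random

def head_sample(rate: float) -> bool:
    """Head-based: decide before the request runs -- latency and errors are unknown."""
    return random.random() < rate

def tail_sample(duration_ms: float, error: bool, slow_ms: float = 500.0) -> bool:
    """Tail-based: decide after completion, so outliers can always be kept."""
    return error or duration_ms > slow_ms

# A slow, failing request: head-based sampling at 1% will usually drop it,
# while tail-based sampling is guaranteed to keep it.
assert tail_sample(duration_ms=2300.0, error=True)
assert not tail_sample(duration_ms=40.0, error=False)
```

The memory cost mentioned above follows directly from the signature of `tail_sample`: its inputs only exist once the transaction finishes, so every in-progress span must be buffered until then.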
Lightstep analyzes 100% of unsampled event data in order to understand the broader story of performance across the entire stack. Unlike head-based sampling, we’re not limited by decisions made at the beginning of a trace, which means we’re able to identify rare, low-fidelity, and intermittent signals that contributed to service or system latency. And unlike tail-based sampling, we’re not limited to looking at each request in isolation: data from one request can inform sampling decisions about other requests. This dynamic sampling means we can analyze all of the data but only send the information you need to know. Lightstep stores the information required to understand each mode of performance, explain every error, and build intelligent aggregates for the facets that matter most to each developer, team, and organization.
A distributed tracing solution is absolutely crucial for understanding the factors that affect application latency. However, modern applications are developed using different programming languages and frameworks, and they must support a wide range of mobile and web clients. To effectively measure latency, distributed tracing solutions need to follow concurrent and asynchronous calls from end-user web and mobile clients all the way down to servers and back, through microservices and serverless functions.
Distributed traces on their own are just analytical data, much like raw time-series metrics or log files. So even if the right traces are captured, solutions must provide valuable insights about these traces to put them in the right context for the issues being investigated. For example, “when did the end-user response time slow for this customer?” or “did our latest nightly build cause this spike in failures?” Answering these questions requires aggregate trace data analysis on a global scale beyond individual hosts, an understanding of historical performance, and the ability to segment spans without cardinality limitations.
Lightstep automatically surfaces whatever is most likely causing an issue: anything from an n+1 query to a slow service to actions taken by a specific customer to something running in sequence that should be in parallel. Latency and error analysis drill downs highlight exactly what is causing an incident, and which team is responsible. This allows you to focus on work that is likely to restore service, while simultaneously eliminating unnecessary disruption to developers who are not needed for incident resolution, but might otherwise have been involved.
The same way a doctor first looks for inflammation, reports of pain, and high body temperature in any patient, it is critical to understand the symptoms of your software’s health. Is your system experiencing high latency, spikes in saturation, or low throughput? These symptoms can be easily observed, and are usually closely related to SLOs, making their resolution a high priority. Once a symptom has been observed, distributed tracing can help identify and validate hypotheses about what has caused this change.
It is important to use symptoms (and other measurements related to SLOs) as drivers for this process, because there are thousands — or even millions — of signals that could be related to the problem, and (worse) this set of signals is constantly changing. While there might be an overloaded host somewhere in your application (in fact, there probably is!), it is important to ask yourself the bigger questions: Am I serving traffic in a way that is actually meeting our users’ needs? Is that overloaded host actually impacting performance as observed by our users?
In the next section, we will look at how to start with a symptom and track down a cause. Spoiler alert: it’s usually because something changed.
Service X is down. What happened? As a service owner your responsibility will be to explain variations in performance — especially negative ones. A great place to start is by finding out what, if any, changes have been made to the system prior to the outage. Sometimes it’s internal changes, like bugs in a new version, that lead to performance issues. At other times it’s external changes — be they changes driven by users, infrastructure, or other services — that cause these issues.
The next few examples focus on single-service traces and using them to diagnose these changes. While tracing also provides value as an end-to-end tool, tracing starts with individual services and understanding the inputs and outputs of those services. It can help map changes from those inputs to outputs, and help you understand what actions you need to take next.
Perhaps the most common cause of changes to a service’s performance is a deployment of that service itself. Still, that doesn’t mean observability tools are off the hook. Distributed tracing must be able to break down performance across different versions, especially when services are deployed incrementally. This means tagging each span with the version of the service that was running at the time the operation was serviced.
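With a version tag on every span, comparing an incremental rollout against the previous release becomes a simple group-by. The sketch below assumes spans carry a hypothetical `service.version` tag (the name follows OpenTelemetry’s semantic-convention style, but any consistent tag key works); the span records are invented for illustration.

```python
from statistics import median

# Hypothetical span records, each tagged with the deployed service version.
spans = [
    {"operation": "GET /cart", "duration_ms": 45,  "tags": {"service.version": "v1.4"}},
    {"operation": "GET /cart", "duration_ms": 48,  "tags": {"service.version": "v1.4"}},
    {"operation": "GET /cart", "duration_ms": 310, "tags": {"service.version": "v1.5"}},
    {"operation": "GET /cart", "duration_ms": 295, "tags": {"service.version": "v1.5"}},
]

def latency_by_version(spans):
    """Group span durations by version tag to compare incremental deploys."""
    by_version = {}
    for s in spans:
        by_version.setdefault(s["tags"]["service.version"], []).append(s["duration_ms"])
    return {version: median(durations) for version, durations in by_version.items()}

medians = latency_by_version(spans)
assert medians["v1.5"] > medians["v1.4"]   # the new rollout regressed latency
```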
Changes to service performance can also be driven by external factors. Your users will find new ways to leverage existing features or will respond to events in the real world that will change the way they use your application. For example, users may leverage a batch API to change many resources simultaneously or may find ways of constructing complex queries that are much more expensive than you anticipated. A successful ad campaign can also lead to a sudden deluge of new users who may behave differently than your more tenured users.
Being able to distinguish these examples requires both adequate tagging and sufficient internal structure to the trace. Tags should capture important parts of the request (for example, how many resources are being modified or how long the query is) as well as important features of the user (for example, when they signed up or what cohort they belong to).
In addition, traces should include spans that correspond to any significant internal computation and any external dependency. One common insight from distributed tracing is to see how changing user behavior causes more database queries to be executed as part of a single request. Avoid spans for operations that occur in lockstep with the parent spans and don’t have significant variation in performance.
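The “more database queries per request” insight above can be mechanized: with a span for each query, an N+1 pattern shows up as the same operation repeated many times within one trace. A minimal sketch, using invented trace data:

```python
from collections import Counter

# Hypothetical child spans per trace: a healthy request (t1), and one where a
# code path issues a separate query per row (t2) -- the classic N+1 pattern.
child_spans = [("t1", "SELECT orders WHERE user_id = ?")] \
            + [("t2", "SELECT order_item WHERE id = ?")] * 25

def repeated_queries(spans, threshold=10):
    """Flag traces that repeat the same query many times in a single request."""
    counts = Counter(spans)   # (trace_id, operation) -> occurrences
    return {trace for (trace, _op), n in counts.items() if n >= threshold}

assert repeated_queries(child_spans) == {"t2"}
```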
All the planning in the world won’t lead to perfect resource provisioning and seamless performance. And isolation isn’t perfect: threads still run on CPUs, containers still run on hosts, and databases provide shared access. Contention for any of these shared resources can affect a request’s performance in ways that have nothing to do with the request itself.
As above, it’s critical that spans and traces are tagged in a way that identifies these resources: every span should have tags that indicate the infrastructure it’s running on (datacenter, network, availability zone, host or instance, container) and any other resources it depends on (databases, shared disks). For spans representing remote procedure calls, tags describing the infrastructure of your service’s peers (for example, the remote host) are also critical.
With these tags in place, aggregate trace analysis can determine when and where slower performance correlates with the use of one or more of these resources. This, in turn, lets you shift from debugging your own code to provisioning new infrastructure or determining which team is abusing the infrastructure that’s currently available.
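As a toy version of that aggregate analysis: once every span carries an infrastructure tag, finding a contended shared resource reduces to comparing latency across tag values. The `host` tag and span records below are invented for illustration.

```python
from statistics import mean

# Hypothetical spans tagged with the host they ran on.
spans = [
    {"duration_ms": 40,  "tags": {"host": "web-1"}},
    {"duration_ms": 42,  "tags": {"host": "web-1"}},
    {"duration_ms": 650, "tags": {"host": "web-2"}},   # contended host
    {"duration_ms": 700, "tags": {"host": "web-2"}},
]

def slowest_host(spans):
    """Aggregate latency per infrastructure tag to spot shared-resource contention."""
    by_host = {}
    for s in spans:
        by_host.setdefault(s["tags"]["host"], []).append(s["duration_ms"])
    return max(by_host, key=lambda host: mean(by_host[host]))

assert slowest_host(spans) == "web-2"
```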
The last type of change we will cover is upstream changes: changes to the services that your service depends on. Having visibility into the behavior of your service’s dependencies is critical in understanding how they are affecting your service’s performance. Remember, your service’s dependencies are — just based on sheer numbers — probably deploying a lot more frequently than you are. And even with the best intentions around testing, they are probably not testing performance for your specific use case. Simply by tagging egress operations (spans emitted from your service that describe the work done by others), you can get a clearer picture when upstream performance changes. (And even better if those services are also emitting span tags with version numbers.)
So far we have focused on using distributed tracing to efficiently react to problems. But this is only half of distributed tracing’s potential. How can your team use distributed tracing to be proactive?
The first step is to establish ground truths for your production environments. What are the average demands on your system? With the insights of distributed tracing, you can get the big picture of your service’s day-to-day performance expectations, allowing you to move on to the second step: improving the aspects of performance that will most directly improve the user’s experience (thereby making your service better!).
- Step One: establish ground truths for production
- Step Two: make it better!
The following are examples of proactive efforts with distributed tracing: planning optimizations and evaluating SaaS performance.
Your team has been tasked with improving the performance of one of your services — where do you begin? Before you settle on an optimization path, it is important to get the big-picture data of how your service is working. Remember, establish ground truth, then make it better!
Answering these questions will set your team up for meaningful performance improvements:
- What needs to be optimized? Settle on a specific and meaningful SLI, like p99 latency.
- Where do these optimizations need to occur? Use distributed tracing to find the biggest contributors to the aggregate critical path.
Once you’ve identified the operations on the critical path, consider Amdahl’s Law, which describes the limit on the improvement available to a whole task from speeding up only part of that task. Applying Amdahl’s Law appropriately helps ensure that optimization efforts are, well, optimized.
Suppose operation A accounts for 15% of the total trace duration and operation B accounts for the rest. Amdahl’s Law tells us that focusing on the performance of operation A is never going to improve overall performance by more than 15%, even if A were fully optimized away. If your real goal is improving the performance of the trace as a whole, you need to figure out how to optimize operation B. There’s no reason to waste time or money on uninformed optimizations.
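The arithmetic behind that 15% ceiling is worth making explicit. Amdahl’s Law gives the overall speedup as 1 / ((1 − p) + p/s), where p is the fraction of time spent in the optimized part and s is its speedup factor:

```python
def amdahl_speedup(fraction: float, speedup: float) -> float:
    """Overall speedup when `fraction` of the work is made `speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

# If operation A is only 15% of the trace, even a near-infinite speedup of A
# caps the overall improvement at 1 / (1 - 0.15), i.e. roughly 15% less time.
best_case = amdahl_speedup(fraction=0.15, speedup=1e9)
assert best_case < 1.18                       # at most ~1.18x overall
assert round(1 - 1 / best_case, 2) == 0.15    # at most ~15% of time removed
```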
There are many ways to incorporate distributed tracing into an observability strategy. There are open source tools, small business and enterprise tracing solutions, and of course, homegrown distributed tracing technology.
Open Source Distributed Tracing:
- OpenTelemetry: the next major version of both OpenCensus and OpenTracing supported by the Cloud Native Computing Foundation (CNCF)
Enterprise Tracing Solutions:
- Amazon X-Ray
- Google Cloud Trace
- New Relic
When it comes to leveraging telemetry, Lightstep understands that developers need access to the most actionable data, be it from traces, metrics, or logs. While logs have traditionally been considered a cornerstone of application monitoring, they can be very expensive to manage at scale, difficult to navigate, and only provide discrete event information. By themselves, logs fail to provide the comprehensive view of application performance afforded by traces. This is why Lightstep relies on distributed traces as the primary source of truth, surfacing only the logs that are correlated to regressions or specific search queries.
Lightstep aims to help people design and build better production systems at scale. Ben Sigelman, Lightstep CEO and Co-founder, was one of the creators of Dapper, Google’s distributed tracing solution. We’re creators of OpenTelemetry and OpenTracing, the open standard, vendor-neutral solutions for API instrumentation.
Equip your team with more than just basic tracing. Lightstep is engineered from its foundation to address the inherent challenges of monitoring distributed systems and microservices at scale. Lightstep’s innovative Satellite Architecture analyzes 100% of unsampled transaction data to produce complete end-to-end traces and robust metrics that explain performance behaviors and accelerate root-cause analysis.