Engineering organizations building microservices or serverless applications at scale have come to recognize distributed tracing as a baseline necessity for both software development and operations. But not all distributed tracing systems are created equal: the traces are “the fuel, not the car,” and an organization must consider everything, from analytical capabilities to portability to scale, when selecting a distributed tracing solution.
Before we get to those considerations, though, let’s start with the basics.
Distributed tracing refers to methods of observing requests as they propagate through distributed systems. A single trace typically shows the activity for an individual transaction or request within the application being monitored, from the browser or mobile device all the way down to the database and back.
A trace contains a series of tagged time intervals called “spans,” which are often at the granularity of RPCs or other cross-process API calls. A span can be thought of as a single “unit of work”: it has a start and end time, and may include logs or tags that classify “what happened” during the span. Spans are related to one another (for example, as caller and callee), and these relationships are used to show the specific path a particular transaction takes through the numerous services or components that make up the application.
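As a rough illustration of the span model described above, here is a minimal Python sketch. The `Span` class, its field names, and the `path_to_root` helper are hypothetical and chosen only for this example; they are not the data model of any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work: a named, timed interval with optional tags and logs."""
    trace_id: str
    span_id: str
    operation: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None  # None marks the root span of the trace
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[str] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def path_to_root(spans: List[Span], span_id: str) -> List[str]:
    """Follow parent links from one span up to the root; return operations root-first."""
    by_id = {s.span_id: s for s in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current.operation)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


# A three-span trace: browser request -> payment service -> database.
trace = [
    Span("t1", "a", "GET /checkout", 0.0, 120.0),
    Span("t1", "b", "payment-service.charge", 10.0, 100.0, parent_id="a"),
    Span("t1", "c", "db.query", 20.0, 90.0, parent_id="b", tags={"db.type": "sql"}),
]

print(path_to_root(trace, "c"))
```

Walking the parent links from the database span back to the root recovers the path the transaction took through the services.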
Modern software architectures built on microservices and serverless bring advantages to application development, but at the cost of reduced visibility. Teams can manage, monitor, and operate their individual services more easily, but they can just as easily lose sight of global system behavior. During an incident, a customer may report an issue with a transaction that is distributed across several microservices, serverless functions, and teams. Without end-to-end visibility, it becomes nearly impossible to differentiate the service that is responsible for the issue from those that are merely affected by it.
Distributed tracing provides end-to-end visibility and reveals service dependencies, showing how the services respond to each other. By visualizing transactions in their entirety, you can compare anomalous traces against performant ones to see the differences in behavior, structure, and timing. This information helps you identify the culprit behind the observed symptoms and go straight to the performance bottlenecks in your systems.
Observing microservices and serverless applications becomes very difficult at scale: the volume of raw telemetry data can increase exponentially with the number of deployed services. Traditional log aggregation becomes costly, time-series metrics can reveal a swarm of symptoms but not the interactions that caused them (due to cardinality limitations), and naively tracing every transaction can introduce both application overhead and prohibitive costs for centralizing and storing the data. A strategic approach to observability data ingestion is required. LightStep [x]PM was designed to handle the requirements of distributed systems at scale: for example, [x]PM handles 100 billion microservices calls per day on Lyft’s Envoy-based service architecture.
Conventionally, distributed tracing solutions have addressed the volume of trace data through upfront (or “head-based”) sampling: they “throw away” a fixed fraction of traces at the start of each request to improve application and monitoring system performance. The drawback is that the sampling decision is made before anything is known about how the request will behave, so it is statistically likely that the most important outliers will be discarded. When anomalous, performance-impacting transactions are discarded, the aggregate latency statistics will be inaccurate and valuable traces will be unavailable for debugging critical issues. Tail-based sampling, where the sampling decision is deferred until individual transactions have completed, can be an improvement. However, the downside, particularly for agent-based solutions, is increased memory load on the hosts, because all of the span data for “in-progress” transactions must be held until the decision is made.
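The trade-off between the two sampling strategies can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not how any particular product implements sampling: the function and class names, the 1% sampling rate, the latency values, and the “keep any trace with a span slower than a threshold” policy are all hypothetical.

```python
import random
from typing import Dict, Iterable, List, Set


def head_based_sample(trace_ids: Iterable[str], rate: float, rng: random.Random) -> Set[str]:
    """Decide at the START of each request, before its latency is known."""
    return {tid for tid in trace_ids if rng.random() < rate}


class TailBasedSampler:
    """Buffer spans per in-progress trace; decide only when the trace completes."""

    def __init__(self, slow_threshold_ms: float):
        self.slow_threshold_ms = slow_threshold_ms
        self.in_progress: Dict[str, List[float]] = {}  # memory grows with in-flight traces
        self.kept: Dict[str, List[float]] = {}

    def record_span(self, trace_id: str, duration_ms: float) -> None:
        self.in_progress.setdefault(trace_id, []).append(duration_ms)

    def finish_trace(self, trace_id: str) -> bool:
        spans = self.in_progress.pop(trace_id, [])
        keep = any(d >= self.slow_threshold_ms for d in spans)  # assumed policy
        if keep:
            self.kept[trace_id] = spans
        return keep


# 1,000 requests, one of which is a rare, slow outlier.
latencies = {f"t{i}": 20.0 for i in range(1000)}
latencies["t500"] = 5000.0

# Head-based: each trace, including the outlier, survives with probability 0.01.
head_kept = head_based_sample(latencies, rate=0.01, rng=random.Random(42))

# Tail-based: every span is buffered until its trace finishes.
sampler = TailBasedSampler(slow_threshold_ms=1000.0)
for tid, ms in latencies.items():
    sampler.record_span(tid, ms)
peak_buffered = len(sampler.in_progress)
for tid in list(latencies):
    sampler.finish_trace(tid)

print(sorted(sampler.kept), peak_buffered)
```

The tail-based sampler always retains the slow trace, but only by holding all 1,000 traces in memory while they are in flight; the head-based sampler holds nothing, but keeps the outlier only by chance.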
Heterogeneous full-stack environments
A distributed tracing solution is absolutely crucial for understanding the factors that affect application latency. However, modern applications are developed using different programming languages and frameworks, and they must support a wide range of mobile and web clients. To effectively measure latency, distributed tracing solutions need to follow concurrent and asynchronous calls from end-user web and mobile clients all the way down to servers and back, through microservices and serverless functions.
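Following a request across clients and services written in different languages generally depends on passing a small piece of trace context along with each call. The Python sketch below illustrates the idea only; the header names and the `inject`/`extract` helpers are hypothetical and do not correspond to any specific tracer's wire format.

```python
import uuid
from typing import Dict, Optional, Tuple

TRACE_HEADER = "x-example-trace-id"      # hypothetical header name
PARENT_HEADER = "x-example-parent-span"  # hypothetical header name


def new_id() -> str:
    """Generate a random 16-character hexadecimal identifier."""
    return uuid.uuid4().hex[:16]


def inject(trace_id: str, span_id: str, headers: Dict[str, str]) -> Dict[str, str]:
    """Caller side: copy the current trace context into outgoing request headers."""
    out = dict(headers)
    out[TRACE_HEADER] = trace_id
    out[PARENT_HEADER] = span_id
    return out


def extract(headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Callee side: continue the caller's trace, or start a new one if none was sent."""
    trace_id = headers.get(TRACE_HEADER) or new_id()
    return trace_id, headers.get(PARENT_HEADER)


# A mobile client starts a trace and calls a backend service.
client_trace, client_span = new_id(), new_id()
outgoing = inject(client_trace, client_span, {"accept": "application/json"})

# The backend, possibly written in another language, reads the same headers.
server_trace, server_parent = extract(outgoing)
print(server_trace == client_trace, server_parent == client_span)
```

Because the context travels as plain key-value pairs, any client or service that can read and write request metadata can take part in the same trace.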
Distributed traces on their own are just analytical data, much like raw time-series metrics or log files. So even if the right traces are captured, solutions must provide valuable insights about these traces to put them in the right context for the issues being investigated. For example, “when did the end-user response time slow for this customer?” or “did our latest nightly build cause this spike in failures?” Answering these questions requires aggregate trace data analysis on a global scale beyond individual hosts, an understanding of historical performance, and the ability to segment spans without cardinality limitations.
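A question such as “did our latest nightly build cause this spike in failures?” amounts to segmenting spans by a tag and aggregating across the whole system. The following Python sketch shows the shape of that computation on a handful of in-memory records; the `error_rate_by_tag` function, the `build` tag, and the sample data are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SpanRecord = Tuple[Dict[str, str], bool]  # (tags, is_error)


def error_rate_by_tag(spans: Iterable[SpanRecord], tag_key: str) -> Dict[str, float]:
    """Group spans by the value of one tag and compute the error rate per group."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for tags, is_error in spans:
        value = tags.get(tag_key, "<untagged>")
        totals[value] += 1
        if is_error:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


spans = [
    ({"build": "nightly-41"}, False),
    ({"build": "nightly-41"}, False),
    ({"build": "nightly-42"}, True),
    ({"build": "nightly-42"}, False),
]

print(error_rate_by_tag(spans, "build"))
```

The same grouping works for any tag key, which is why being able to segment on arbitrary, high-cardinality tags matters for this kind of analysis.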
[x]PM enables you and your team to see performance as histograms across your entire system rather than oversimplified statistics. Latency histograms let you discover, categorize, and isolate distinct performance behaviors. Historical context, even for tags and operations you never thought to monitor, makes it easy to understand what’s normal and what’s not.
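To see why a histogram can be more informative than a single summary statistic, consider this small Python sketch. The bucket boundaries and latency values are made up for the example, and `latency_histogram` is a hypothetical helper, not part of [x]PM.

```python
from bisect import bisect_right
from statistics import mean
from typing import List, Sequence


def latency_histogram(latencies_ms: Sequence[float], bounds_ms: List[float]) -> List[int]:
    """Count latencies per bucket: below bounds[0], [bounds[0], bounds[1]), ..., at or above bounds[-1]."""
    counts = [0] * (len(bounds_ms) + 1)
    for value in latencies_ms:
        counts[bisect_right(bounds_ms, value)] += 1
    return counts


# 90 fast requests and 10 slow ones: two distinct performance behaviors.
latencies = [20] * 90 + [900] * 10
bounds = [50, 100, 500, 1000]

print(mean(latencies))                       # 108: matches no real request
print(latency_histogram(latencies, bounds))  # [90, 0, 0, 10, 0]
```

The mean of 108 ms describes neither group of requests, while the histogram shows the two separate populations at a glance.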
LightStep aims to help people design and build better production systems at scale. Ben Sigelman, LightStep CEO and Co-founder, was one of the creators of Dapper, Google’s distributed tracing solution. We’re creators of OpenTracing, the open-standard, vendor-neutral solution for API instrumentation. And we provide our own commercial solution for enterprises that want to harness the power of comprehensive trace data.
Equip your team with more than just basic tracing. [x]PM is engineered from its foundation to address the inherent challenges of monitoring distributed systems and microservices at scale. [x]PM’s innovative Satellite Architecture analyzes 100% of unsampled transaction data to produce complete end-to-end traces and robust metrics that explain performance behaviors and accelerate root-cause analysis.