The truth about sampling and distributed tracing
by Katy Farmer
There are a number of different ways to sample data when observing microservices. Lightstep, for example, analyzes 100% of unsampled event data. We’ll cover this a bit more later. For now, let’s take a look at the sampling landscape.
Why do some solutions sample distributed trace data? The short answer is that most of the time, not all of the data matters. Developers want to know when outages happen or which patterns are emerging, and they may not need all of the data (which is a lot of data) to do those things. Capturing and analyzing all of the trace data generated by every request throughout a modern distributed system, even a fairly simple and low-volume one, quickly becomes a costly project, and that’s before tallying the additional cost of transmission and storage.
It’s worth understanding a few common sampling methodologies, and the impact they can have on the data that you see.
In head-based sampling (also referred to as “upfront” sampling), the sampling decision is made for each trace when that trace is initiated. Trace data is either included in the sample set or discarded based on that initial decision, regardless of what’s recorded in the completed trace. This is the most common form of sampling because of its simplicity, but it comes with some disadvantages. Head-based sampling typically selects traces at random, so the decision can’t take context or related trace data into account, which limits the valuable information you can derive from the sample.
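To make the head-based decision concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not any particular tracer's API: the function name `head_sample` and the choice to hash the trace ID (so that every service reaches the same verdict for the same trace) are hypothetical. The key property is that the decision uses only information available when the trace starts.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start whether to keep a trace.

    Hypothetical helper: only the trace ID and the sampling rate are
    consulted, so nothing recorded later (latency, errors) can
    influence the outcome.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls below the rate-scaled threshold.
    return bucket < int(rate * 2**64)
```

Calling `head_sample(trace_id, 0.1)` keeps roughly one trace in ten. Because the verdict is fixed before the request runs, a trace that later turns out to be slow or to contain an error has the same chance of being discarded as any other.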
In tail-based sampling, the sampling decision is made at the end of the workflow, which allows for more intelligent decisions. It is commonly used for latency analysis, because end-to-end latency can’t be computed until the workflow is complete. For example, tail-based sampling might evaluate a trace’s properties and then collect anomalous data, like a long latency or a rare error. However, in order to make a decision at the end of the workflow, the system has to buffer some information, which can increase storage overhead. Tail-based sampling is often done automatically, with sampling determinations made in collectors or services that decide based on isolated, independent portions of the trace data.
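The buffering trade-off is easier to see in code. The sketch below is a simplified illustration, not a real collector: the `Span` and `TailSampler` names are hypothetical, and it assumes the root span of each trace arrives last, which real systems cannot rely on.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False
    is_root: bool = False  # assumption: the root span arrives last


class TailSampler:
    """Buffer spans per trace and decide once the trace is complete."""

    def __init__(self, latency_threshold_ms: float):
        self.latency_threshold_ms = latency_threshold_ms
        self._buffer = defaultdict(list)  # the storage overhead lives here
        self.kept = {}

    def record(self, span: Span) -> None:
        self._buffer[span.trace_id].append(span)
        if span.is_root:
            self._finish(span.trace_id, span)

    def _finish(self, trace_id: str, root: Span) -> None:
        spans = self._buffer.pop(trace_id)
        # End-to-end latency is only known now that the root has finished.
        slow = root.duration_ms >= self.latency_threshold_ms
        failed = any(s.error for s in spans)
        if slow or failed:
            self.kept[trace_id] = spans
```

Every span has to sit in `_buffer` until its trace finishes, which is the overhead described above. In exchange, a slow or failing trace is never dropped by chance.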
Stratified sampling attempts to capture representative and diverse trace data by separating the data into regions (“strata”) and computing an independent sample for each. This strategy helps surface more relevant traces by ensuring that each region of the data is well represented in the sample. The approach combines naturally with head- or tail-based sampling, and is often used to find high-latency traces when applied as stratified tail-based sampling.
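As a rough sketch of the idea, the Python function below groups traces into strata and draws a separate random sample from each. The function name, the `key` callback, and the fixed `per_stratum` quota are all illustrative assumptions. Real systems may size each stratum's sample differently.

```python
import random
from collections import defaultdict


def stratified_sample(traces, key, per_stratum, seed=0):
    """Draw up to `per_stratum` traces independently from each stratum.

    `key` maps a trace to its stratum name (for example, a latency
    bucket). The seed is fixed only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)

    sample = []
    for name in sorted(strata):  # stable order across runs
        members = strata[name]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

With a key such as `lambda t: "slow" if t["duration_ms"] >= 1000 else "fast"`, a handful of slow traces gets the same quota as thousands of fast ones, so the rare stratum is not drowned out the way it would be in a uniform random sample.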
Lightstep analyzes 100% of unsampled event data in order to understand the broader story of performance across the entire stack. Unlike head-based sampling, we’re not limited by decisions made at the beginning of a trace, which means we’re able to identify rare, low-fidelity, and intermittent signals that contributed to service or system latency. And unlike tail-based sampling, we’re not limited to looking at each request in isolation: data from one request can inform sampling decisions about other requests. This dynamic sampling means we can analyze all of the data, but only send the information you need to know. Lightstep stores the required information to understand each mode of performance, explain every error, and make intelligent aggregates for what matters most to each developer, team, and organization.
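The idea that data from one request can inform decisions about other requests can be illustrated with a toy example. To be clear, this is not Lightstep's implementation, which is not described here. It is a hypothetical sketch in which every request is observed, a rolling baseline is maintained, and only requests that stand out against that baseline are kept. The class name, window size, and the "twice the recent average" rule are all assumptions made for illustration.

```python
from collections import deque


class AdaptiveSampler:
    """Observe every request; keep those that deviate from recent history."""

    def __init__(self, window=100, factor=2.0, warmup=10):
        self.recent = deque(maxlen=window)  # rolling window of latencies
        self.factor = factor
        self.warmup = warmup

    def observe(self, duration_ms: float) -> bool:
        """Return True if this request should be kept."""
        if len(self.recent) < self.warmup:
            # No reliable baseline yet, so this sketch keeps everything.
            keep = True
        else:
            baseline = sum(self.recent) / len(self.recent)
            # Earlier requests set the bar that this request is judged by.
            keep = duration_ms > self.factor * baseline
        self.recent.append(duration_ms)
        return keep
```

Every latency feeds the baseline whether or not its request is kept, so the decision for one request depends on what was seen in others.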
These are the basics of sampling, but any data scientist will tell you that there are much more nuanced and complex ways to sample data. You can read about more types of sampling here. With this foundation, you’ll be better able to understand how your data is analyzed.