
The Truth About Sampling and Distributed Tracing

There are a number of different ways to sample data when observing microservices. Lightstep, for example, analyzes 100% of unsampled event data. We’ll cover this a bit more later. For now, let’s take a look at the sampling landscape.

Why Sample Distributed Trace Data?

Why do some solutions sample distributed trace data? The short answer is that most of the time, not all of the data matters. Developers want to know when outages happen or which patterns are emerging, and they may not need all of the data (which is a lot of data) to do those things. Capturing and analyzing all of the trace data generated by every request throughout a modern distributed system, even a fairly simple and low-volume one, quickly becomes a costly project, and that’s before tallying the additional cost of transmission and storage. 

It’s worth understanding a few common sampling methodologies, and the impact they can have on the data that you see. 

Head-based sampling

In head-based sampling (also referred to as “upfront” sampling), a sampling decision is applied to a trace at the moment that trace is initiated. Trace data is either included in the sample set or discarded based on that initial decision, regardless of what’s recorded in the completed trace. This is the most common form of sampling due to its simplicity, but it comes with disadvantages. Because the decision is made before anything is known about the request, head-based sampling selects data on random criteria, which can limit your ability to derive other valuable information from that data because the sample can’t account for context, such as latency or errors, that only emerges later in the trace.
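To make this concrete, here’s a minimal sketch of a head-based sampler (in Python, with hypothetical names, not any particular tracer’s API): the keep-or-drop decision is made exactly once, at trace start, from a fixed probability, and every span recorded afterward simply inherits that decision.

```python
import random

SAMPLE_RATE = 0.01  # keep roughly 1% of traces

def start_trace():
    """Decide once, up front, whether this trace will be kept."""
    sampled = random.random() < SAMPLE_RATE
    return {"sampled": sampled, "spans": []}

def record_span(trace, span):
    # Every downstream span inherits the head decision; if the trace
    # wasn't sampled, its spans are dropped no matter what they contain.
    if trace["sampled"]:
        trace["spans"].append(span)
```

Note that a slow request or a rare error arriving in `record_span` can never rescue a trace that was dropped at `start_trace` — that’s exactly the limitation described above.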

Tail-based sampling

In tail-based sampling, the sampling decision is made at the end of the workflow, which allows for more intelligent choices. For example, tail-based sampling can evaluate a completed trace’s properties and keep only anomalous data, like a long latency or a rare error. Because end-to-end latency can’t be computed until the workflow is complete, tail-based sampling is commonly used for latency analysis. The trade-off is that, in order to decide at the end, the system has to buffer trace information in the meantime, which increases memory and storage overhead. Tail-based sampling is often done automatically, with sampling determinations made in collectors or services that decide based on isolated, independent portions of the trace data.
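A minimal illustrative sketch of the tail-based decision (hypothetical field names, assuming spans are plain dicts with start/end timestamps): the full trace is buffered first, and only once all spans are in hand does the sampler check for long latency or errors.

```python
def tail_sample(spans, latency_threshold_ms=500):
    """Decide AFTER the trace completes: keep it if it was slow or errored.

    `spans` is the buffered, complete trace -- this buffering is the
    storage overhead tail-based sampling pays for its smarter decisions.
    """
    end_to_end_ms = (max(s["end_ms"] for s in spans)
                     - min(s["start_ms"] for s in spans))
    has_error = any(s.get("error") for s in spans)
    return end_to_end_ms > latency_threshold_ms or has_error
```

Unlike the head-based case, a rare error or an unusually slow request is guaranteed to be kept, because the decision sees the whole trace.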

Stratified sampling

Stratified sampling attempts to capture representative and diverse trace data by separating the data into regions (“strata”) and computing independent samples for each. This sampling strategy helps discover more relevant traces by ensuring that each region of data is well represented in the sample. This approach combines naturally with head- or tail-based sampling, and is often used to find high latency traces when applied as stratified tail-sampling.
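A small sketch of the idea (assuming traces are plain dicts and the stratum key is something like the endpoint name): group traces by stratum, then sample each group independently, so that a rare stratum isn’t crowded out by a high-volume one.

```python
import random
from collections import defaultdict

def stratified_sample(traces, key, per_stratum=2):
    """Partition traces into strata by `key`, then sample each stratum
    independently so every region of the data is represented."""
    strata = defaultdict(list)
    for trace in traces:
        strata[trace[key]].append(trace)
    sample = []
    for group in strata.values():
        # Independent sample per stratum; small strata contribute
        # everything they have rather than being drowned out.
        sample.extend(random.sample(group, min(per_stratum, len(group))))
    return sample
```

With a uniform sample, 5 `/checkout` traces among 100 `/health` checks would rarely survive; stratifying by endpoint guarantees both endpoints appear in the result.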

Dynamic Sampling at Lightstep

Lightstep analyzes 100% of unsampled event data in order to understand the broader story of performance across the entire stack. Unlike head-based sampling, we’re not limited by decisions made at the beginning of a trace, which means we’re able to identify rare, low-fidelity, and intermittent signals that contributed to service or system latency. And unlike tail-based sampling, we’re not limited to looking at each request in isolation: data from one request can inform sampling decisions about other requests. This dynamic sampling means we can analyze all of the data, but only send the information you need to know. Lightstep stores the required information to understand each mode of performance, explain every error, and make intelligent aggregates for what matters most to each developer, team, and organization.

Summary

These are the basics of sampling, but any data scientist will tell you that there are much more nuanced and complex ways to sample data. You can read about more types of sampling here. With this foundation, you’ll be better able to understand how your data is analyzed. Be sure to check out our complete guide to distributed tracing!

Interested in joining our team? See our open positions here.

March 17, 2020
4 min read
Distributed Tracing


About the author

Katy Farmer
