Dapper, Google’s distributed tracing solution
What we can learn from an early implementation of distributed tracing
In their seminal white paper, the authors of “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure” (including Lightstep co-founder and CEO, Ben Sigelman) report on Google’s internal distributed tracing solution. The paper describes how they built and deployed Dapper and how it was used by engineers across the organization.
Google is a unique organization in many ways, so when looking at technology choices made at Google, we must be careful not to create a cargo cult. However, this paper strongly influenced subsequent open source projects, including Zipkin and Jaeger, as well as the terminology of tracing (including concepts like spans). Understanding Dapper can shed a lot of light on the approaches these projects have taken.
The paper also dispels several common misconceptions about tracing:
- Tracing doesn’t need to be high-overhead – and it can be “always-on.” The real costs of tracing are in the infrastructure to transmit and store all of the raw data.
- Tracing is not just about looking at individual traces – some of the most powerful use cases at Google involved analyzing thousands or millions of traces. Dapper’s API plays a key role in enabling teams to use raw trace data.
- Tracing is not just about finding factors contributing to production incidents – Dapper has many use cases, including optimizing long-tail performance, discovering service dependencies, and supporting testing and continuous integration.
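
To make the idea of analyzing many traces at once more concrete, here is a minimal sketch of how service dependencies might be derived from raw span data. The `Span` record and the `service_dependencies` function are hypothetical simplifications for illustration, not Dapper's actual data model or API. The sketch assumes only that each span carries a trace ID, a span ID, a parent span ID, and the name of the service that produced it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not Dapper's real schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str


def service_dependencies(spans: Iterable[Span]) -> Counter:
    """Count caller -> callee service edges across all traces."""
    spans = list(spans)
    # Index spans by (trace_id, span_id) so each parent can be looked up.
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = Counter()
    for s in spans:
        if s.parent_id is None:
            continue  # root spans have no caller
        parent = by_id.get((s.trace_id, s.parent_id))
        # Skip spans whose parent was not collected, and calls that stay
        # within a single service.
        if parent is None or parent.service == s.service:
            continue
        edges[(parent.service, s.service)] += 1
    return edges


# Two made-up traces that both start at a "frontend" service.
sample = [
    Span("t1", "a", None, "frontend"),
    Span("t1", "b", "a", "search"),
    Span("t1", "c", "b", "index"),
    Span("t2", "a", None, "frontend"),
    Span("t2", "b", "a", "search"),
    Span("t2", "c", "a", "ads"),
]

deps = service_dependencies(sample)
for (caller, callee), count in sorted(deps.items()):
    print(f"{caller} -> {callee}: {count}")
```

No single trace shows the whole picture here: the `frontend` to `ads` edge appears only in the second trace, and the `search` to `index` edge only in the first. Aggregating over many traces is what surfaces the full dependency graph.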
You can download the original paper to learn more.