A Brief History of “The Span”: Hard to Love, Hard to Kill
by Ben Sigelman
To the best of my knowledge, I’m the one to blame for the term “span,” at least as it relates to distributed tracing. I remember when and where the term itself was born – it was Christmas break in 2005, and I was home for the holidays. Everyone else had gone to bed, and I was up late and enjoying the solitude (as well as the week-long yuletide respite from Google’s usual deluge of email). In particular, I was trying to generalize a promising prototype project called “Dapper.” I created a new file called “tracer.h” and started sketching out an interface for something that would go beyond hacky RPC-only tracing, which is all that Dapper did at the time. I had to come up with a name for a new C++ class I was creating, and – after some deliberation – went with span.
Before we proceed, we should define our terms: what is a span in the context of distributed tracing? In “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure,” published several years later, we defined the term as follows:
There’s the old joke: “There are only two hard things in computer science: naming, cache invalidation, and off-by-one errors.” It’s all true, especially the bit about naming. Conceptually, there are many different ways to approach the naming possibilities for this “span” concept:
- Within the code itself, the API feels like a timer
- When one considers the trace as a directed graph, the data structure seems like a node or vertex
- In the context of structured, multi-process logging (side note: at the end of the day, that’s what distributed tracing is), one might think of a span as two events
- Given a simple timing diagram (like the ones pulled from the Dapper paper above), it’s tempting to call the concept a duration or window
This is all to say that “span” was initially a longshot contender; but then the rest were all eliminated, mostly due to being overloaded. As a term, “span” was unsullied by previous associations: nobody had any idea about what a span was (and wasn’t). That gave it a big leg up over timers, nodes, durations, and the other alternatives.
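The four competing views in the list above can be seen side by side in one small data structure. The following is a minimal sketch in Python, not the original C++ class from “tracer.h” and not the API of Dapper or any real tracing library; the `Span` class, its field names, and the `as_events` helper are all hypothetical and exist only for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span: a named, timed operation with an optional parent."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None  # graph view: edge to the parent node
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: Optional[float] = None
    end: Optional[float] = None

    def __enter__(self) -> "Span":  # timer view: the clock starts on entry
        self.start = time.monotonic()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:  # ...and stops on exit
        self.end = time.monotonic()

    @property
    def duration(self) -> float:  # duration/window view
        if self.start is None or self.end is None:
            raise ValueError("span has not finished")
        return self.end - self.start

    def as_events(self) -> list:  # logging view: two structured events
        return [
            {"span_id": self.span_id, "event": "start", "ts": self.start},
            {"span_id": self.span_id, "event": "end", "ts": self.end},
        ]


# Usage: nest two spans the way a request handler might wrap a database call.
with Span("handle_request", trace_id="t1") as root:
    with Span("db_query", trace_id="t1", parent_id=root.span_id) as child:
        time.sleep(0.01)  # stand-in for real work
```

In this sketch the same object behaves like a timer in the calling code, like a node once `parent_id` links it into a graph, like a pair of events through `as_events`, and like a window through `duration`.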
But the really interesting thing about spans is not how they got their name: it’s why this concept has been so enduring.
Every few months, someone approaches the OpenTelemetry project (or its predecessor, OpenTracing) and explains that spans are really just pairs of structured events; and, further, that it would be better to represent them that way for the sake of generality. In many respects this is totally sensible, and I’ve made the same argument myself from time to time. There is even (high-quality!) prior art for this approach in the academic tracing world – most notably X-Trace and the more recent Pivot Tracing work (also, Tracelytics, the O.G. commercial tracing company, built its product and backend around the span-less X-Trace model).
To understand why spans have survived despite being less general, less familiar, and arguably harder to instrument than these alternatives, we have to think about the scenarios where distributed tracing comes up in the first place: namely, for improving the performance of latency-sensitive, transactional services. There are really three factors at play:
- Service performance is almost entirely dependent on the service’s dependencies
- Those dependencies are typically accessed via some variety of nested RPCs (“Remote Procedure Calls”)
- The telemetry data from these multi-service traces is so large that it is infeasible to assemble 100% of the traces while remaining ROI-positive from an observability standpoint
So what does this have to do with spans? Well, the third factor tells us that we can’t assemble all of the traces all of the time – so how do we decide which ones are worth our finite resources? We need to know whether service performance was a problem, and we need to know it prior to trace assembly. And that is why spans are so convenient. In addition to being a single, atomic data structure, a span always has a duration; and that duration means we have the performance data we need without looking elsewhere. This ubiquitous, accessible duration data also makes it easy to build things like Google’s recently-open-sourced z-pages and other features.
This is all a long way of saying that spans are the crucial performance optimization that enables intelligently-biased, low-cost trace assembly. Without guaranteed access to a duration, we are reduced to completely random sampling – and modern observability demands better than that!
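To make the idea of duration-biased assembly concrete, here is a minimal sketch in Python of a per-span decision function. It is an assumption-laden illustration, not the sampling policy of Dapper, OpenTelemetry, or any production system: the `FinishedSpan` type, the `should_assemble` function, the 500 ms threshold, and the 1% baseline rate are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FinishedSpan:
    """Hypothetical finished span; only the fields the decision needs."""
    trace_id: str
    name: str
    duration_ms: float


def should_assemble(span, slow_threshold_ms=500.0, baseline_rate=0.01,
                    rng=random.random):
    """Decide, from a single span alone, whether its trace is worth assembling.

    Slow spans are always kept; everything else falls back to a small
    random baseline. No other span in the trace has to be consulted.
    """
    if span.duration_ms >= slow_threshold_ms:
        return True
    return rng() < baseline_rate


# Usage: a fixed rng makes the example deterministic.
spans = [
    FinishedSpan("t1", "checkout", 42.0),
    FinishedSpan("t2", "checkout", 1850.0),
]
kept = [s.trace_id for s in spans if should_assemble(s, rng=lambda: 0.5)]
```

The point of the sketch is that the decision reads only `duration_ms`, which every span carries. With a pair of independent start and end events, the same decision would first require finding and joining both events.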
Spans are suboptimal for non-nested computations. For instance, tracing a transaction that wends its way into and out of Kafka queues can be done with spans, but there are often large blocks of time that are unaccounted for, unfortunately including the period when the transaction was waiting in a queue. Since span-oriented tracing systems assume that all latency-sensitive operations are wrapped in spans, it’s tricky to focus an analysis on these streaming workloads, where the latency-sensitive sections appear as whitespace in the traces.
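The “whitespace” described here can be measured directly. The following Python sketch computes how much of a trace’s overall time window is not covered by any span; the `uncovered_time` function and the interval values are hypothetical, and a real tracing backend would work from its own span records rather than bare tuples.

```python
def uncovered_time(window, spans):
    """Return the total time inside `window` that no span interval covers.

    `window` is a (start, end) pair for the whole transaction, and `spans`
    is a list of (start, end) pairs. Overlapping spans are counted once.
    """
    window_start, window_end = window
    covered = 0.0
    cursor = window_start  # everything before the cursor is already counted
    for start, end in sorted(spans):
        start = max(start, cursor)
        end = min(end, window_end)
        if end > start:
            covered += end - start
            cursor = end
    return (window_end - window_start) - covered


# Usage: a producer span, then a long wait in a queue, then a consumer span.
transaction = (0.0, 100.0)
instrumented = [(0.0, 10.0), (70.0, 100.0)]
gap = uncovered_time(transaction, instrumented)
```

In this made-up example the two instrumented spans account for 40 of the 100 time units, so the remaining 60 units, the time spent waiting in the queue, belong to no span at all.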
There are also situations in which unrelated transactions are batched, so once again the idea of transactional nesting is broken. For instance, a distributed database (like Cassandra) might use log-structured writes to improve write throughput and lower write latency, then occasionally issue some sort of compaction to rebuild a read-structured data structure, to allow for faster initialization and data resharding. The compaction thus batches the cost of indexing potentially thousands of previous transactions – are those transactions parents of the compaction span? Or just “related work”? Depending on your point of view, you can make either argument; but in both cases, spans are an uncomfortable fit.
Distributed tracing must form the backbone of any modern observability strategy, and, gradually, our industry is coming to terms with that fact. And eventually, once our software correctly propagates trace context from mobile down through managed infrastructure, we may be able to migrate to a more complex and flexible model that still allows for a production-grade tracing system’s sampling and assembly optimizations.
But for the time being, we need spans: straightforward, “good-enough” telemetry with a built-in ranking function.