OpenTelemetry 101: What Is Tracing?
by Austin Parker
In the first OpenTelemetry 101 post, I introduced observability as the set of practices and tools that turn telemetry (tracing, metrics, and logs) into valuable insights. Today, I'll take a deeper look at the first of these: tracing.
Note: The information in this post is subject to change as the specification for OpenTelemetry continues to mature.
What is tracing?
Traditionally, tracing is a low-level practice that developers use to profile and analyze application code, combining specialized debugging tools (such as dtrace on Linux or ETW on Windows) with programming techniques. When we refer to tracing in OpenTelemetry, we’re generally referring to distributed tracing (or distributed request tracing), an application of these traditional tracing techniques to modern, microservice-based architectures.
Microservices and tracing
Microservices make it significantly harder to trace a request through an application, because a microservices deployment is distributed by nature. Consider a traditional monolithic application: with the code base centralized on a single host, diagnosing a failure can be as simple as following a single stack trace. But when an application consists of tens, hundreds, or thousands of services running across many hosts, you can no longer rely on an individual stack trace. Instead, you need something that represents the entire request as it moves from service to service and component to component. Distributed tracing solves this problem and enables capabilities such as anomaly detection, distributed profiling, workload modeling, and diagnosis of steady-state problems.
OpenTelemetry and tracing
Much of the terminology and many of the mental models that we use to describe distributed tracing originate in systems such as Magpie, X-Trace, and Dapper. Dapper in particular has been highly influential on modern distributed tracing efforts, and OpenTelemetry borrows many of its concepts from that project. The goal of these efforts has been to profile requests as they move across service boundaries, generating high-quality data about those requests that is suitable for analysis.
The diagram in Figure 1 represents a sample trace. A trace is a collection of linked spans, which are named and timed operations that represent a unit of work in the request. A span that isn’t the child of any other span is the parent span, or root span, of the trace. The root span describes the end-to-end latency of the entire trace, with child spans representing sub-operations.
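To make these definitions concrete, here is a minimal sketch in Python of a trace as a collection of linked spans. The `Span` class, its fields, and the helper functions are hypothetical names chosen for illustration, not the OpenTelemetry API, and the timestamps are plain millisecond values rather than real clock readings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Toy span: a named, timed operation (not the OpenTelemetry Span type)."""
    name: str
    span_id: str
    parent_id: Optional[str]  # None means this span has no parent
    start_ms: int
    end_ms: int


def find_root(trace: List[Span]) -> Span:
    """Return the single span that is not the child of any other span."""
    roots = [s for s in trace if s.parent_id is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]


def children_of(trace: List[Span], parent: Span) -> List[Span]:
    """Return the spans whose parent is the given span."""
    return [s for s in trace if s.parent_id == parent.span_id]


# Hypothetical trace: one root span with two sub-operations.
sample_trace = [
    Span("handle_request", "a", None, start_ms=0, end_ms=120),
    Span("query_database", "b", "a", start_ms=10, end_ms=70),
    Span("render_response", "c", "a", start_ms=75, end_ms=115),
]

root = find_root(sample_trace)
latency_ms = root.end_ms - root.start_ms  # end-to-end latency of the trace
print(f"{root.name}: {latency_ms} ms end to end")
for child in children_of(sample_trace, root):
    print(f"  {child.name}: {child.end_ms - child.start_ms} ms")
```

In this toy model the root span's duration is the end-to-end latency of the request, and each child span accounts for part of that time.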
To put this in more concrete terms, consider the request flow of a system that you might encounter in the real world: a ridesharing app. When a user requests a ride, multiple actions begin to take place. Information is passed between services to authenticate and authorize the user, validate payment information, locate nearby drivers, and dispatch one of them to pick up the rider.
A simplified diagram of this system, and a trace of a request through it, appear in the following figure. As you can see, each operation generates a span to represent the work being done during its execution. These spans have implicit parent-child relationships, both with the span that begins the entire request at the client and with the spans of individual services in the trace. Traces are composable in this way: a valid trace consists of valid sub-traces.
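The composability described here can be sketched with a toy model of the ridesharing request. The service and operation names below are invented for illustration, and representing a trace as a dictionary of parent links is an assumption made for brevity, not how OpenTelemetry stores spans. Picking any span and collecting its descendants yields a smaller tree that is itself a well-formed trace.

```python
from typing import Dict, List, Optional

# Hypothetical spans for one ride request: span name -> parent span name.
# A parent of None marks the root span, created at the client.
PARENTS: Dict[str, Optional[str]] = {
    "client: request_ride": None,
    "api: handle_request": "client: request_ride",
    "auth: authorize_user": "api: handle_request",
    "payment: validate_card": "api: handle_request",
    "dispatch: assign_driver": "api: handle_request",
    "geo: find_nearby_drivers": "dispatch: assign_driver",
}


def subtrace(parents: Dict[str, Optional[str]], top: str) -> Dict[str, Optional[str]]:
    """Return the sub-trace rooted at `top`: that span plus all of its descendants."""
    result: Dict[str, Optional[str]] = {top: None}  # `top` becomes the new root
    changed = True
    while changed:  # keep sweeping until no new descendants are found
        changed = False
        for name, parent in parents.items():
            if parent in result and name not in result:
                result[name] = parent
                changed = True
    return result


def render(parents: Dict[str, Optional[str]],
           node: Optional[str] = None, depth: int = 0) -> List[str]:
    """Render the span tree as indented lines, parents before children."""
    lines: List[str] = []
    for name, parent in parents.items():
        if parent == node:
            lines.append("  " * depth + name)
            lines.extend(render(parents, name, depth + 1))
    return lines


print("\n".join(render(PARENTS)))

# The dispatch operation and everything beneath it form a valid trace on their own.
dispatch_only = subtrace(PARENTS, "dispatch: assign_driver")
print("\n".join(render(dispatch_only)))
```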
OpenTelemetry and spans
Each span in OpenTelemetry encapsulates several pieces of information: the name of the operation it represents, start and end timestamps, events that occurred during the span, attributes that describe it, links to other spans, and the status of the operation. In Figure 1, the dashed lines connecting spans represent the context of the trace. The context (or trace context) contains several pieces of information that can be passed between functions inside a process or between processes over an RPC. In addition to the span context (the identifiers that represent the parent trace and span), the context can contain other information about the process or request, such as custom labels. Much of the information in a span, such as the operation name and the start and end timestamps, is required, but some is optional. OpenTelemetry offers two data types, Attribute and Event, that are especially valuable because they help to contextualize what happens during the execution measured by a single span.
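One way to picture how the span context travels between processes over an RPC is as a small header attached to each outgoing call. The sketch below assumes the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, and flags as hyphen-separated hex fields), which this post does not name. The `SpanContext` class and the `inject`, `extract`, and `start_child` helpers are hypothetical, and the validation is deliberately simplified.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

_HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class SpanContext:
    """Identifiers that tie a span to its trace (toy type, not the OpenTelemetry one)."""
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool = True


def _is_hex(value: str, length: int) -> bool:
    """True if `value` is exactly `length` lowercase hex characters."""
    return len(value) == length and all(c in _HEX for c in value)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the context from incoming headers; return None if absent or malformed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    version, trace_id, span_id, flags = parts
    if version != "00" or not (
        _is_hex(trace_id, 32) and _is_hex(span_id, 16) and _is_hex(flags, 2)
    ):
        return None
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 0x01))


def start_child(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace ID and gets a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


# Caller side: create a context and attach it to the outgoing call.
caller = SpanContext(secrets.token_hex(16), secrets.token_hex(8))
outgoing: Dict[str, str] = {}
inject(caller, outgoing)

# Callee side: recover the context and continue the same trace.
received = extract(outgoing)
callee = start_child(received) if received else None
```

Because the callee reuses the caller's trace ID, a backend can later stitch spans from both processes into a single trace.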
Attributes (known as tags in OpenTracing) are key-value pairs that can be freely added to a span to help in analysis of the trace data. You can think of Attributes as data that you would like to eventually aggregate or use to filter your trace data, such as a customer identifier, a process hostname, or anything else that fits your tracing needs. Events (known as logs in OpenTracing) are time-stamped strings that can be attached to a span, with an optional set of Attributes that provide further description. OpenTelemetry additionally provides semantic conventions: reserved attributes and events for operation- or protocol-specific information. Spans in OpenTelemetry are generated by the Tracer, an object that tracks the currently active span and allows you to create (or activate) new spans.
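As a rough illustration of how Attributes and Events contextualize a span, the following sketch uses toy `Span` and `Event` classes. These classes, the `filter_spans` helper, and the attribute keys are stand-ins invented for this example; they are not the OpenTelemetry types or its reserved semantic conventions.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    """A time-stamped message with optional descriptive attributes."""
    name: str
    timestamp: float
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Span:
    """Toy span holding attributes and events (not the OpenTelemetry Span type)."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def set_attribute(self, key: str, value: Any) -> None:
        """Attach a key-value pair that can later be used to filter or aggregate."""
        self.attributes[key] = value

    def add_event(self, name: str, attributes: Optional[Dict[str, Any]] = None) -> None:
        """Record something that happened at this moment during the span."""
        self.events.append(Event(name, time.time(), attributes or {}))


def filter_spans(spans: List[Span], key: str, value: Any) -> List[Span]:
    """Select the spans whose attribute `key` equals `value`."""
    return [s for s in spans if s.attributes.get(key) == value]


# Hypothetical spans from two ride requests.
first = Span("dispatch_driver")
first.set_attribute("customer.id", "cust-42")
first.set_attribute("host.name", "dispatch-1")
first.add_event("no drivers in range, widening search", {"radius_km": 5})
first.add_event("driver assigned")

second = Span("dispatch_driver")
second.set_attribute("customer.id", "cust-7")

# Attributes make it possible to pull out only the spans for one customer.
matches = filter_spans([first, second], "customer.id", "cust-42")
```

Attributes describe the span as a whole, while each Event marks a specific moment inside it and can carry its own Attributes.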
Tracer objects are configured with Propagator objects that support transferring the context across process boundaries. The exact mechanism for creating and registering a Tracer depends on your implementation and language, but you can generally expect a global Tracer capable of providing a default tracer for your spans, a Tracer provider capable of granting access to the tracer for your component, or both. As spans are created and completed, the Tracer dispatches them to the OpenTelemetry SDK’s Exporter, which is responsible for sending your spans to a backend system for analysis.
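The flow from Tracer to Exporter can be sketched with a toy implementation. Everything here is an assumption made for illustration: the `Tracer`, `InMemoryExporter`, and `Span` classes are simplified stand-ins rather than the OpenTelemetry SDK, propagation across processes is left out, and the exporter only collects spans in a list instead of sending them to a backend.

```python
import contextvars
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    """Toy span that remembers its parent (not the OpenTelemetry Span type)."""
    name: str
    parent: Optional["Span"]
    start: float
    end: Optional[float] = None


# Holds the currently active span for the running thread or task.
_ACTIVE_SPAN = contextvars.ContextVar("active_span", default=None)


class InMemoryExporter:
    """Stand-in exporter: keeps finished spans in a list instead of sending them on."""

    def __init__(self) -> None:
        self.finished: List[Span] = []

    def export(self, span: Span) -> None:
        self.finished.append(span)


class Tracer:
    """Toy tracer: tracks the active span and hands finished spans to an exporter."""

    def __init__(self, exporter: InMemoryExporter) -> None:
        self._exporter = exporter

    def current_span(self) -> Optional[Span]:
        return _ACTIVE_SPAN.get()

    @contextmanager
    def start_span(self, name: str) -> Iterator[Span]:
        # The new span becomes a child of whatever span is active right now.
        span = Span(name, parent=_ACTIVE_SPAN.get(), start=time.monotonic())
        token = _ACTIVE_SPAN.set(span)
        try:
            yield span
        finally:
            span.end = time.monotonic()
            _ACTIVE_SPAN.reset(token)    # restore the previously active span
            self._exporter.export(span)  # dispatch the completed span


exporter = InMemoryExporter()
tracer = Tracer(exporter)

with tracer.start_span("request_ride") as root:
    with tracer.start_span("authorize_user") as child:
        pass
    with tracer.start_span("dispatch_driver"):
        pass

exported_names = [s.name for s in exporter.finished]
```

In this sketch spans reach the exporter in the order they finish, so child spans arrive before the root span that encloses them.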
To summarize:
- A span is the basic building block of a trace; a trace is a collection of linked spans
- A span represents a unit of work: a named operation such as the execution of a microservice or a function call
- A parentless span is known as the root span or parent span of a trace
- Spans contain Attribute and Event objects, which describe and contextualize the work being done under a span
- A Tracer is used to create and manage spans inside a process and, through propagators, across process boundaries
In my next post in this series, I will discuss the OpenTelemetry metrics data source and how it interacts with traces. Stay tuned! If you're interested in learning more about OpenTelemetry now, you can read our OpenTelemetry 101: Technical Guide.