You’ve probably heard of distributed tracing -- it’s a form of application telemetry, sometimes referred to as a “pillar of observability” alongside metrics and logs. You might be familiar with tools like Jaeger and Zipkin, which are open source trace storage and analysis tools -- they help you visualize and search for distributed traces. Perhaps you know of OpenTelemetry, an open source standard for instrumenting software to create distributed trace data.
Regardless, you probably have a lot of questions about distributed tracing. I certainly did, which is why I wrote a book about it! What’s remarkable is that there have been a lot of new developments, innovations, and updates in the world of distributed tracing over the past two years -- enough to warrant a quick review of what constitutes modern distributed tracing.
Tracing In A Nutshell
Let’s start with a quick review. What is a trace, anyway? Fundamentally, a trace is a record of some request, or transaction, in a distributed system. Traces are made up of spans, which are records of discrete units of work. A single span will usually map to a single microservice (or equivalent functional unit in your architecture) and should contain useful metadata about a request. The spans in a trace are linked by way of a context, which carries a shared identifier that is unique to each request. Each span in a trace shares the same trace context.
To illustrate this, think about a basic 3-tier web application. You’ve got some client running in the browser, a server that exposes an API, and a database or other store to record and persist application data. A trace would give you detailed performance information about each component on a per-request basis. Each span would record interesting metadata about the request -- such as the user agent of the client browser, the particular hostname and region of the server, and the query being sent to the database. In addition, each span would record the amount of time spent to process and handle its work, any logging data that you might wish to emit, and more.
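To make the shape of that data concrete, here is a minimal Python sketch of the three-tier request described above. The `Span` class, the `start_span` and `end_span` helpers, and every attribute value are hypothetical, made up for illustration; real tracing libraries such as OpenTelemetry have richer data models and handle passing the context between processes for you.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical structure, for illustration)."""
    trace_id: str             # shared by every span in the same request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    name: str
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


def start_span(name, parent=None, **attributes):
    """Create a span, inheriting the trace ID from its parent if it has one."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(trace_id, secrets.token_hex(8), parent_id, name,
                dict(attributes), start=time.monotonic())


def end_span(span):
    """Record the end time so the span knows how long its work took."""
    span.end = time.monotonic()
    return span


# One request passing through the client, the API server, and the database.
client = start_span("GET /cart", user_agent="Mozilla/5.0")
server = start_span("cart-api", parent=client, host="api-1", region="us-east-1")
db = start_span("SELECT cart", parent=server,
                query="SELECT * FROM carts WHERE id = ?")
for span in (db, server, client):  # children finish before their parents
    end_span(span)

trace = [client, server, db]
```

All three spans carry the same `trace_id`, which is what lets an analysis tool stitch them back into a single request, while `parent_id` records which unit of work called which.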
Now, we can use these traces in many ways. Often, developers and SREs will find them valuable to simply understand application health on a per-request basis. A single trace can be useful in order to inspect a single transaction, after all. This is only the tip of the iceberg, however -- tools such as Lightstep Observability allow you to aggregate many traces together and analyze them based on their metadata. This gives you the ability to ask interesting questions about your distributed system -- for example, to compare average or p95 latency across different cloud regions, or to evaluate the error rates between different versions of a service or dependency. These are questions that are challenging to ask with traditional metrics or logs alone -- in the case of metrics, you often lack the required dimensions in order to perform these complex queries, and in the case of logs, you may be required to perform costly (and slow) searches across un-indexed data.
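As a rough sketch of what that kind of aggregation looks like, the Python below groups spans by a metadata key and reports the average and p95 latency for each group. The span dictionaries, the `region` attribute, and the latency numbers are all invented for illustration, and the nearest-rank percentile used here is just one of several common definitions; an analysis tool would run this sort of query for you over far more data.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def latency_by(spans, key):
    """Group span durations by one attribute and summarize each group."""
    groups = defaultdict(list)
    for span in spans:
        groups[span["attributes"][key]].append(span["duration_ms"])
    return {
        value: {
            "count": len(durations),
            "avg_ms": sum(durations) / len(durations),
            "p95_ms": p95(durations),
        }
        for value, durations in groups.items()
    }


# Made-up spans from the same service running in two regions.
spans = [
    {"duration_ms": 100, "attributes": {"region": "us-east-1"}},
    {"duration_ms": 120, "attributes": {"region": "us-east-1"}},
    {"duration_ms": 110, "attributes": {"region": "us-east-1"}},
    {"duration_ms": 130, "attributes": {"region": "us-east-1"}},
    {"duration_ms": 200, "attributes": {"region": "eu-west-1"}},
    {"duration_ms": 900, "attributes": {"region": "eu-west-1"}},
    {"duration_ms": 210, "attributes": {"region": "eu-west-1"}},
    {"duration_ms": 220, "attributes": {"region": "eu-west-1"}},
]

stats = latency_by(spans, "region")
```

Because the grouping key is just another attribute on the span, the same function answers the "which version has more errors" style of question if you swap `region` for a version attribute.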
What’s New With Tracing?
Over the past couple of years, distributed tracing has gone from a ‘nice to have’ to a ‘need to have’ for many operators. This shift has been led by three important advances in the world of tracing: OpenTelemetry, eBPF, and Continuous Profiling. Let’s talk about these three trends.
OpenTelemetry has introduced many powerful, and crucial, features to the distributed tracing story. Foremost, it provides a standard data format, API, SDK, and toolset to developers, SREs, and maintainers who wish to add or integrate distributed tracing into their applications, services, frameworks, and systems. The value of OpenTelemetry is threefold:
It provides a broadly supported and standard API and data model for creating and serializing telemetry data.
It’s a lingua franca -- you can write your instrumentation code once, but use the data with dozens of open source or commercial analysis tools.
It defines a set of semantic conventions and standards for telemetry metadata. Your telemetry will be consistent across cloud providers, application runtimes, languages, and more.
Finally, OpenTelemetry is extensible. You’re not limited to writing your own instrumentation from scratch; you can use prebuilt agents and integrations to quickly get started. In addition, this extensibility makes it ideal for integration into frameworks and libraries. We’re starting to see more native integration of OpenTelemetry into popular frameworks like Spring and .NET Core.
The second big change in tracing has been the rise of eBPF. While a complete discussion of eBPF is out of scope for this post, the quick explanation is that it allows for traces to be generated at the kernel level rather than in application code. This means that we can trace our distributed system without having to create any instrumentation in our application at all, as long as we’re comfortable with the tradeoff of losing our ability to understand what’s happening at the code level. The real power of eBPF isn’t in its ability to replace application tracing, however, but to extend it. It provides more granular information about things that are difficult to quantify with white box tracing alone, such as DNS lookups and connection overhead.
A common critique of tracing is that while it provides a great “bird’s eye view” of application performance, it lacks the granular detail needed to pin a problem down to a specific line of code. Continuous profiling seeks to address this by silently and continuously profiling an application in production, creating the kind of per-function data that developers need in order to optimize their application. Some profiling tools even integrate with OpenTelemetry out of the box. This allows profiles to be associated with distributed traces for a “best of both worlds” scenario: you can not only understand request health from the perspective of an end user, but also drill down to a specific line of code in that request to understand why performance is what it is.
Getting Started with Tracing
There’s never been a better time to start using distributed tracing in your system. First, though, you need to ask yourself a few questions in order to orient yourself towards the best place to start.
Do you have an existing, microservice-based application or not?
How is your application deployed and run?
Can you modify your system’s source code?
If you’re starting from nothing and just want an idea of how this all works in practice, I recommend the OpenTelemetry Demo. This is a sample application, created and maintained by the OpenTelemetry project, which shows how to use OpenTelemetry in over eleven languages. It’s the perfect place to get started to help you understand how all the pieces fit together.
If you’ve got an existing application deployed to Kubernetes, then the OpenTelemetry Operator for Kubernetes allows you to deploy and configure not only instrumentation for your application, but also the infrastructure needed to collect, process, and transmit the telemetry to a destination for analysis. This is also ideal if you can’t modify the source of your application, as it can even inject automatic instrumentation agents for you.
The automatic instrumentation in OpenTelemetry differs in exact details per language, but the goal is broadly similar -- to provide traces for critical libraries and functions with no code changes. These agents can be integrated into your existing software as sidecar processes or run alongside it on a virtual machine or bare metal, and are configured through environment variables.
While exploring a demo or automatically instrumenting existing applications via agents is a good way to get started, you’ll come to the point where you need more than they offer in order to unlock deeper insights into your system. Automatic instrumentation can tell you a lot, but it doesn’t know much about the business logic of your services, nor the metadata that’s useful to you in order to create interesting aggregations. For example, while you could use automatic instrumentation in order to discover that requests in a particular availability zone are performing worse than others, you would need custom instrumentation to let you know which specific customers were impacted by it.
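Here is a small Python sketch of why that extra attribute matters. It assumes spans have already been collected as plain dictionaries; `cloud.availability_zone` follows the OpenTelemetry semantic conventions for cloud resources, while `app.customer_id`, the `impacted_customers` helper, and all of the sample values are hypothetical stand-ins for whatever your own custom instrumentation records.

```python
def impacted_customers(spans, zone, slow_ms):
    """Customers with at least one slow request in the given availability zone."""
    return sorted({
        span["attributes"]["app.customer_id"]
        for span in spans
        if span["attributes"].get("cloud.availability_zone") == zone
        and span["duration_ms"] >= slow_ms
        # Spans from automatic instrumentation alone lack the customer attribute.
        and "app.customer_id" in span["attributes"]
    })


# Made-up spans; the last one has only automatically recorded attributes.
spans = [
    {"duration_ms": 950,
     "attributes": {"cloud.availability_zone": "us-east-1a",
                    "app.customer_id": "acme"}},
    {"duration_ms": 120,
     "attributes": {"cloud.availability_zone": "us-east-1a",
                    "app.customer_id": "globex"}},
    {"duration_ms": 1100,
     "attributes": {"cloud.availability_zone": "us-east-1a",
                    "app.customer_id": "initech"}},
    {"duration_ms": 1300,
     "attributes": {"cloud.availability_zone": "us-east-1b",
                    "app.customer_id": "globex"}},
    {"duration_ms": 990,
     "attributes": {"cloud.availability_zone": "us-east-1a"}},
]

slow_in_zone = impacted_customers(spans, "us-east-1a", slow_ms=900)
```

The last span is slow and in the affected zone, but without a customer attribute it can only tell you that something is wrong, not who is feeling it.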
The utility of tracing can’t be denied in terms of understanding transaction performance at a per-request level. There are critiques that have been leveled against it, though, especially around the volume of additional data that tracing can generate. While this additional data usually pays for itself in your ability to understand system health and performance, there are a lot of details to consider.
What population of traces do I need to keep? For instance, should I only keep traces where an error occurred or a representative sample of all traces?
How long should I keep full traces versus aggregate statistical data about those traces?
Who’s responsible for maintaining my trace data, both at an instrumentation and at an analytical level?
When should I use traces to investigate performance vs. metrics or logs?
As you can see, there’s no “one size fits all” answer here -- you need to consider your organization, your application architecture, your team structure, and so forth.
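To make the first of those questions a little more concrete, here is a sketch of one possible retention policy in Python: keep every trace that contains an error, plus a deterministic sample of everything else. The `keep_trace` function, the `error` flag on each span, and the 10% default are assumptions chosen for illustration, not a recommendation or a description of how any particular tool samples.

```python
import hashlib


def keep_trace(trace_id, spans, sample_rate=0.1):
    """Decide whether to retain a finished trace.

    Traces with any errored span are always kept. Other traces are kept
    when a hash of the trace ID falls below the sample rate, so the same
    trace ID always gets the same decision.
    """
    if any(span.get("error") for span in spans):
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


# An errored trace is kept even when the baseline sample rate is zero.
failed = [{"name": "checkout", "error": True}]
healthy = [{"name": "checkout", "error": False}]

kept_failed = keep_trace("trace-a", failed, sample_rate=0.0)
kept_healthy = keep_trace("trace-a", healthy, sample_rate=0.0)
```

Hashing the trace ID rather than rolling a random number means every service that sees the same trace reaches the same decision, which is one way to avoid keeping half of a request and discarding the rest.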
What is crucial, however, is that we don’t think of traces as the be-all and end-all of application telemetry. They’re extremely good at one thing: understanding application transactions in a distributed system. To get the most value out of all our telemetry data, then, we need to unify traces with other signals like metrics and logs.
Signal unification -- or, ‘unified observability’ -- requires us to consider our application telemetry not just as separate pillars of disconnected data, but as multiple streams of interrelated and interconnected data. This unification provides many benefits:
By unifying metrics and traces, we can use metrics to build histograms of traffic through our API endpoints and use traces to ensure we have representative samples of those requests by status code, route, and latency.
By unifying logging and traces, we can discover relationships between our resources (such as database servers, Kubernetes clusters, virtual machines, etc.) and the transactions that run on them.
Bringing all of our signals together allows us to make better decisions about what data to keep and what data to discard from a long-term storage perspective, reducing our observability costs overall.
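One way to picture the first of these benefits is a latency histogram whose buckets each remember an example trace. The Python sketch below is only an illustration of that idea: the bucket boundaries, the `build_histogram` helper, and the request data are all made up, and real metrics systems have their own formats for attaching example traces to histogram buckets.

```python
import bisect

# Upper bounds of each latency bucket in milliseconds (hypothetical values).
BOUNDS_MS = [50, 100, 250, 500, 1000]


def build_histogram(requests, bounds=BOUNDS_MS):
    """Count requests per latency bucket and keep one example trace for each."""
    buckets = [{"le": bound, "count": 0, "exemplar_trace_id": None}
               for bound in bounds]
    # Catch-all bucket for anything slower than the largest bound.
    buckets.append({"le": float("inf"), "count": 0, "exemplar_trace_id": None})
    for request in requests:
        # Index of the first bucket whose upper bound is >= the duration.
        index = bisect.bisect_left(bounds, request["duration_ms"])
        bucket = buckets[index]
        bucket["count"] += 1
        if bucket["exemplar_trace_id"] is None:
            bucket["exemplar_trace_id"] = request["trace_id"]
    return buckets


# Made-up requests to a single API route.
requests = [
    {"trace_id": "t1", "duration_ms": 30},
    {"trace_id": "t2", "duration_ms": 40},
    {"trace_id": "t3", "duration_ms": 300},
    {"trace_id": "t4", "duration_ms": 5000},
]

histogram = build_histogram(requests)
```

The counts give you the cheap, aggregate view of traffic, and the stored trace ID in each bucket gives you a representative request to open when one of those buckets looks wrong.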
Distributed tracing has grown from being perceived as a ‘nice-to-have’ niche into a powerful and effective primary observability signal for understanding and profiling distributed system performance. OpenTelemetry, eBPF, and continuous profiling continue to make strides in making tracing more powerful, more accessible, and more useful to developer workflows.
While it isn’t a panacea, distributed tracing is a powerful tool. It allows you to understand the health of your entire system, to see the relationship between services and dependencies, and to quickly pinpoint changes. This allows you to proactively discover the contributing factors to an incident more quickly and accurately, before they impact customers. It also helps reduce your monitoring costs, so your bill doesn’t outpace revenue growth. And finally, it helps you deploy more quickly, with more confidence.
See how Lightstep can help you improve uptime and reliability and make your tech estate more resilient. Schedule a demo with our team.
December 21, 2022
About the author
Austin Parker