Distributed tracing is a powerful tool for monitoring and managing the performance of today’s complex distributed systems. At LightStep, we’ve analyzed more than 10 billion traces and helped leading enterprises all over the world, including GitHub, Lyft, Twilio, Segment, Skyscanner, and many others, successfully instrument their applications and leverage the power of distributed tracing. With this experience, we know where things can go wrong and the best practices that teams can adopt to ensure success.
When we talk about instrumentation, we’re mainly focused on the OpenTracing API. It was created by industry experts to solve a growing and recognized problem: how to gain visibility into distributed systems. The OpenTracing API is a layer that sits between the origin of the data we want visibility into (such as the application logic, microservices frameworks, and RPC libraries) and the tracing system that receives that data, which in our case is the LightStep tracer.
To begin instrumentation, we recommend identifying the relevant frameworks. Modern distributed systems are built on large, shared libraries. Often, the OpenTracing community has already provided helper plugins that add tracing and instrumentation to these libraries. Anything that the plugins don’t cover can be instrumented directly with the OpenTracing API, so there are no gaps in tracing and instrumentation. Finally, we provide the LightStep tracer library to send the data to our SaaS backend and present it for analysis.
Instantiating the LightStep tracer takes minimal effort. The OpenTracing community manages the opentracing-contrib project, which adds OpenTracing support to popular libraries and frameworks and publishes the resulting plugins in a central repository for library developers to use. Leveraging these plugins expands OpenTracing coverage within an application and reduces the amount of explicit instrumentation code that teams have to write themselves.
What to trace
Identify a discrete high-value transaction
So, how do we decide what to start tracing? Modern distributed systems have many moving parts, and it’s often difficult to determine what counts as a transaction or a request. One option is to identify something with discrete and specific timestamp bookends. These can be operations such as adding a product to a cart or booking an appointment, which have a clear beginning and end. They are prime targets for instrumentation because we can track the latency and see where the time is actually being spent.
Identify the points of ingress and egress
Identify the points of ingress and egress, and instrument breadth first, not depth first. Rather than trying to gain deep visibility into a specific request or operation, we first want to capture just the entry and exit points. These give us timestamp bookends and an approximation of overall latency. Once we’ve fed this data into the LightStep tracer system for analysis, we have our first end-to-end trace reported. Then we can add details such as tags, logs, and finer-grained spans around internal functions.
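As a rough illustration of breadth-first instrumentation, here is a minimal Python sketch. It deliberately does not use the OpenTracing API: the span helper, the FINISHED_SPANS list, and the handle_checkout function are hypothetical stand-ins that only record start and finish timestamps, so the example runs with the standard library alone.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a tracer. A real setup would use the
# OpenTracing API plus a tracer library instead of this list.
FINISHED_SPANS = []


@contextmanager
def span(operation_name):
    """Record start and finish timestamps (the "bookends") for one operation."""
    record = {"operation": operation_name, "start": time.monotonic()}
    try:
        yield record
    finally:
        record["finish"] = time.monotonic()
        record["duration"] = record["finish"] - record["start"]
        FINISHED_SPANS.append(record)


def handle_checkout(cart):
    # Ingress: a single span covering the whole request.
    with span("checkout.request"):
        total = sum(cart.values())
        # Egress: a single span around the outbound call, nothing deeper yet.
        with span("payment.call"):
            time.sleep(0.01)  # placeholder for a downstream RPC
        return total


result = handle_checkout({"bread": 3, "milk": 2})
```

Two spans are enough to approximate end-to-end latency and to see how much of it is spent waiting on the downstream call. Deeper spans can be added later, once this first trace is flowing.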
How many spans do we need?
Two common questions about instrumentation are how many spans we need and how granular we should go. A general rule of thumb for beginning instrumentation is to create fewer, larger spans: about one to three per library or component. Here’s a practical, real-world example. Imagine this is our request-response lifecycle: leaving the house, getting on a bicycle, going to the store, buying bread, and returning home. We’re interested in how much time was spent on the bicycle, buying bread, or on a particular block. We’re not interested in granular operations such as how long a single pedal turn took, because that would create a million extraneous spans. Software systems are similar, and that’s why we recommend beginning with fewer, larger spans, getting approximations for latency, and then doing deep dives as a follow-up to gain greater visibility into targeted operations.
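The bicycle example can be sketched in a few lines of Python. This is an illustration only, not OpenTracing code: run_errand, the stage names, and the pedal_turns tag are all invented for the example. Each stage of the errand becomes one span, and the fine-grained work is counted and attached as a tag rather than traced on its own.

```python
import time


def run_errand(pedal_turns_per_leg=500):
    """Trace the errand with one span per stage, not one per pedal turn."""
    spans = []
    for stage in ("ride_to_store", "buy_bread", "ride_home"):
        start = time.monotonic()
        turns = 0
        if stage.startswith("ride"):
            for _ in range(pedal_turns_per_leg):
                turns += 1  # fine-grained work: counted, not traced
        spans.append({
            "operation": stage,
            "duration": time.monotonic() - start,
            "tags": {"pedal_turns": turns},
        })
    return spans


spans = run_errand()
```

The errand produces three spans regardless of how many pedal turns it takes, and the count is still available as context on the span if it ever matters.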
What about tags?
The OpenTracing project defines standard tags, such as error to indicate if the operation has failed and component to identify the software package. There are plenty of others, such as HTTP status code or peer hostname, and they’re available on the OpenTracing website and GitHub repository. However, tags are arbitrary and user-defined key:value pairs, so it’s important to standardize them when multiple teams are going to share information in order to avoid discrepancies.
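One way to standardize tags across teams is a small shared module that lists the agreed keys and flags anything else. The sketch below is an assumption about how such a module might look, not something OpenTracing provides. The keys error, component, http.status_code, and peer.hostname come from the OpenTracing semantic conventions, while customer.tier, deploy.region, and the validate_tags helper are made up for illustration.

```python
# Hypothetical shared module, published once and imported by every team.
# The first set follows the OpenTracing semantic conventions; the second
# holds made-up examples of organization-specific tags.
STANDARD_TAGS = {"error", "component", "http.status_code", "peer.hostname"}
ORG_TAGS = {"customer.tier", "deploy.region"}
ALLOWED_TAGS = STANDARD_TAGS | ORG_TAGS


def validate_tags(tags):
    """Return the tag keys that are not part of the agreed vocabulary."""
    return sorted(key for key in tags if key not in ALLOWED_TAGS)


unknown = validate_tags(
    {"http.status_code": 200, "customerTier": "gold", "region": "eu"}
)
```

Here unknown contains customerTier and region, the two keys that drifted from the shared vocabulary. A check like this could run in code review or in tests, so discrepancies are caught before they reach the tracing backend.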
Centralized resources and documentation are critical. Everyone should go to the same place, start the same way, and know where to go to get additional information. This ensures uniformity as well as common knowledge, which can really help propagate and evangelize tracing within the organization.
Shared frameworks or helper libraries are another great place to start adding instrumentation. Many teams and services can use them, and it’s a way to get broad coverage with relatively low effort.
Standardize tag and naming conventions, and add logs to spans for richer context. Logs provide fine-grained, timestamped information without creating too many spans.
And finally, incorporate tracing into the service-provisioning process to ensure that tracing continues to expand in the future.
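To make the shared-helper and span-log practices concrete, here is a minimal sketch of a helper that a platform team might publish. Everything in it is hypothetical: the traced decorator, the log callback, the FINISHED_SPANS list, and the reserve_item function are stand-ins written with the standard library, not part of OpenTracing or the LightStep tracer. The helper applies the standard component and error tags uniformly and lets the wrapped function attach timestamped log events to its single span.

```python
import functools
import time

FINISHED_SPANS = []  # hypothetical sink; a real tracer would report spans


def traced(component):
    """Wrap a function in one span tagged with a standard component value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {
                "operation": func.__name__,
                "tags": {"component": component, "error": False},
                "logs": [],
                "start": time.time(),
            }

            # The wrapped function gets a log callback instead of
            # opening child spans for every small step.
            def log(event):
                span["logs"].append({"timestamp": time.time(), "event": event})

            try:
                return func(*args, log=log, **kwargs)
            except Exception:
                span["tags"]["error"] = True
                raise
            finally:
                span["finish"] = time.time()
                FINISHED_SPANS.append(span)
        return wrapper
    return decorator


@traced(component="inventory-client")
def reserve_item(sku, log):
    log("cache_miss")
    log("db_row_locked")
    return f"reserved:{sku}"


outcome = reserve_item("sku-42")
```

Because every team imports the same helper, span names and tags stay consistent, and a new service picks up tracing as part of its normal setup rather than as a separate project.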
Pitfalls to avoid
The biggest risk in getting started with distributed tracing is partial instrumentation. Often, someone becomes interested in tracing and champions it to other groups and teams, but not everyone adopts it. The resulting data is incomplete, so it’s hard to show value, momentum drops off, and the tracing effort ends. This ties into under-resourcing the project. Tracing needs to be a real initiative within the organization. Otherwise, it’s a lot of work for no value. It’s also important that we trace the complete request-response lifecycle rather than focus on specific parts. That provides the full visibility that we want.
Distributed tracing is an extremely powerful solution for engineering teams whether they are in firefighting mode and trying to find the root cause of a performance issue or trying to improve overall performance. Instrumentation is vital to get started with distributed tracing successfully. Contact us and let our team of experts help you.