Best Practices for Instrumenting Applications with OpenTracing

Distributed tracing is a powerful tool for monitoring and managing the performance of today's complex distributed systems. At Lightstep, we've analyzed more than 10 billion traces and helped leading enterprises all over the world, including GitHub, Lyft, Twilio, Segment, Skyscanner, and many others, successfully instrument their applications and leverage the power of distributed tracing. With this experience, we know where things can go wrong and the best practices that teams can adopt to ensure success.

OpenTracing: The basics

When we talk about instrumentation, we're mainly focused on the OpenTracing API. It was created by industry experts to solve a growing and recognized problem: how to gain visibility into distributed systems. The OpenTracing API is a layer that sits between the origin of the data we want visibility into (such as the application logic, microservices frameworks, and RPC libraries) and the Lightstep tracer system, to which it feeds that data.

[Figure: OpenTracing architecture]

Getting started

To begin instrumentation, we recommend first identifying the frameworks and shared libraries your services rely on. Modern distributed systems are built on large, shared libraries, and the OpenTracing community has often already provided helper plugins that add tracing and instrumentation to them. Anything these plugins can't cover can be instrumented directly with the OpenTracing API, so there are no gaps in tracing and instrumentation. Finally, we provide the Lightstep tracer library to send the data to our SaaS backend and present it for analysis.

[Figure: Lightstep tracing code and instrumentation]

Instantiating the Lightstep tracer requires minimal effort. The OpenTracing community manages the opentracing-contrib project, which adds OpenTracing support to popular libraries and frameworks and publishes the resulting plugins in a central repository for library developers to use. Leveraging these plugins expands OpenTracing coverage within an application and reduces the amount of explicit instrumentation code that has to be written by hand.

What to trace

Identify a discrete high-value transaction

So, how do we decide what to start tracing? Modern distributed systems have many moving parts, and it's often difficult to determine what counts as a transaction or a request. One option is to identify something with discrete and specific timestamp bookends. These can be operations like adding a product to a cart or booking an appointment, which have a clear beginning and end. They are prime targets for instrumentation because we can track the latency and see where the time is actually being spent.

Identify the points of ingress and egress

Identify the points of ingress and egress, and instrument breadth first, not depth first. Rather than trying to gain deep visibility into a specific request or operation, we first want to capture just the entry and exit points. These give us timestamp bookends and therefore an approximation of overall latency. Once we've fed this data into the Lightstep tracer system for analysis and our first end-to-end trace has been reported, we can add details such as tags and logs, and finer granularity such as instrumenting internal functions.
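
As a rough illustration of the breadth-first approach, the sketch below wraps only an entry point and an exit point of a service in spans. It uses a small in-memory recorder rather than the OpenTracing or Lightstep libraries; the `span` helper, the `finished_spans` list, and the operation names are all hypothetical and exist only to show the idea of timestamp bookends.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory recorder standing in for a real tracer.
# This is NOT the OpenTracing or Lightstep API.
finished_spans = []


@contextmanager
def span(operation_name):
    """Record start and finish timestamps (the bookends) for one operation."""
    record = {"operation": operation_name, "start": time.monotonic()}
    try:
        yield record
    finally:
        record["finish"] = time.monotonic()
        finished_spans.append(record)


def call_inventory_service(item):
    # Egress point: the outbound call to a (hypothetical) downstream service.
    with span("inventory.reserve"):
        time.sleep(0.01)  # stand-in for network latency
        return True


def handle_add_to_cart(item):
    # Ingress point: the request enters the service here.
    with span("cart.add_item"):
        return call_inventory_service(item)


handle_add_to_cart("bread")

# Spans are appended as they finish, so the inner (egress) span comes first.
for s in finished_spans:
    duration_ms = (s["finish"] - s["start"]) * 1000
    print(f"{s['operation']}: {duration_ms:.1f} ms")
```

With only these two spans in place, the gap between the ingress and egress durations already approximates how much time the service itself contributes, which is usually enough to decide where deeper instrumentation is worth adding.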

How many spans do we need?

Two common questions about instrumentation are how many spans we need and how granular we recommend going. A general rule of thumb for beginning instrumentation is to create fewer, larger spans: about one to three per library or component. Here's a practical, real-world example. Imagine this is our request-response lifecycle: leaving the house, getting on a bicycle, going to the store, buying bread, and returning home. We're interested in how much time was spent on the bicycle, buying bread, or on a particular block. We're not interested in granular operations such as how long a single pedal turn took, because that would create a million extraneous spans. Software systems are similar, which is why we recommend beginning with fewer, larger spans, gaining approximations for latency, and then doing deep dives as a follow-up to gain greater visibility into targeted operations.

[Figure: Lightstep tracing spans example]
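
The bicycle trip can be written down as one parent span with three coarse child spans. The sketch below is illustrative only: the `Span` dataclass is a hypothetical stand-in for a tracer's span type, and the timings (in minutes) are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical coarse-grained span with start and finish times in minutes."""
    operation: str
    start_min: int
    finish_min: int

    @property
    def duration(self):
        return self.finish_min - self.start_min


# One parent span for the whole request-response lifecycle (made-up timings).
trip = Span("trip_to_store", 0, 30)

# A few large child spans; no span per pedal turn.
children = [
    Span("ride_to_store", 1, 11),
    Span("buy_bread", 11, 19),
    Span("ride_home", 19, 29),
]

for child in sorted(children, key=lambda s: s.duration, reverse=True):
    share = child.duration / trip.duration
    print(f"{child.operation}: {child.duration} min ({share:.0%} of trip)")

# Time inside the parent that no child span accounts for.
untraced = trip.duration - sum(c.duration for c in children)
print(f"untraced: {untraced} min")
```

Three child spans are enough to show where most of the time goes. The small untraced remainder is the kind of gap that a later, targeted deep dive could fill in if it turns out to matter.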

What about tags?

The OpenTracing project defines standard tags, such as error to indicate whether the operation has failed and component to identify the software package. There are plenty of others, such as HTTP status code or peer hostname, and they're documented on the OpenTracing website and GitHub repository. Beyond these, tags are arbitrary, user-defined key:value pairs, so it's important to standardize them when multiple teams are going to share information, in order to avoid discrepancies.
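
One possible way to standardize tags across teams is to keep the agreed keys in a single shared module and check spans against it. In the sketch below, the first four keys are standard tag names from the OpenTracing semantic conventions; the `customer.tier` key, the `ALLOWED_TAGS` set, and the `find_unknown_tags` helper are assumptions made up for this example.

```python
# Standard tag keys from the OpenTracing semantic conventions.
ERROR = "error"
COMPONENT = "component"
HTTP_STATUS_CODE = "http.status_code"
PEER_HOSTNAME = "peer.hostname"

# Hypothetical organization-specific key that teams have agreed on.
CUSTOMER_TIER = "customer.tier"

ALLOWED_TAGS = frozenset(
    {ERROR, COMPONENT, HTTP_STATUS_CODE, PEER_HOSTNAME, CUSTOMER_TIER}
)


def find_unknown_tags(tags):
    """Return the tag keys that are not part of the shared convention."""
    return sorted(set(tags) - ALLOWED_TAGS)


# One team spelled the customer tier key differently from the convention.
span_tags = {
    COMPONENT: "checkout-service",
    HTTP_STATUS_CODE: 500,
    ERROR: True,
    "customerTier": "gold",
}

unknown = find_unknown_tags(span_tags)
print(unknown)  # ['customerTier']
```

A check like this could run in tests or code review so that discrepancies such as `customerTier` versus `customer.tier` are caught before they fragment queries across teams.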

Best practices

Centralized resources and documentation are critical. Everyone should go to the same place, start the same way, and know where to go to get additional information. This ensures uniformity as well as common knowledge, which can really help propagate and evangelize tracing within the organization.

Shared frameworks or helper libraries are another great place to start adding instrumentation. Many teams and services can use them, and it's a way to get broad coverage with relatively low effort.

Standardize tag and naming conventions, and add logs to spans for richer context. Logs provide fine-grained, timestamped information without creating too many spans.

And finally, incorporate tracing into the service-provisioning process to ensure that tracing continues to expand in the future.
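
To make the point about logs concrete, the sketch below records several timestamped events on a single span instead of creating a child span for each step. The `Span` class is a hypothetical stand-in, not a real tracer's span; its `set_tag` and `log_kv` method names are modeled on the OpenTracing span interface, and the operation and event names are invented for the example.

```python
import time


class Span:
    """Hypothetical minimal span; a real tracer supplies its own span type."""

    def __init__(self, operation_name):
        self.operation_name = operation_name
        self.tags = {}
        self.logs = []
        self.start_time = time.monotonic()
        self.finish_time = None

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def log_kv(self, key_values):
        # Each log entry carries its own timestamp.
        self.logs.append((time.monotonic(), dict(key_values)))
        return self

    def finish(self):
        self.finish_time = time.monotonic()


# One span for the whole operation, with logs marking the steps inside it.
span = Span("checkout.process_payment")
span.set_tag("component", "payment-client")
span.log_kv({"event": "card_validated"})
span.log_kv({"event": "charge_attempt", "attempt": 1, "outcome": "timeout"})
span.log_kv({"event": "charge_attempt", "attempt": 2, "outcome": "ok"})
span.finish()

for timestamp, fields in span.logs:
    offset_ms = (timestamp - span.start_time) * 1000
    print(f"+{offset_ms:.3f} ms {fields}")
```

The trace still contains a single span for the payment step, yet the retry and its timing remain visible through the log entries.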

Pitfalls to avoid

The biggest risk in getting started with distributed tracing is doing only partial instrumentation. Often, someone becomes interested in tracing and acts as a champion, convincing groups and teams to use it, but not everyone adopts it. That leaves incomplete data, so it's hard to show value, the momentum drops off, and the tracing effort ends. This also ties into under-resourcing the project. It needs to be a real initiative within the organization. Otherwise, it's a lot of work for no value. It's also important that we trace the complete request-response lifecycle rather than focus on specific parts. That will provide the full visibility that we want.

Distributed tracing is an extremely powerful solution for engineering teams whether they are in firefighting mode and trying to find the root cause of a performance issue or trying to improve overall performance. Instrumentation is vital to get started with distributed tracing successfully. Contact us and let our team of experts help you.

January 30, 2019

About the author

Alex Masluk
