How to Launch a Distributed Tracing MVP with Just 50 Lines of Code

There comes a time in every successful technology company’s life when there’s a realization that it’s not quite clear what’s happening in production, and that lack of clarity is impacting customers. That point may happen with a monolith, a distributed monolith, SOA, microservices, or, often, a mix of them all. Perhaps someone says, “Let’s do distributed tracing, it should solve all of our observability problems.” However, when you look at the investment involved, you may end up thinking, “That’s hundreds of thousands of dollars in people’s time, not even counting the cost of the service. How can I possibly justify that?”

This post is for people facing the same question I faced three years ago.

Wearing the Customer’s Shoes

Working at Twilio on the Insight Engineering team, I had the opportunity to spend a few months looking at what it would take to “do distributed tracing.” Twilio had hundreds of services in several languages. There was significant “migration fatigue” after OS version, instance generation, and Classic to VPC moves. The appetite for another cross-team effort was low.

At first it seemed like an impossible problem: distributed tracing would require effort across teams working in different languages and frameworks, all on different schedules. Looked at that way, it was impossible. But somehow I needed to find the MVP for getting started with distributed tracing.

Twilio has a saying, “wear the customer’s shoes.” Reflecting on this, I decided that the best way to do that was to start instrumenting as close to the customer as possible, at the API edge service. By starting there I would see what each customer was experiencing for the entire time our platform was handling the request for each endpoint and method. I could tag each trace after authentication so that we could see a particular customer’s experience. Even better, when we decided to instrument further into the services that handled any given request, we’d always have that “customer’s shoes” context to start with.

Getting to a Root Cause

In the spirit of a minimum viable product, I put together a PR of fewer than 50 lines. For every request received, it created a span representing the time it took us to respond to that request. The span name used a standard prefix indicating a public API request, followed by the standard reference name for the API resource, and the span was tagged with the method and the customer. I also wrapped every request in which the API service acted as a client to another service in a span tagged with the downstream service and method. After some experiments, including some chaos testing in staging, I was cleared to deploy a canary to production.
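The original PR is not reproduced here. As a rough illustration of the shape described above, a minimal Python sketch might look like the following. It uses a hypothetical in-memory `span` helper rather than a real tracing library, and every name in it (`span`, `handle_request`, `call_downstream`, the `public-api.` and `client.` prefixes, `FINISHED_SPANS`) is an assumption made for illustration, not the code that was deployed.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a real tracer's reporting pipeline.
FINISHED_SPANS = []


@contextmanager
def span(name, **tags):
    """Time a block of work and record it as a finished span."""
    record = {"name": name, "tags": dict(tags), "error": False}
    start = time.monotonic()
    try:
        yield record
    except Exception:
        # Mark the span as failed, then let the error propagate unchanged.
        record["error"] = True
        raise
    finally:
        record["duration_s"] = time.monotonic() - start
        FINISHED_SPANS.append(record)


def call_downstream(service, method, send):
    """Wrap an outbound call so each downstream dependency gets its own span."""
    with span(f"client.{service}", service=service, method=method):
        return send()


def handle_request(method, resource, authenticate, handler):
    """Wrap one public API request in a span named with a standard prefix."""
    with span(f"public-api.{resource}", method=method) as s:
        customer = authenticate()
        # Tag after authentication so traces can be filtered per customer.
        s["tags"]["customer"] = customer
        return handler(customer)


# Example: one inbound request that makes one downstream call.
response = handle_request(
    "GET",
    "messages",
    authenticate=lambda: "customer-123",
    handler=lambda customer: call_downstream(
        "message-store", "GET", lambda: {"status": 200}
    ),
)
```

In this sketch the downstream span finishes before the request span that encloses it, so the recorded list holds the client span first. A real tracer would also link the two spans as parent and child; that is omitted here to keep the example short.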

Though we had metrics and simple histograms before, what we could see with this view, especially over time, was a game changer. The canary happened to be deployed during a performance issue with a downstream service. I was able to bring the cause of the issue to both the API and service teams quickly, and they were able to roll back within minutes. With this demonstration of what tracing could do, there was suddenly interest in removing the API team from the critical path for identifying the root cause of performance regressions or outages.

A Playbook for Launching Your Distributed Tracing MVP

The 50-line PoC turned into a purposeful refactor of the request handling and client code to provide a simple, single point of integration for tracing. Overall, the resulting changes were fewer than 200 lines of code and took a bit more than one week of one engineer’s time on the API team, substantially less than the 20 or so person-years it had originally appeared to require.

If you’re wondering how to get started with tracing, consider using this pattern as a playbook:

  1. Identify a part of your service that’s as close to your customer as you can get.

  2. Look for patterns in how that service receives requests that enable you to instrument once or at most a handful of times.

  3. Find trends in how that service makes requests to services, SaaS, and databases.

  4. Follow production deployment steps (staging, canaries, or whatever other risk management strategies your company uses) and start getting real data.

  5. Compare trace data with other metrics and understand the potential cause of differences.

  6. Observe the visibility of failure, either “naturally” or induced by Chaos testing.

  7. After you’ve found a key use case, continue to make measured investments driven by observed value.
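Step 5 can be made concrete with a small check. The sketch below, a hypothetical helper rather than anything from the original project, derives a 99th-percentile latency from span durations and compares it with the value an existing metrics system reports. The function names, the nearest-rank percentile method, and the 10% tolerance are all assumptions chosen for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def compare_latency(span_durations_ms, metric_p99_ms, tolerance=0.10):
    """Compare the trace-derived p99 with the p99 reported by existing metrics."""
    trace_p99 = percentile(span_durations_ms, 99)
    relative_gap = abs(trace_p99 - metric_p99_ms) / metric_p99_ms
    return {
        "trace_p99_ms": trace_p99,
        "metric_p99_ms": metric_p99_ms,
        "relative_gap": relative_gap,
        "within_tolerance": relative_gap <= tolerance,
    }


# Made-up sample data: span durations of 1..100 ms against a metric reporting 60 ms.
result = compare_latency(list(range(1, 101)), metric_p99_ms=60)
```

A large gap is not necessarily a bug in either system. It may mean the two measure different things, for example a metric recorded inside the handler versus a span that also covers authentication, which is exactly the kind of difference step 5 asks you to understand.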

While this approach is helpful, at some point you will face the challenge of perspective. If you only have the edge’s client perspective and the server’s perspective differs, you’ll need to figure out where the truth lies between them.

In future posts, I’ll cover mobile- and browser-based perspectives, integrating a service mesh into your tracing, and methods for adding internal services using frameworks or middleware.

If you have any questions about getting started with a Distributed Tracing MVP, you can reach me on Twitter @1mentat.

Interested in joining our team? See our open positions here.

March 28, 2019
5 min read
Distributed Tracing

About the author

James Burns