There comes a time in every successful technology company’s life when there’s a realization that it’s not quite clear what’s happening in production, and that lack of clarity is impacting customers. That point may happen with a monolith, a distributed monolith, SOA, microservices, or, often, a mix of them all. Perhaps someone says “let’s do distributed tracing, it should solve all of our observability problems.” However when you look at the investment involved, you may end up thinking, “that’s hundreds of thousands of dollars in people’s time, not even counting the cost of the service, how can I possibly justify that?”
This post is for people facing this same question I faced three years ago.
Wearing the Customer’s Shoes
Working at Twilio on the Insight Engineering team, I had the opportunity to spend a few months looking at what it would take to “do distributed tracing.” Twilio had hundreds of services in several languages. There was significant “migration fatigue” after OS version, instance generation, and Classic to VPC moves. The appetite for another cross-team effort was low.
At first it seemed like an impossible problem: Distributed Tracing would require efforts across teams in different languages, different frameworks, all on different schedules. It was an impossible problem looking at it that way. But somehow I needed to find the MVP for getting started with distributed tracing.
Twilio has a saying, “wear the customer’s shoes.” Reflecting on this, I decided that the best way to do that was to start instrumenting as close to the customer as possible, at the API edge service. By starting there I would see what each customer was experiencing for the entire time our platform was handling the request for each endpoint and method. I could tag each trace after authentication so that we could see a particular customer’s experience. Even better, when we decided to instrument further into the services that handled any given request, we’d always have that “customer’s shoes” context to start with.
Getting to a Root Cause
In the spirit of minimum viable product, I put together a PR of less than 50 lines. For every request received, it would create a span that represented the amount of time it took us to respond to that request. It used a standard prefix indicating it was a public API request, the standard reference name for the API resource, and tagged the method and customer. I also wrapped every request where the API service was a client to other services tagged with the downstream service and method. After some experiments, including some Chaos testing in staging, I was cleared to deploy a canary to production.
Though we had metrics and simple histograms before, what we could see with this view — especially over time — was a game changer. The canary happened to be deployed during a performance issue with a downstream service. I was able to bring the cause of the issue to both the API and service teams quickly, and they were able to rollback within minutes. With this demonstration of the capability of tracing, there was suddenly interest in removing the API team from the critical path for identifying the root cause for performance regressions or outages.
A Playbook for Launching Your Distributed Tracing MVP
The 50-line PoC turned into a purposeful refactor of the request handling and client code to provide a simple single point of integration for tracing. Overall, the resulting changes were less than 200 lines of code and a bit more than one week of one engineer on the API team — substantially less than 20 or so person years of time it had originally appeared to be.
If you’re wondering how to get started with tracing, consider using this pattern as a playbook:
- Identify a part of your service that’s as close to your customer as you can get.
- Look for patterns in how that service receives requests that enable you to instrument once or at most a handful of times.
- Find trends in how that service makes requests to services, SaaS, and databases.
- Follow production deployment steps (staging, canaries, or whatever other risk management strategies your company uses) and start getting real data.
- Compare trace data with other metrics and understand the potential cause of differences.
- Observe the visibility of failure, either “naturally” or induced by Chaos testing.
- After you’ve found a key use case, continue to make measured investments driven by observed value.
While this approach is helpful, at some point you will face the challenge of perspective. If you only have the edge’s client perspective and the perspective of server differs, you’ll need to figure out how the truth lies between them.
In future posts, I’ll cover mobile- and browser-based perspectives, integrating a service mesh into your tracing, and methods for adding internal services using frameworks or middleware.
If you have any questions about getting started with a Distributed Tracing MVP, you can reach me on Twitter @1mentat.