Istio Distributed Tracing: How to Get Started with LightStep and Kubernetes

LightStep Tracing is an easy way to start using distributed tracing without deploying your own distributed tracing system. Istio is a “batteries included” set of best practices for deploying and managing containerized software. The Istio proxy, based on Envoy, provides an automatic service mesh so that you can understand and control how different services communicate with each other. Envoy, and therefore Istio, supports distributed tracing out of the box.

For this walkthrough, we’ll be using Google Kubernetes Engine (GKE) for our Kubernetes cluster. We’ll assume you have a working gcloud CLI installation that includes kubectl as well (gcloud components install kubectl).

Creating a Cluster

Step 1: Create the cluster

export CLUSTER_NAME=lst-walkthrough # name of the GKE cluster
export PROJECT_NAME= # name of the project to create cluster in
export ZONE=us-central1-a # zone to create cluster in

# n1-standard-2: istio-telemetry failed to schedule with n1-standard-1
# --preemptible reduces costs for this proof of concept
gcloud container clusters create $CLUSTER_NAME \
--cluster-version latest \
--machine-type n1-standard-2 \
--num-nodes 3 \
--preemptible \
--zone $ZONE \
--project $PROJECT_NAME

Step 2: Store credentials locally for kubectl

gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT_NAME

Step 3: Grant our user cluster-admin role

kubectl create clusterrolebinding cluster-admin-binding --clusterrole=cluster-admin --user=$(gcloud config get-value core/account)

Step 4: Create the Istio system namespace for the components

kubectl create namespace istio-system

Step 5: Download and unpack an Istio distribution

export ISTIO_VERSION=1.1.5
wget https://github.com/istio/istio/releases/download/$ISTIO_VERSION/istio-$ISTIO_VERSION-osx.tar.gz
tar xzvf istio-$ISTIO_VERSION-osx.tar.gz
cd istio-$ISTIO_VERSION

This archive is for OS X; if you’re using a different OS, refer to the Istio installation instructions.

Step 6: Initialize Istio system certificates and Custom Resource Definitions (CRDs)

helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -

Step 7: Check for completion of the creation of the CRDs

kubectl get crds | grep 'istio.io\|certmanager.k8s.io' | wc -l

In version 1.1.5, this should print “53” once all of the CRDs have been created.

Step 8: Sign up for LightStep Tracing

  1. Navigate to https://go.lightstep.com/tracing.html
  2. Fill out the form
  3. Click on the link in the email to set up your account
  4. Go to Settings and copy the token

Step 9: Set up Istio with Helm

export ACCESS_TOKEN=""
helm template \
--set pilot.traceSampling=100 \
--set global.proxy.tracer="lightstep" \
--set global.tracer.lightstep.address="ingest.lightstep.com:443" \
--set global.tracer.lightstep.accessToken=$ACCESS_TOKEN \
--set global.tracer.lightstep.secure=true \
--set global.tracer.lightstep.cacertPath="/etc/lightstep/cacert.pem" \
install/kubernetes/helm/istio --name istio --namespace istio-system \
> $HOME/istio.yaml
kubectl apply -f $HOME/istio.yaml

Step 10: Create a cacert.pem file with the Let’s Encrypt Root CA

curl https://letsencrypt.org/certs/trustid-x3-root.pem.txt -o cacert.pem

The current version of the Istio LightStep integration requires a custom CA certificate bundle to be specified. The certificates for the public LightStep collectors at ingest.lightstep.com are issued by Let’s Encrypt.

Step 11: Add the file as a secret for the LightStep integration to use

kubectl create secret generic lightstep.cacert --from-file=cacert.pem

Step 12: Label your default namespace so that Istio will inject the Istio Proxy sidecar automatically

kubectl label namespace default istio-injection=enabled

Step 13: Wait for all pods to show as running (this can take a few minutes)

kubectl get pods --namespace istio-system

Step 14: Create the example BookInfo app and gateway:

kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml
kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml

Step 15: Capture the ingress host and ports needed to access the BookInfo app

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
export SECURE_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="https")].port}')

Step 16: Open BookInfo in your browser

open http://$INGRESS_HOST:$INGRESS_PORT/productpage

Viewing Distributed Traces

You can now go to https://app.lightstep.com/ and see traces for the example application.

[Screenshot: a distributed trace of the BookInfo application in LightStep]

Yay! A distributed trace! You can now see end-to-end transactions in your system. But, you may ask, how did this actually happen? It’s a common misunderstanding that tracing with a service mesh requires no code changes. In fact, services must pass through the distributed tracing headers even when they are not participating in the trace themselves. Let’s walk through those headers for Istio and where that forwarding is implemented in the BookInfo sample application.

The Istio tracing documentation lists the following headers that services must forward:

  • x-request-id
  • x-b3-traceid
  • x-b3-spanid
  • x-b3-parentspanid
  • x-b3-sampled
  • x-b3-flags
  • b3

If you’re using LightStep, you must also forward:

  • x-ot-span-context

However, looking at the productpage service source, the only non-B3 header being forwarded is x-request-id. So how is this working? Istio Proxy’s LightStep integration supports a special tag called guid. Because the x-request-id header is forwarded and is tagged as a guid on all of the Istio Proxy spans, LightStep can infer the ordering and parentage of the spans from that information alone. The productpage source still needs a change to forward the remaining headers properly. For the other applications, here are the places where the headers are captured and forwarded:

  • Details (Ruby): headers captured and forwarded
  • Reviews (Java): headers captured and forwarded

As you can see from the list above, there may be many headers to forward if you want to support Zipkin/Jaeger B3 headers, OpenTracing headers, and Istio Proxy (Envoy) headers. With the standardization of tracing headers in W3C Trace Context and OpenTelemetry, this should become much simpler in the future.
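
To make that forwarding concrete, here is a minimal sketch of what header propagation can look like in a Python (Flask) service, in the spirit of the BookInfo productpage implementation. The helper name, route, and downstream URL are illustrative assumptions rather than code from the sample app.

from flask import Flask, request
import requests

app = Flask(__name__)

# Headers Istio asks every service to forward so the proxies can stitch spans together.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "x-ot-span-context",  # needed for the LightStep integration
]

def forward_trace_headers(incoming):
    """Copy whichever tracing headers are present on the incoming request."""
    return {name: incoming.headers[name] for name in TRACE_HEADERS if name in incoming.headers}

@app.route("/productpage")
def productpage():
    # Pass the tracing headers through on every downstream call, even though
    # this service never inspects them itself.
    headers = forward_trace_headers(request)
    details = requests.get("http://details:9080/details/0", headers=headers)
    return details.text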

Google’s June 2nd Outage: Their Status Page != Reality

Previously we’ve written about having hard conversations with cloud providers. On Sunday, June 2nd, Google Cloud Platform had an extended networking outage. There was significant disruption of commonly used services like YouTube and Gmail, as well as Google-hosted applications like Snapchat. The incident currently associated with the outage, 19009, indicates a start time of 12:53 US/Pacific and a resolution time of 16:56 US/Pacific. LightStep Research’s ongoing synthetic testing shows that the impact lasted longer than the advertised incident window and provides an example of the type of evidence you can share with a cloud provider when discussing an outage.

Summary of Findings

  • From 11:48 to 11:53, access from us-east1 to GCS regional buckets in us-east1 was completely disrupted.
  • From 11:48 to 12:10, latency for at least 50% of requests was significantly higher from us-east1 and us-central1 to GCS regional buckets in us-east1, us-central1, and europe-west2.
  • From 12:10 to 14:53, access was significantly slower for 5% or more of requests both inside and outside of the us-east1 region.
  • From 11:48 to 12:03, latency was also elevated for europe-west2 to europe-west2 regional bucket access.

us-east1 Metrics

[Graph: requests from us-east1 to us-east1]

This screenshot is from the LightStep application’s historical view, Streams. Latency is shown on the top with lines for 50th, 95th, 99th, and 99.9th percentile (p50, p95, p99, p99.9). For the p50 line, this means that 50% of requests took more than the displayed time. Similarly for p95, 5% of requests took more than the displayed time. Below the latency graph, the request rate is shown. For this test, there are 50 requests made every minute, leading to the displayed rate of slightly less than 1 request per second. At the bottom is the error rate percentage, meaning the number of errors divided by the total number of requests.

This graph shows requests from a Google Cloud Function in us-east1 to a Google Cloud Storage regional bucket in us-east1. Following the start of the outage, there is an approximately 5-minute gap where no requests are successfully made. Relatively quickly, about 22 minutes after the start of the outage, p50 latency has recovered to the previous normal value. However, p95 latency does not recover until approximately 2 hours and 43 minutes after the p50 recovery.

[Graph: requests from us-east1 to europe-west2]

This graph shows a similar sequence of events for requests from Google Cloud Functions in us-east1 to a regional bucket in europe-west2. However, it does not show the gap in requests, suggesting that requests to europe-west2 were more likely to succeed than same-region requests, which is an interesting finding.

[Graph: requests from us-east1 to us-central1]

This graph shows requests from us-east1 to us-central1. The recovery in this case is less clear, and there appears to be a further, though less severe (affecting only p99 and p99.9), disruption at the end of the displayed time window.

us-central1 Metrics

[Graph: requests from us-central1 to us-central1]

This graph shows us-central1 to us-central1 same-region request traffic. Though the GCP incident report states that the disruption was in the east, traffic within the central region was impacted through most of the outage window.

[Graph: requests from us-central1 to europe-west2]

Traffic to europe-west2 from us-central1 shows the same pattern.

[Graph: requests from us-central1 to us-east1]

As expected, impact from us-central1 to us-east1 is more severe in terms of peak latencies. The time frame matches the other observations.

europe-west2 Metrics

[Graph: requests from europe-west2 to europe-west2]

This graph of same-region requests from europe-west2 to europe-west2 shows that latency was disrupted in a supposedly unaffected region, for a duration matching the p50 recovery in other regions. From this, we can see that “high levels of network congestion in the eastern USA” also had a much broader impact than just us-east.

Conclusions

Real-time observability of cloud service API performance is necessary for a timely understanding of the scope and severity of an outage’s impact on your organization. Status page updates will often be delayed by tens of minutes and will not include enough detail to be actionable. Reliable, high-resolution performance graphs enable you to understand impact beyond what is documented on the status page, and to have the hard conversations you need with your cloud providers (with the data to support your case).

Migrating to Microservices? Here’s How to Have Reliable APIs from Day One

Starting the migration from monolith to microservices can be daunting. Still more daunting is to have spent a couple of years on it and still not understand what “done” looks like. If you have an ORM-based monolith, there’s a strong temptation to do a data-first migration: to move a model or set of models into a CRUD service and then call it over HTTP instead of through the database.

At first it seems like this is the easiest way to get to services and to “break the monolith.” The truth is that most often this path ends with a distributed monolith with tightly coupled models and APIs that, while tolerable, do not bring joy.

You can instead migrate using an API-first approach to create interfaces that you want to work with for years.

We Mean Business (Processes)

So, if not a data-model-first migration, how do you start? Begin by asking which business processes are well understood, both in terms of their logic and the data they transform. For those processes, what methods exist to expose that data for transformation, and what other processes transform the same data? By implementing and testing the business process, and potentially also the data store behind it, as an API, you can see how working with it feels, or whether it’s not quite the right way to look at the process of change in the system.

Keeping clear on which type of service and API you’re creating helps you make better tradeoffs in design and implementation. For example, a business process service concentrates the business logic for a set of transformations to data. A data store service concentrates data that is closely interrelated or generally changes together.
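
As a sketch of that distinction, here is what the two kinds of interface might look like in Python; the domain, service, and method names are hypothetical, not taken from any particular codebase.

from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    status: str
    total_cents: int

class OrderStore:
    """Data store API: a designed interface over closely related data,
    not a copy of the underlying ORM model."""

    def get(self, order_id: str) -> Order:
        raise NotImplementedError

    def save(self, order: Order) -> None:
        raise NotImplementedError

class CheckoutService:
    """Business process API: concentrates the logic for a set of transformations
    and uses the data store API rather than reaching into the database directly."""

    def __init__(self, orders: OrderStore):
        self.orders = orders

    def complete_checkout(self, order_id: str) -> Order:
        order = self.orders.get(order_id)
        order.status = "confirmed"
        self.orders.save(order)
        return order

Keeping the two shapes distinct, even if they are initially deployed together, preserves the option to split them apart later without redesigning the interfaces.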

Fail Faster

You may find yourself creating data store APIs in your monolith, as for you it’s the easiest way to expose data for transformation with external business process APIs. That’s ok. You may find yourself using data store APIs from the monolith. That’s ok too, as long as the data store API is a designed interface, not just a copy of a model. (Implementing an “anti-corruption layer” will help keep you from depending on assumptions built into the current set of models.)

[Diagram: data-first and API-first migrations compared]

Taking an API-first approach forces the choice of what kind of service something will be. Instead of copying models across and then putting CRUD on top of them, you need to decide how to interact with the data first, what scope it has, which transformations you’ll want to expose and which you’ll want to hide. Then, more importantly, implement the API as quickly as possible and see how it is to use from the current clients or monolith.

I’m a fan of gRPC/Protobuf-driven design for quick prototyping with tools like Truss. Even if you don’t want to use Protobuf as an interface or gRPC as an RPC method, it can help you do API-first testing quickly. It’s always acceptable to “burn the prototype” and start over once you’ve validated the API design.

Figuring out in days — instead of years — that the API you originally designed is unwieldy when doing common activities is career changing. It’s life changing.

Separate and Validate

If the cost to set up a new service in your environment is high, you may end up combining business process and data store APIs on a single service, but I strongly suggest that you keep them distinct from a modeling and implementation standpoint. The business process part of the service should use the data store APIs to do data transformations, instead of going directly against the underlying data store. You want to be continually validating your APIs’ practical usefulness and refactoring if they fall short.

Perfect Is the Enemy of Good

Taking an API-first approach to microservices migrations allows you to learn faster, develop APIs you actually want to work with, and have a clear place to address data model and interface tech debt. Rapid prototyping and validation of assumptions about use and performance will save you months or even years of time.

Don’t try to design the perfect migration plan or even the perfect interface. Prototype, use, change, evolve. Building services this way gives you a taste of the future agility you’ll experience and starts you down that path.

How to Have Tough Conversations with Cloud Providers, Vendors, and Everyone Else You Are Paying

Error rates skyrocket. A critical service slows. SLAs breach by the dozen.

When things break — and one of your vendors is clearly at fault — even the most seasoned engineers can lose their cool.

“What the heck is going on! Why didn’t we know about this? Fix it! Fix it now!”

In situations like this, it’s easy to get upset. But, of course, this won’t help anyone.

 

You turn the screws,
You tear down the bridge,
Flimsy as it is,
It’s business like
— Cake, You Turn the Screws

 

Effective vendor relations are ultimately about the timely resolution of issues. Yet, this is only made possible through strong relationships, data-driven communications, and empathy on both sides.

When you build your business on other companies’ services (cloud storage, APIs, etc.), you get to focus on your unique value, but you also become dependent on people outside your team to quickly fix issues that are breaking your business. In order to do that, they need to trust you, and you need to provide them with accurate information.

Vendors, Vendors Everywhere

The rate of business growth today, particularly in startups, is driven largely by effective use of vendors. A great idea even with great developers can only go so far. There are many other things necessary to make customers aware of a product, to deliver that product, and to understand how that product is used. You might use a cloud platform provider for compute or serverless, managed databases, or APIs for payments, telecommunications, chat, or shipping. Even though these vendors are not part of the core value proposition for a company, they will be critical to the customer experience and the company brand.

Over the past five years I’ve acted as the primary technical contact for many vendors, including cloud, observability, alerting, and auditing companies. I’d like to share some key lessons that can make everyone happier in the relationship (or at least help them understand what they’re not happy about).

1. Start With Data

Whether your point of contact with a vendor is email support, a web form, chat, or paging a TAM, having data in a consistent format that’s been shown to be effective is the fastest way to get something fixed. Time series charts with annotations often provide significantly faster escalation and resolution.

Many people seem to believe that if the service is down for them, it’s down for everyone and the vendor must clearly know. This is rarely the case. You may be the only company that has business impact or that has the observability tooling to see the issue (or both).

Thus, every initiation of vendor contact will be more successful if it includes:

  1. A screenshot / link / graphical example of the behavior to be discussed
  2. A textual explanation of how that information is being interpreted
  3. What that means to you as a customer, either realized impact or projected short-term risk for impact to your business
  4. A specific question or requested action

[Image: an example vendor communication]

2. Be a Partner

When something is seriously wrong, there is a lot of stress, and it’s often the case that you’re feeling more pain than your vendor. Still, it is generally much more effective to delay a desire to “turn the screws” until well after the incident. Treating the contacts at the vendor as partners (rather than outsiders) who are working together with you to solve a shared issue has worked magic for me. Time and time again. It allows you to easily ask questions like the following and get real answers:

  • What other information can I share with you to help make the problem clearer?
  • Are you seeing the same thing?
  • How are you currently prioritizing this?
  • How can we help?
  • Is there anything that isn’t as impacted?

Vendors have to map what you see onto their own systems, which often look quite different from your view. Every opportunity you have to build their trust in, and empathy with, your view of their systems is a way to accelerate resolution.

3. Admit Error

Sometimes it really isn’t the vendor’s systems. The faster you can let them know that you’ve found the issue and that it wasn’t them, the fewer resources they waste and the more they view you as trustworthy, even if you were wrong this time. If you don’t update a vendor in a timely fashion, they can waste many hours trying to find something that doesn’t exist.

Fast forward to the next time you ask for help and request they prioritize resources for you: Your vendor will hesitate. They will likely place someone else’s needs above your own. Don’t let that happen.

A Need for Observability

When you look to interact with any vendor, especially in a real-time scenario, you must think through how you’ll know whether your usage of that vendor’s offering is actually working. You also need to understand how to see that interaction in the context of your key business transactions. Without these two pieces of information, you can’t take a proper data-driven approach, nor can you be a good partner. You will also find yourself unwilling, or even unable, to admit error, since you won’t be able to find the causes of issues.

There is a common theme to all three lessons: without observability that provides consistent, trustworthy, actionable information, you cannot have effective vendor relations.

How to Get Started with Chaos: A Step-by-Step Guide to Gamedays

When you first start deploying applications in the cloud, it can feel amazing. You just tell the system to do something and suddenly your code is available to everyone. A bit later though, you’ll likely experience failure. It could be failure of the instance running the code, networking to clients, networking to databases, or something else.

After a while, that failure can seem like Chaos: uncontrolled, unpredictable, unwelcome.

Enter Chaos

It’s often from this place that you may hear about Chaos Engineering and wonder “why would I ever want to do that?!” Chaos Engineering seeks to actively understand the behavior of systems experiencing failure so that developers can decide, design, implement, and test resilience strategies. It grows out of knowing that failure will happen, but you can choose to see it with a clear head at 2 p.m. instead of confused, half awake, and stressed out at 2 a.m.

“Everything fails all the time”
— Werner Vogels, VP & CTO at Amazon

 

Chaos Gamedays

Chaos Gamedays are an ideal way to ease into Chaos Engineering. In a Chaos Gameday, a “Master of Disaster” (MoD) decides, often in secret, what sort of failure the system should undergo. He or she will generally start with something simple like loss of capacity or loss of connectivity. You may find, like we did, that until you can easily and clearly see the simple cases, doing harder or more complex failures is not a good way to build confidence or spend time.

So, with that said, let’s take a look at how to run a gameday.

Chaos Gameday: Planned Failure

With the team gathered in one room (physical or virtual), the MoD declares “start of incident” and then causes the planned failure. One member of the team acts as first on-call and attempts to see, triage, and mitigate whatever failure the MoD has caused. That person is strongly encouraged to “page” other members of the team and bring them in to help understand what’s happening. Ideally the team will find and solve the issue in less than 75% of the allocated time. When that has been done or the time allocated for response has ended, the MoD will reverse the failure and the team will proceed to do a post mortem of the incident.

Chaos Gameday: Escalation

It is entirely possible that, when starting out, the team will be unable to find or solve the problem. The Master of Disaster can escalate the failure to make it more visible, because often full outages are the only observable failures. Don’t be too worried if this happens: Observability that hasn’t been tested for failure scenarios often does not show them. Knowing this is the first step in fixing your instrumentation and visualization, and ultimately giving your customers a better experience.

Chaos Gameday: Post Mortem

The post mortem should follow your usual incident process (if there is one) and/or follow best practices like PagerDuty’s. Running effective post mortems is a broad topic, but I’d encourage you to include sharing perspectives, assumptions that were made during the response, and expectations that didn’t match the behavior of the system or the observability tooling. Coming out of the post mortem, you should have a set of actions that first fix any gaps in observability for the failure scenario. You will also likely have some ideas about how to improve resilience to that failure.

The key to the Chaos Gameday process is to, at the very least, repeat the failure and validate the specific changes to observability and resilience that were made to the application.

How Chaos Gamedays Can Transform Your Team

If you follow this process regularly, you will see a transformation in your team. Being first on-call for Chaos Gamedays, even though it’s not “real”, builds composure under pressure for on-call during production outages. Not only do your developers gain confidence in their understanding of the systems and how they fail, they also get used to the pressure and learn to be OK with it.

Some concrete benefits:

  • A more diverse on-call rotation, inclusive of those who do not feel comfortable with a “thrown in the deep end” learning process.
  • Developers encounter failure with up-to-date mental models of how the systems behave, instead of relying on whatever they remember from the last time they happened to be on call during a failure.
  • Leaders have confidence that new team members are ready to handle on-call and have clear ways to improve effectiveness.

The transformation in systems is as dramatic. Developers, since they regularly experience failure as part of their job, start designing for failure. They consider how to make every change and every system observable. They carefully choose resilience strategies because the vocabulary of resilience is now something they simply know and speak.

It’s not just that systems become resilient to the specific failures induced in a specific Chaos Gameday; they become resilient, by design, to all the scenarios that the developers know exist and are likely.

Starting the Journey of Chaos Engineering is as simple as a “sudo halt”. Following the path will grow your team and your systems in ways that are hard to imagine at first, but truly amazing to see become real. If you would like confident on-call, happy developers, and resilient systems, I encourage you to start that journey. We’re happy to help. Feel free to reach out at @1mentat.

What Happens When Your Cloud Integration Starts without Observability

Cloud services have changed the way applications are developed. They allow teams to focus on their value proposition, product, and customers. As part of evaluating a cloud service, you might talk to friends, look at recent feature additions, or speak with sales about their roadmap. You may be choosing a cloud service to offload the operational burden to someone else, but just because you’ve offloaded that burden doesn’t mean the service can’t fail.

Even if, or perhaps especially if, you’re a small company, understanding what will happen to your customers and to your business if the cloud service fails becomes key. It pays to look at status pages, recent outages, or public post mortems. However, the true test comes when you integrate the new feature with the cloud service and see how your workload interacts with the service’s performance (or failure).

Delaying this assessment until late in a project creates substantial risk.

Mind the (Instrumentation) Gap

Working on a recent project, my team had created what we believed to be a scalable and resilient architecture. We were just starting to use Chaos Engineering to test out our resilience and observability. During the second round of testing I was responsible for determining the experiments to try. I thought it might be interesting to test how we would observe a third-party cloud component (let’s call it PipelineAPI) failure.

The result of the experiment was that we didn’t — and couldn’t — observe PipelineAPI failure.

After a couple of sprints dedicated to closing the instrumentation gap for all of the cloud services in the project, we started to see, during normal operation, significant performance variance in one of the cloud services in the critical path, which we’ll call LogsAPI. After dashboard screenshots, discussions with support, and eventually a meeting with the product manager for LogsAPI, it became clear that it was not designed to support our use case. We pivoted to another, similar cloud service, which we’ll call BigDataAPI, with all of the instrumentation in place from the start. We observed consistent latency and consistent availability of the data in the data store, with no change over time or with increasing amounts of data. With this data we gained confidence that BigDataAPI would be able to support our use case and growth.

Instrumentation Is More Than the Code You Write

To be honest, we got lucky. We hadn’t thought through how we would observe the performance of third-party cloud services as we scaled out the system. We didn’t instrument some of the core functionality of our system because it was not code we were writing or testing. If we’d launched without that testing and observability, our system would have failed at even 5% of the target traffic level. Instead, we scaled smoothly through more than 100x growth in the first two weeks and had a deep understanding of the performance and resilience of our system the entire time.

It is better not to have to be lucky. Measure the performance and availability of any cloud service you are designing into a solution, as soon as possible and well before full production deployment. By watching trends in outlier performance, it becomes easier to see whether the cloud service is keeping up with your testing or canary traffic. Conversations with cloud service vendors about expectations and performance are easier when you have consistent, high-resolution data to support your observations and questions.
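
As a minimal sketch of that kind of measurement, assuming Python and a hypothetical storage client (the names below are placeholders, not a real vendor SDK), you could record the latency of every cloud-service call so that outlier percentiles can be trended over time. In a real system these samples would flow into your metrics or tracing backend rather than an in-process list.

import math
import time
from collections import defaultdict

# Latency samples per operation, in milliseconds.
latencies_ms = defaultdict(list)

def timed_call(operation, fn, *args, **kwargs):
    """Run a cloud-service call and record its latency under `operation`."""
    start = time.monotonic()
    try:
        return fn(*args, **kwargs)
    finally:
        latencies_ms[operation].append((time.monotonic() - start) * 1000)

def percentile(samples, p):
    """Nearest-rank percentile of the recorded latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Usage with a hypothetical client:
#   obj = timed_call("storage.get_object", storage_client.get_object, bucket, key)
#   print(percentile(latencies_ms["storage.get_object"], 95))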

No matter what you are building, the customer has expectations of your system. Closely observing your cloud service integrations is the best and easiest way to make sure you meet those expectations.

How to Launch a Distributed Tracing MVP with Just 50 Lines of Code

There comes a time in every successful technology company’s life when there’s a realization that it’s not quite clear what’s happening in production, and that lack of clarity is impacting customers. That point may come with a monolith, a distributed monolith, SOA, microservices, or, often, a mix of them all. Perhaps someone says, “let’s do distributed tracing, it should solve all of our observability problems.” However, when you look at the investment involved, you may end up thinking, “that’s hundreds of thousands of dollars in people’s time, not even counting the cost of the service; how can I possibly justify that?”

This post is for people facing this same question I faced three years ago.

Wearing the Customer’s Shoes

Working at Twilio on the Insight Engineering team, I had the opportunity to spend a few months looking at what it would take to “do distributed tracing.” Twilio had hundreds of services in several languages. There was significant “migration fatigue” after OS version, instance generation, and Classic to VPC moves. The appetite for another cross-team effort was low.

At first it seemed like an impossible problem: Distributed Tracing would require efforts across teams in different languages, different frameworks, all on different schedules. It was an impossible problem looking at it that way. But somehow I needed to find the MVP for getting started with distributed tracing.

Twilio has a saying, “wear the customer’s shoes.” Reflecting on this, I decided that the best way to do that was to start instrumenting as close to the customer as possible, at the API edge service. By starting there I would see what each customer was experiencing for the entire time our platform was handling the request for each endpoint and method. I could tag each trace after authentication so that we could see a particular customer’s experience. Even better, when we decided to instrument further into the services that handled any given request, we’d always have that “customer’s shoes” context to start with.

Getting to a Root Cause

In the spirit of a minimum viable product, I put together a PR of less than 50 lines. For every request received, it would create a span representing the amount of time it took us to respond to that request. It used a standard prefix indicating it was a public API request, the standard reference name for the API resource, and tags for the method and customer. I also wrapped every request where the API service acted as a client to other services, tagging those spans with the downstream service and method. After some experiments, including some Chaos testing in staging, I was cleared to deploy a canary to production.
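
The original PR is not public, but a minimal sketch of the idea, assuming Python and the OpenTracing API (the operation, tag, and parameter names are illustrative), looks roughly like this:

from opentracing.ext import tags

def handle_request(tracer, resource, method, customer_id, do_work):
    """One span per incoming public API request, tagged with method and customer."""
    with tracer.start_active_span("public-api." + resource) as scope:
        scope.span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_SERVER)
        scope.span.set_tag(tags.HTTP_METHOD, method)
        scope.span.set_tag("customer.id", customer_id)  # tagged after authentication
        return do_work()

def call_downstream(tracer, service, method, client_call):
    """A child span around each call the edge service makes as a client to other services."""
    with tracer.start_active_span(service + "." + method) as scope:
        scope.span.set_tag(tags.SPAN_KIND, tags.SPAN_KIND_RPC_CLIENT)
        scope.span.set_tag(tags.PEER_SERVICE, service)
        return client_call()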

Though we had metrics and simple histograms before, what we could see with this view, especially over time, was a game changer. The canary happened to be deployed during a performance issue with a downstream service. I was able to bring the cause of the issue to both the API and service teams quickly, and they were able to roll back within minutes. With this demonstration of the capability of tracing, there was suddenly interest in removing the API team from the critical path for identifying the root cause of performance regressions or outages.

A Playbook for Launching Your Distributed Tracing MVP

The 50-line PoC turned into a purposeful refactor of the request handling and client code to provide a simple, single point of integration for tracing. Overall, the resulting changes were less than 200 lines of code and a bit more than one week of one API team engineer’s time, substantially less than the 20 or so person-years of effort it had originally appeared to require.

If you’re wondering how to get started with tracing, consider using this pattern as a playbook:

  1. Identify a part of your service that’s as close to your customer as you can get.
  2. Look for patterns in how that service receives requests that enable you to instrument once or at most a handful of times.
  3. Find patterns in how that service makes requests to other services, SaaS APIs, and databases.
  4. Follow production deployment steps (staging, canaries, or whatever other risk management strategies your company uses) and start getting real data.
  5. Compare trace data with other metrics and understand the potential cause of differences.
  6. Observe the visibility of failure, either “naturally” or induced by Chaos testing.
  7. After you’ve found a key use case, continue to make measured investments driven by observed value.

While this approach is helpful, at some point you will face the challenge of perspective. If you only have the edge’s client-side perspective and the server’s perspective differs, you’ll need to figure out where the truth lies between them.

In future posts, I’ll cover mobile- and browser-based perspectives, integrating a service mesh into your tracing, and methods for adding internal services using frameworks or middleware.

If you have any questions about getting started with a Distributed Tracing MVP, you can reach me on Twitter @1mentat.

Why I Joined LightStep

I still remember my first solo on-call experience at Twilio. I’d shadowed on call for two weeks out of the previous eight. I was supposed to be prepared. It was around midnight when the page came in: there was an issue with networking in Australia, and we were down. It was the beginning of the evening rush for critical customers. I represented cloud operations; I was supposed to know our cloud provider inside and out, to be the subject matter expert. Looking at the behavior of the network, I was lost. Gradually other teams who had moved their traffic out of Australia dropped off the call, but I was still there, trying to get resolution from cloud provider support for hours. To this day, I’m not sure what happened, but what I remember most of all was the feeling of helplessness, of uncertainty, of powerlessness.

As time has gone by and I’ve come to understand development and operations better, my passion has grown for making sure that when people, whether developers, operators, or DevOps engineers, are put in that position of pressure, of needing to find an answer, they have the tools and the training to understand how to take action. They should feel empowered, confident, and able. Through my time at Twilio and then Stitch Fix, I saw the impact that timely insights could provide. I saw that by training with Chaos GameDays instead of incidents, engineers at all levels of experience could become effective at operating their systems under stress. The growth went beyond that: they excelled at designing, developing, and validating systems to be operable.

I’ve seen not only the impact that great observability can have, but also the unique possibilities that LightStep enables with its Satellite Architecture. Teams can quickly understand from a handful of traces what each one needs to do to recover a positive customer experience as quickly as possible during an incident. Developers can see the behavior of their code in production and how it differs from their expectations. Support engineers can directly access and understand their customer’s experience of the service or platform. Everyone can build empathy for their customers and clients, whether internal or external.

I am excited to join LightStep to advocate for and with developers and operators, and to share that there is a better way to experience their systems and how to get there. While I work at LightStep, much of the information and guidance I share will be helpful wherever you are on your journey of observability. In the spirit of OpenTracing, the starting point is to develop a shared understanding of the problem we’re trying to solve, what makes it difficult, and how we can get better together.

I look forward to growing and learning with all of you.