As a performance engineer, your main goal is to improve the speed of your company’s applications. But before you can ask for improvements, you need to know where the issues are, and you can’t waste developer time guessing where to start. Good news: distributed tracing can help you identify the end-to-end requests whose operations contribute to latency.
Get an Overview of Service Performance
Start with a core operation where improvement would have an impact. For example, most of your customers use your web app, so you use the Service Directory to see how that service is performing.
The Service Directory gives you an overview of operations and their performance on every service.
You notice that calls to api-request have increased in latency in the last day, so you select the operation with the highest increase (api-request/get-location) and run a query in Explorer to get a closer look.
Explorer creates latency buckets and shows span data from live transactions based on 100% of the span data from your instrumented services.
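To make the idea of latency buckets concrete, here is a minimal sketch of how span durations can be counted into buckets. It is not Lightstep's implementation: the `bucket_latencies` function, the bucket boundaries, and the sample durations are all assumptions chosen for illustration.

```python
from collections import Counter

def bucket_latencies(durations_ms, bounds_ms=(10, 50, 100, 500, 1000)):
    """Count span durations per bucket.

    Each bucket is labeled by its upper bound in milliseconds; anything
    slower than the last bound lands in an open-ended 'inf' bucket.
    The boundaries are hypothetical, not Lightstep's.
    """
    counts = Counter()
    for duration in durations_ms:
        # Pick the first bound the duration fits under, else the overflow bucket.
        label = next((b for b in bounds_ms if duration <= b), float("inf"))
        counts[label] += 1
    return dict(counts)

# Made-up durations for six spans of a single operation.
histogram = bucket_latencies([4, 12, 48, 75, 220, 1500])
for upper_bound, count in sorted(histogram.items()):
    print(f"<= {upper_bound} ms: {count} span(s)")
```

A histogram like this makes a slow tail easy to spot, because the outliers collect in the rightmost buckets instead of being averaged away.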
Group for Comparisons
With this query, Lightstep shows span data from the selected operation. But you’d like to look at the data from all spans that participate in the same traces as the api-request/get-location operation, to find out whether the latency increase is caused by something upstream or downstream. First, select Show all Spans in Traces, and then group by Operations to compare latency across all operations in those traces. Finally, sort Latency from highest to lowest and voila!
In the Trace Analysis table, you can group by operations for comparison.
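If it helps to picture what grouping and sorting does to the span data, the steps above can be sketched in a few lines. This is an illustration only, assuming spans are available as plain dictionaries; the field names, the `group_by_operation` helper, and the latency values are hypothetical and do not reflect how Lightstep stores or queries spans.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical spans taken from traces that include the selected operation.
spans = [
    {"operation": "dom-load", "latency_ms": 400},
    {"operation": "dom-load", "latency_ms": 440},
    {"operation": "get-bounds", "latency_ms": 300},
    {"operation": "get-bounds", "latency_ms": 340},
    {"operation": "api-request/get-location", "latency_ms": 100},
]

def group_by_operation(spans):
    """Group spans by operation and sort groups by average latency, highest first."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["operation"]].append(span["latency_ms"])
    rows = [
        {"operation": op, "span_count": len(values), "avg_latency_ms": mean(values)}
        for op, values in grouped.items()
    ]
    return sorted(rows, key=lambda row: row["avg_latency_ms"], reverse=True)

rows = group_by_operation(spans)
for row in rows:
    print(row["operation"], row["span_count"], row["avg_latency_ms"])
```

Note that an operation with the highest average latency is not necessarily the one holding up the request, which is why the next step is to open a trace.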
At first glance, the dom-load operation seems to have the most latency. Let’s look at a trace to see where the critical path really is. Select the dom-load span to view the trace.
The Trace view lets you immediately see the critical path through the trace.
Trace View Provides the Details
You notice a few things here. First, Lightstep uses a yellow line to mark where requests are taking the most time. While there is some upfront latency in the dom-load operation, it’s only contributing a little over 2% of the latency. At the bottom of the stack, you notice that the get-bounds operation on the tile-db service is a large part of the critical path. Selecting that span shows that it’s contributing over 77% of the latency!
The Trace view also includes metadata from the span to help you discover root causes.
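For a feel of how a span's share of the critical path can be worked out, here is a simplified sketch. It is one common way to approximate a critical path and is not Lightstep's algorithm. It assumes each child span runs inside its parent's time window, and the `Span` class, the timings, and the `get-tiles` span are made up for illustration (the timings were chosen to echo the percentages above).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    operation: str
    start_ms: int
    end_ms: int

def critical_path_share(spans):
    """Return each operation's percentage of the root span's critical path."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)

    on_path = defaultdict(int)

    def walk(span, limit):
        # Walk backward from the end of the span toward its start.
        cursor = min(span.end_ms, limit)
        for child in sorted(children[span.span_id], key=lambda c: c.end_ms, reverse=True):
            if child.start_ms >= cursor:
                continue  # ran in parallel with work already counted
            child_end = min(child.end_ms, cursor)
            on_path[span.operation] += cursor - child_end  # parent working alone
            walk(child, child_end)
            cursor = child.start_ms
        on_path[span.operation] += cursor - span.start_ms

    walk(root, root.end_ms)
    total = root.end_ms - root.start_ms
    return {op: 100 * t / total for op, t in on_path.items()}

# Hypothetical trace: get-tiles runs in parallel with get-bounds and finishes first.
trace = [
    Span("a", None, "dom-load", 0, 1000),
    Span("b", "a", "api-request/get-location", 25, 1000),
    Span("c", "b", "get-bounds", 200, 975),
    Span("d", "b", "get-tiles", 300, 600),
]
shares = critical_path_share(trace)
for operation, percent in shares.items():
    print(f"{operation}: {percent:.1f}% of the critical path")
```

In this made-up trace the parallel get-tiles span never appears in the result, because speeding it up would not shorten the request. That is the practical value of the critical path: it tells you which spans are worth optimizing.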
Share Snapshots of Data
It’s obvious that latency in the get-bounds operation requires some investigation. It seems that the query “GET bounds WHERE region='usa'” is the problem, so you share the trace with the team that owns the tile-db service to verify the hypothesis. When you share a view from Lightstep, you’re sharing a Snapshot of the data. Anyone using the link to view the trace will see the same data you did when you ran your query. They can go to Explorer and see exactly what you saw.
Viewing a shared Snapshot means Lightstep uses saved data rather than live data, so everyone’s on the same page.
On to the next improvement!
October 2, 2019
About the author
Robin Whitmore