Improve System Performance: How to Find Long-Running Operations
by Robin Whitmore
As a Performance Engineer, your main goal is to improve the speed of your company’s applications. But before you can ask for improvements, you need to know where the issues are. And you can’t waste developer time by just guessing where to start. Good news: Distributed tracing can help you identify end-to-end requests containing operations that contribute to latency.
Start with a core operation where improvement would have an impact. For example, most of your customers use your web app, so you use the Service Directory to see how that service is performing.
The Service Directory gives you an overview of operations and their performance on every service.
You notice that calls to api-request have increased in latency in the last day, so you select the operation with the highest increase (api-request/get-location) to run a query in Explorer to get a closer look.
Explorer creates latency buckets and shows span data from live transactions based on 100% of the span data from your instrumented services.
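To make the idea of latency buckets concrete, here is a minimal sketch of how span latencies can be counted into a histogram. It is an illustration only, not how Explorer is implemented: the function name `latency_histogram`, the fixed 50 ms bucket width, and the sample latencies are all assumptions made up for this example.

```python
from collections import Counter


def latency_histogram(latencies_ms, bucket_width_ms=50):
    """Count spans per fixed-width latency bucket, keyed by the bucket's lower bound."""
    counts = Counter()
    for latency in latencies_ms:
        # Round down to the start of the bucket this latency falls into.
        lower = int(latency // bucket_width_ms) * bucket_width_ms
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Invented sample latencies in milliseconds (not real Lightstep data).
sample = [12, 48, 51, 75, 130, 149, 410]
histogram = latency_histogram(sample)

for lower, count in histogram.items():
    print(f"{lower:>4}-{lower + 50:<4} ms  {'#' * count}")
```

A single span far to the right of the others, like the 410 ms value here, is the kind of outlier that is worth opening as a trace.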
With this query, Lightstep shows span data from the selected operation only. But you’d like to look at the data from all spans that participate in the same traces as the api-request/get-location operation, to find out whether the latency increase is caused by something upstream or downstream. First, select Show all Spans in Traces, then group by Operations to compare latency across all operations in those traces. Finally, sort Latency from highest to lowest, and voilà!
In the Trace Analysis table, you can group by operations for comparison.
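The group-and-sort step above can be sketched in a few lines of Python. This is a rough illustration of the idea, not Lightstep's implementation: the span dictionaries, their field names (`operation`, `latency_ms`), and every latency value are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_by_operation(spans):
    """Group spans by operation and sort by average latency, highest first."""
    latencies = defaultdict(list)
    for span in spans:
        latencies[span["operation"]].append(span["latency_ms"])

    rows = [
        {"operation": op, "count": len(values), "avg_latency_ms": mean(values)}
        for op, values in latencies.items()
    ]
    return sorted(rows, key=lambda row: row["avg_latency_ms"], reverse=True)


# Invented spans from traces that include api-request/get-location.
spans = [
    {"operation": "get-bounds", "latency_ms": 600},
    {"operation": "dom-load", "latency_ms": 900},
    {"operation": "api-request/get-location", "latency_ms": 700},
    {"operation": "dom-load", "latency_ms": 1100},
    {"operation": "get-bounds", "latency_ms": 640},
    {"operation": "api-request/get-location", "latency_ms": 900},
]

for row in group_by_operation(spans):
    print(f'{row["operation"]:<28} {row["count"]:>3} spans  {row["avg_latency_ms"]:>7.1f} ms')
```

Note that a parent operation's latency includes the time spent in its children, so the operation at the top of this list is not necessarily the one causing the slowdown. That is why the next step is to open a trace.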
At first glance, the dom-load operation seems to have more latency. Let’s dive deeper and look at a trace to see where the critical path really is. Select the dom-load span to view the trace.
The Trace view lets you immediately see the critical path through the trace.
You notice a few things here. First, Lightstep uses a yellow line to mark where requests are taking the most time. While there is some upfront latency in the dom-load operation, it’s only contributing a little over 2% of the latency. At the bottom of the stack, you notice that the get-bounds operation on the tile-db service is a large part of the critical path. Selecting that span shows that it’s contributing over 77% of the latency!
The Trace view also includes metadata from the span to help you discover root causes.
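To see why a span at the top of a trace can contribute so little while one at the bottom dominates, it helps to sketch how critical-path contributions can be computed. The following is a simplified, commonly used approach and not necessarily the algorithm Lightstep uses: walk each span backward from its end time, follow the child that finished last, and count only the time the span spent with no critical child running. The span fields, the `cache-lookup` operation, and all timestamps are invented, and the sketch assumes children fall within their parent's time range and that operation names are unique within the trace.

```python
from collections import defaultdict


def critical_path_self_times(spans):
    """Return {operation: ms spent on the critical path} for a single trace."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent"] is None:
            root = span
        else:
            children[span["parent"]].append(span)

    self_times = {}

    def walk(span):
        cursor = span["end"]  # walk backward in time from the span's end
        self_time = 0
        by_latest_end = sorted(children[span["id"]], key=lambda s: s["end"], reverse=True)
        for child in by_latest_end:
            if child["end"] <= cursor:
                # Gap between this child finishing and the cursor is the parent's own work.
                self_time += cursor - child["end"]
                walk(child)
                cursor = child["start"]
            # Otherwise the child overlaps a later-finishing sibling and is off the path.
        self_time += cursor - span["start"]
        self_times[span["operation"]] = self_time

    walk(root)
    return self_times


# Invented trace; times are milliseconds since the start of the trace.
trace = [
    {"id": 1, "parent": None, "operation": "dom-load", "start": 0, "end": 1000},
    {"id": 2, "parent": 1, "operation": "api-request/get-location", "start": 25, "end": 1000},
    {"id": 3, "parent": 2, "operation": "get-bounds", "start": 150, "end": 925},
    {"id": 4, "parent": 2, "operation": "cache-lookup", "start": 100, "end": 400},
]

self_times = critical_path_self_times(trace)
total = sum(self_times.values())
for operation, ms in sorted(self_times.items(), key=lambda item: item[1], reverse=True):
    print(f"{operation:<28} {ms:>5} ms  {100 * ms / total:5.1f}%")
```

In this made-up trace, `cache-lookup` runs entirely in parallel with `get-bounds`, so it never appears on the critical path, and speeding it up would not make the request any faster.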
It’s obvious that latency in the get-bounds operation requires some investigation. It seems that the query “GET bounds WHERE region='usa'” is the problem, so you share the trace with the team that owns the tile-db service to verify the hypothesis. When you share a view from Lightstep, you’re sharing a Snapshot of the data. Anyone using the link to view the trace will see the same data you did when you ran your query. They can go to Explorer and see exactly what you saw.
Visiting a shared Snapshot means Lightstep uses saved data rather than live data, so everyone’s on the same page.
On to the next improvement!