Improve System Performance: How to Find Long-Running Operations

As a Performance Engineer, your main goal is to improve the speed of your company’s applications. But before you can ask for improvements, you need to know where the issues are, and you can’t waste developer time by guessing where to start. Good news: distributed tracing can help you identify end-to-end requests containing operations that contribute to latency.

Get an Overview of Service Performance

Start with a core operation where improvement would have an impact. For example, most of your customers use your web app, so you use the Service Directory to see how that service is performing.

The Service Directory gives you an overview of operations and their performance on every service.

You notice that calls to api-request have increased in latency in the last day, so you select the operation with the highest increase (api-request/get-location) to run a query in Explorer to get a closer look.

Explorer creates latency buckets and shows span data from live transactions based on 100% of the span data from your instrumented services.
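To make the bucketing idea concrete, here is a minimal sketch of how spans might be split into percentile-based latency buckets. This is an illustration only, not Lightstep’s implementation; the bucket labels and the nearest-rank percentile method are assumptions.

```python
def percentile(sorted_ms, p):
    """Nearest-rank percentile of a pre-sorted list of durations (ms)."""
    idx = max(0, int(round(p / 100 * len(sorted_ms))) - 1)
    return sorted_ms[idx]

def bucket_spans(durations_ms):
    """Group span durations into latency buckets at p50, p95, and p99."""
    ordered = sorted(durations_ms)
    cutoffs = [(f"<=p{p}", percentile(ordered, p)) for p in (50, 95, 99)]
    labeled = {}
    for d in durations_ms:
        # Assign each span to the first bucket whose cutoff it fits under.
        label = next((name for name, cut in cutoffs if d <= cut), ">p99")
        labeled.setdefault(label, []).append(d)
    return labeled
```

Spans above the p99 cutoff land in the slowest bucket, which is typically where an investigation like this one starts.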

Group for Comparisons

With this query, Lightstep shows span data from the selected operation. But you’d like to look at the data from all spans that participate in the same trace as the /get-locations operation, to find out whether the latency increase is caused by something upstream or downstream. First, select Show all Spans in Traces and then group by Operations to compare latency across all operations in those traces. Finally, sort Latency from highest to lowest, and voilà!
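The grouping step above can be sketched in a few lines: collect spans by operation name, then rank operations by latency. The span records, operation names, and durations below are invented for illustration; they are not real Lightstep data.

```python
# Hypothetical span data, as might be returned for spans in the same trace.
spans = [
    {"operation": "dom-load", "duration_ms": 1200},
    {"operation": "api-request/get-location", "duration_ms": 340},
    {"operation": "dom-load", "duration_ms": 900},
    {"operation": "tile-db/get-bounds", "duration_ms": 2700},
]

# Group durations by operation, like the Trace Analysis table's group-by.
by_op = {}
for span in spans:
    by_op.setdefault(span["operation"], []).append(span["duration_ms"])

# Sort highest average latency first, mirroring the sorted table view.
ranked = sorted(
    ((op, sum(ms) / len(ms)) for op, ms in by_op.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

The top of the ranked list is where you look first, though as the next section shows, a single sorted view can still be misleading about the critical path.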

In the Trace Analysis table, you can group by operations for comparison.

At first glance, the dom-load operation seems to have the most latency. Let’s dive into a trace to see where the critical path really lies. Select the load span to view the trace.

The Trace view lets you immediately see the critical path through the trace.

Trace View Provides the Details

You notice a few things here. First, Lightstep uses a yellow line to mark where requests are taking the most time. While there is some upfront latency in the dom-load operation, it’s only contributing a little over 2% of the latency. At the bottom of the stack, you notice that the get-bounds operation on the tile-db service is a large part of the critical path. Selecting that span shows that it’s contributing over 77% of the latency!
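A span’s contribution can be understood as its self time (its duration minus the time spent in its child spans) as a share of the whole trace. The sketch below illustrates that arithmetic with invented span IDs and durations chosen to roughly match the percentages above; it is not how Lightstep computes the critical path internally.

```python
# Hypothetical trace: span_id -> (parent_id, duration_ms).
spans = {
    "load":       (None,   1000),
    "dom-load":   ("load",   20),
    "api":        ("load",  950),
    "get-bounds": ("api",   770),
}

def self_time(span_id):
    """Span duration minus time attributed to its direct children."""
    _, duration = spans[span_id]
    child_time = sum(d for _, (p, d) in spans.items() if p == span_id)
    return duration - child_time

# Each span's share of the root span's total duration.
trace_total = spans["load"][1]
contribution = {sid: self_time(sid) / trace_total for sid in spans}
```

With these numbers, dom-load contributes about 2% while get-bounds contributes 77%, which is why the sorted table alone (where dom-load looked worst) would have sent you to the wrong place.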

The Trace view also includes metadata from the span to help you discover root causes.

Share Snapshots of Data

It’s obvious that latency in the get-bounds operation requires some investigation. It seems that the query "GET bounds WHERE region='usa'" is the problem, so you share the trace with that team to verify the hypothesis. When you share a view from Lightstep, you’re sharing a Snapshot of the data. Anyone using the link to view the trace will see the same data you did when you ran your query. They can go to Explorer and see exactly what you saw.

Visiting a shared Snapshot means Lightstep uses saved data rather than live data, so everyone’s on the same page.

On to the next improvement!

Interested in joining our team? See our open positions here.

October 2, 2019
3 min read
Monitoring


About the author

Robin Whitmore
