Lightstep from ServiceNow Logo






Data-Driven Hypotheses with Lightstep: A Step-by-Step Guide

It’s 4:00 a.m.

You get a notification from one of your VIP customers that your mobile app is slow. Needless to say, they are not happy.

There were a lot of deployments yesterday that included a change to the iOS client, but almost anything in the stack could be causing the issue.

With Lightstep, in four quick steps, you’re able to come up with a data-driven hypothesis. The team is working on the fix and will have it out before lunch. As a bonus, you found an issue deep in the stack that you didn’t even know was there!

Let’s go back in time to see how you and your team handled it. (Cue dream music …)

Start at the Beginning

You believe the iOS client is where the trouble is, so you start by running a query on that service. Sure enough, there are a number of spans showing high latency. When you overlay how the same service looked 24 hours ago, you can clearly see there’s a problem.

Lightstep Explorer View - The overlaid blue line shows the change from the same time period one day ago.

Step One: Complete. You've verified the problem. Now, what’s causing it?

Narrow and Correlate to Find the Culprit

There’s a lot of span info to look through in the Trace Analysis table. Since you’re only concerned with spans showing high latency, you filter to see data only for those in the 95th percentile and above.

Lightstep Latency Histogram P95 - Concentrate the results on only the span data you care about.
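If you want a feel for what "95th percentile and above" means, here is a minimal Python sketch. It is not Lightstep's implementation or API; the span dictionaries, the `duration_ms` field, and the nearest-rank percentile method are all assumptions made for illustration.

```python
def p95_threshold(durations_ms):
    """Return the nearest-rank 95th percentile of a list of durations."""
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    # Nearest-rank method: rank = ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slowest_spans(spans):
    """Keep only the spans at or above the P95 latency threshold."""
    threshold = p95_threshold([s["duration_ms"] for s in spans])
    return [s for s in spans if s["duration_ms"] >= threshold]


# Hypothetical sample data: 20 spans with durations 10 ms, 20 ms, ..., 200 ms.
spans = [{"operation": f"op-{i}", "duration_ms": i * 10} for i in range(1, 21)]
threshold = p95_threshold([s["duration_ms"] for s in spans])
slow = slowest_spans(spans)
```

With this sample data the filter keeps only the two slowest spans, which is the same idea as dragging the histogram selection to the right-hand tail.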

Lightstep has this great feature called Correlations that helps you quickly find culprits causing latency. Lightstep analyzes the thousands of traces from your query, looking for patterns of latency in services, operations, and other metadata. Sure enough, it looks like the user-space-mapping operation on the api-server is contributing to 56% of the latency.

Lightstep looks at all span data and computes correlations to your specific query.
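To make the idea of correlating metadata with latency concrete, here is a rough Python sketch. It is not how Lightstep computes Correlations; it simply scores each tag by how much more often it appears in the slow spans than in all spans. The span shape, the `tags` field, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def tag_prevalence(spans):
    """Fraction of spans carrying each (key, value) tag pair."""
    counts = Counter()
    for span in spans:
        for item in span["tags"].items():
            counts[item] += 1
    return {item: n / len(spans) for item, n in counts.items()}


def rank_tags(all_spans, slow_spans):
    """Rank tags by how over-represented they are among the slow spans.

    Assumes slow_spans is a subset of all_spans.
    """
    overall = tag_prevalence(all_spans)
    slow = tag_prevalence(slow_spans)
    scores = {item: slow[item] - overall[item] for item in slow}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample: 4 user-space-mapping spans and 6 render spans.
mapping = {"tags": {"service": "api-server", "operation": "user-space-mapping"}}
render = {"tags": {"service": "ios-client", "operation": "render"}}
all_spans = [mapping] * 4 + [render] * 6
slow_spans = [mapping] * 3 + [render]  # the slow tail is mostly mapping spans

ranked = rank_tags(all_spans, slow_spans)
```

In this toy data, the api-server and user-space-mapping tags float to the top because they show up in 75% of the slow spans but only 40% of all spans.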

Step Two: Complete. You've found the cause with Correlations. Now, is the iOS client the only thing affected by this operation? Probably not.

Look Upstream and Downstream

You rerun your query to search for the user-space-mapping operation on the api-server. You open the Service Diagram to look up and down the stack from the service and see if there’s anything else contributing to latency or affected by the api-server. Sure enough, the api-server is surrounded by a large yellow halo signifying a bunch of latency.

It looks like the webapp may also be affected since it’s directly upstream. And hey - the authentication service has some errors (that red halo jumps right out at you)! You notify the team in charge of the authentication service that they have errors to fix.

Lightstep Service Diagram View - The api-server reports a large amount of latency and the auth-service’s red ring means errors.

Time to let Customer Support know that they may get some calls from web users too.

Step Three: Complete. You've found the latency contributors with the Service Diagram. Now, you want to view a trace to see every operation involved with the user-space-mapping, and to get a closer look at that operation itself.

Trace for the Win

You click on one of the spans to the left to jump right into a trace. The trace shows the critical path in yellow, and the user-space-mapping operation is a big part of it. Looking at all the contextual metadata on the right, you see in the logs that there’s an issue with the cache. It wasn’t the iOS service or even the code in the api-server. It was a network issue!

Lightstep shows rich metadata about the span.

Rich metadata about the span saves the day!
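For a sense of what a critical path is, here is a simplified Python sketch. It is not Lightstep's algorithm: it assumes a trace is a tree of spans and, at each level, follows the child that finishes last. The field names and every operation name except user-space-mapping are hypothetical.

```python
def critical_path(spans):
    """Walk from the root span, always following the child that ends last."""
    children = {}
    root = None
    for span in spans:
        if span["parent"] is None:
            root = span
        else:
            children.setdefault(span["parent"], []).append(span)

    path = []
    current = root
    while current is not None:
        path.append(current["operation"])
        kids = children.get(current["id"], [])
        # The child that finishes last is the one the parent is waiting on.
        current = max(kids, key=lambda s: s["end_ms"]) if kids else None
    return path


# Hypothetical trace: times are milliseconds from the start of the request.
trace = [
    {"id": 1, "parent": None, "operation": "ios-request", "start_ms": 0, "end_ms": 900},
    {"id": 2, "parent": 1, "operation": "api-handle", "start_ms": 50, "end_ms": 880},
    {"id": 3, "parent": 2, "operation": "auth-check", "start_ms": 60, "end_ms": 120},
    {"id": 4, "parent": 2, "operation": "user-space-mapping", "start_ms": 130, "end_ms": 860},
    {"id": 5, "parent": 4, "operation": "cache-lookup", "start_ms": 140, "end_ms": 850},
]

path = critical_path(trace)
```

Here the path runs through user-space-mapping and down into the cache lookup, while the quick auth-check span stays off it.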

Step Four: Complete. You've found the issue’s critical path. Now, you send a Snapshot of the issue to the IT department. Because it’s a Snapshot, that team will be seeing the same data you did, so there's no chance to pass the buck. The truth is in the data.


Not only did you come up with a very realistic data-driven hypothesis in less time than it takes to make a cup of coffee, you even unearthed errors that no one was aware of!

Coming soon: Part Two — find out how you can set up monitoring and alerts to make sure when the fix is pushed out, it actually solves the problem without introducing new ones.

Interested in joining our team? See our open positions here.

October 15, 2019
4 min read

About the author

Robin Whitmore