Data-Driven Hypotheses with LightStep: A Step-by-Step Guide
by Robin Whitmore
It’s 4:00 a.m.
You get a notification from one of your VIP customers that your mobile app is slow. Needless to say, they are not happy.
There were a lot of deployments yesterday that included a change to the iOS client, but almost anything in the stack could be causing the issue.
With LightStep, in four quick steps, you’re able to come up with a data-driven hypothesis. The team is working on the fix and will have it out before lunch. As a bonus, you found an issue deep in the stack that you didn’t even know was there!
Let’s go back in time to see how you and your team handled it. (Cue dream music …)
Start at the Beginning
You believe the iOS client is where the trouble is, so you start by running a query on that service. Sure enough, a number of spans show high latency. When you overlay how the same service looked 24 hours ago, the problem is obvious.
Step One: Complete. You've verified the problem. Now, what’s causing it?
Narrow and Correlate to Find the Culprit
There’s lots of span info to look through in the Trace Analysis table. Since you’re only concerned with spans showing high latency, you filter to see data only for those in the 95th percentile and above.
Concentrate the results on only the span data you care about.
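If you're curious what "95th percentile and above" means in practice, here is a minimal sketch in Python. It is not LightStep's implementation: the span dictionaries, the latency_ms field, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def slowest_spans(spans, pct=95):
    """Keep only spans whose latency is at or above the given percentile."""
    threshold = percentile([s["latency_ms"] for s in spans], pct)
    return [s for s in spans if s["latency_ms"] >= threshold]

# Hypothetical data: 100 spans with latencies of 1..100 ms.
spans = [{"operation": "example-op", "latency_ms": ms} for ms in range(1, 101)]
slow = slowest_spans(spans)
```

With this made-up data the threshold lands at 95 ms, so only the six slowest spans are left to look through.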
LightStep has this great feature called Correlations that helps you quickly find the culprits causing latency. LightStep analyzes the thousands of traces from your query, looking for patterns of latency in services, operations, and other metadata. Sure enough, it looks like the user-space-mapping operation on the api-server is contributing to 56% of the latency.
LightStep looks at all span data and computes correlations to your specific query.
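How Correlations scores things internally isn't covered here, but the general idea can be sketched. The example below assumes one simple approach: for each tag, compute the Pearson correlation between "this tag appears in the trace" and the trace's latency, then rank the tags. The trace structure, the tag strings, and the latency numbers are hypothetical, and the resulting score is a correlation coefficient, not the percentage contribution LightStep reports.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_correlations(traces):
    """Rank every tag by how strongly its presence tracks trace latency."""
    latencies = [t["latency_ms"] for t in traces]
    all_tags = set().union(*(t["tags"] for t in traces))
    scores = {}
    for tag in all_tags:
        present = [1.0 if tag in t["tags"] else 0.0 for t in traces]
        scores[tag] = pearson(present, latencies)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical traces; the numbers are invented for the example.
traces = [
    {"latency_ms": 900, "tags": {"service:api-server", "operation:user-space-mapping"}},
    {"latency_ms": 850, "tags": {"service:api-server", "operation:user-space-mapping"}},
    {"latency_ms": 120, "tags": {"service:api-server", "operation:get-profile"}},
    {"latency_ms": 100, "tags": {"service:api-server", "operation:get-profile"}},
]
ranked = rank_correlations(traces)
```

In this toy data the user-space-mapping tag rises to the top, while the service tag scores zero because it appears in every trace and so can't explain why some are slow.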
Step Two: Complete. You've found the cause with Correlations. Now, is the iOS client the only thing affected by this operation? Probably not.
Look Upstream and Downstream
You rerun your query to search for the user-space-mapping operation on the api-server. You open the Service Diagram to see what’s going on up and down the stack from the service, and to check whether anything else is contributing to latency or affected by the api-server. Sure enough, the api-server is surrounded by a large yellow halo signifying a bunch of latency.
It looks like the webapp may also be affected since it’s directly upstream. And hey - the authentication service has some errors (that red halo jumps right out at you)! You notify the team in charge of the authentication service that they have errors to fix.
The api-server reports a large amount of latency and the auth-service’s red ring means errors.
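Looking upstream and downstream is, at heart, a walk over a graph of which service calls which. Here is a rough sketch of that idea, assuming a hand-written call map. The edges below are hypothetical (in particular, the api-server calling the auth-service is an assumption for the example), and in practice the Service Diagram builds this picture for you from trace data.

```python
from collections import deque

def reachable(edges, start):
    """Breadth-first walk: every service reachable from start by following edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def invert(edges):
    """Flip caller -> callee edges into callee -> caller edges."""
    inverted = {}
    for caller, callees in edges.items():
        for callee in callees:
            inverted.setdefault(callee, []).append(caller)
    return inverted

# Hypothetical call map: caller -> services it calls.
calls = {
    "ios-client": ["api-server"],
    "webapp": ["api-server"],
    "api-server": ["auth-service"],
}
downstream = reachable(calls, "api-server")        # services the api-server depends on
upstream = reachable(invert(calls), "api-server")  # services that depend on the api-server
```

Anything in the upstream set can feel the api-server's latency, which is why the webapp shows up as a likely casualty alongside the iOS client.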
Time to let Customer Support know that they may get some calls from web users too.
Step Three: Complete. You've found the latency contributors with the Service Diagram. Now, you want to view a trace to see everything involved in the user-space-mapping operation and to get a closer look at the operation itself.
Trace for the Win
You click on one of the spans to the left to jump right into a trace. The trace shows the critical path in yellow, and the user-space-mapping operation is a big part of it. Looking at all the contextual metadata on the right, you see in the logs that there’s an issue with the cache. It wasn’t the iOS service or even the code in the api-server. It was a network issue!
Rich metadata about the span saves the day!
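The critical path is the chain of spans that actually determines how long the request takes. One common way to find it, sketched below, is to start at the end of the root span and walk backward: take the child that finishes last, recurse into it, then continue with children that finished before it started. This is an illustrative simplification rather than LightStep's algorithm, and the span names and timings are made up.

```python
def critical_path(span):
    """Return the names of the spans on the critical path, starting with the given span."""
    path = [span["name"]]
    cursor = span["end"]
    # Visit children from latest-finishing to earliest-finishing.
    children = sorted(span.get("children", []), key=lambda c: c["end"], reverse=True)
    for child in children:
        # A child is on the path only if it finished before the point we are waiting on.
        if child["end"] <= cursor:
            path.extend(critical_path(child))
            cursor = child["start"]
    return path

# Hypothetical trace; times are in milliseconds from the start of the request.
trace = {
    "name": "ios-client/request", "start": 0, "end": 100,
    "children": [
        {"name": "auth-service/check", "start": 10, "end": 30},
        {
            "name": "api-server/user-space-mapping", "start": 10, "end": 90,
            "children": [{"name": "cache/lookup", "start": 20, "end": 85}],
        },
    ],
}
path = critical_path(trace)
```

Here the auth check runs in parallel with user-space-mapping and finishes early, so it stays off the critical path. Speeding it up wouldn't make the request any faster, which is why the yellow path is where you look first.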
Step Four: Complete. You've found the issue’s critical path. Now, you send a Snapshot of the trace to the IT department. Because it’s a Snapshot, that team will see the same data you did, so there’s no chance to pass the buck. The truth is in the data.
Not only did you come up with a very realistic data-driven hypothesis in less time than it takes to make a cup of coffee, you even unearthed errors that no one was aware of!
Coming soon: Part Two — find out how you can set up monitoring and alerts to make sure when the fix is pushed out, it actually solves the problem without introducing new ones.