Introducing Snapshot Analyzer: Interactive Investigations for Deep Systems
by Karthik Kumar
Distributed tracing generates a stream of rich, contextual data. But as systems grow in scale and complexity, it can be challenging to navigate through thousands of traces to quickly find answers to performance questions.
When things are on fire, how do you know if your hypothesis is a good one?
To help answer that question, we built Snapshot Analyzer. It’s a simple way to investigate cross-service performance at scale.
With Snapshot Analyzer, you can filter comprehensive views of complete system behavior (Snapshots) across any dimension in your system, and group cross-service traces by any attribute.
Think of it as being able to perform SQL-like operations on large amounts of trace data.
When you’re investigating an issue in a complex or deep system, it can be difficult to narrow the scope to whatever is the most likely culprit.
Tracing provides context for end-to-end requests, which often include multiple services. With Snapshot Analyzer, you can filter these traces by one or more services, operations, or tags and focus your analysis on only the traces relevant to your investigation.
In the above gif, we currently have a Snapshot of the most recent traces from our system. From here, we can filter by the tags error:true and canary:true to find only the traces that are returning errors after a canary release.
Digging into a single trace, we can quickly glance at the right logs and find the issue! Additionally, the filter suggestions themselves are scoped to only the ones matching the provided filter. This allows you to perform flexible, exploratory analysis on complex tracing data.
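To make the filtering idea concrete, here is a minimal Python sketch of tag-based filtering over a handful of traces. It is only an illustration of the concept, not how Snapshot Analyzer is implemented: the trace records, the `trace_id` field, and the `filter_traces` helper are all hypothetical.

```python
from typing import Dict, List

# Hypothetical in-memory traces: each one carries a flat dict of string tags.
Trace = Dict[str, object]

traces: List[Trace] = [
    {"trace_id": "a1", "tags": {"error": "true", "canary": "true"}},
    {"trace_id": "b2", "tags": {"error": "true", "canary": "false"}},
    {"trace_id": "c3", "tags": {"error": "false", "canary": "true"}},
    {"trace_id": "d4", "tags": {"canary": "true"}},  # no error tag at all
]


def filter_traces(traces: List[Trace], **required_tags: str) -> List[Trace]:
    """Keep only the traces whose tags match every required key/value pair."""
    return [
        t
        for t in traces
        if all(t["tags"].get(key) == value for key, value in required_tags.items())
    ]


# Equivalent in spirit to filtering by error:true and canary:true.
matching = filter_traces(traces, error="true", canary="true")
matching_ids = [t["trace_id"] for t in matching]
print(matching_ids)
```

Each additional tag narrows the result set further, which is the same effect as stacking filters in the UI.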
In certain (scary) situations, you may be notified of an issue by your customer, and you don’t know where to begin investigating.
Fear not! Backed by Correlations, Snapshot Analyzer can help you reduce distractions from red herrings and speed up root cause identification.
Correlations automatically surfaces attributes that are associated with latency. With Snapshot Analyzer, you can dive deeper into the insights provided by Correlations, add additional filters, and focus your investigation on rapid hypothesis generation and validation.
Snapshot Analyzer also allows you to group traces that share a certain attribute and compare performance characteristics across groups.
So, how does this work?
In the example below, we’re investigating what we initially guess is a database issue. We start by filtering the set of traces down to only those that have a db.type=cassandra tag. We then group these traces by region to see aggregate statistics across both regions, including us-west-1. The differences in error percentage and average latency tell us that the issue is actually region-specific. We can dig into a trace in this region to get the context we need to mitigate the issue. This ability to group traces by a tag of any cardinality is invaluable for quickly corroborating or eliminating hypotheses.
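The filter-then-group workflow described above is roughly what a SQL `WHERE ... GROUP BY` would do. The Python sketch below illustrates it under stated assumptions: the trace records, the latency numbers, the `group_stats` helper, and the placeholder region name `region-b` are all made up for the example and do not come from Snapshot Analyzer.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical traces; "region-b" is a placeholder name, not a real region.
traces = [
    {"tags": {"db.type": "cassandra", "region": "us-west-1"}, "latency_ms": 900.0, "error": True},
    {"tags": {"db.type": "cassandra", "region": "us-west-1"}, "latency_ms": 700.0, "error": False},
    {"tags": {"db.type": "cassandra", "region": "region-b"}, "latency_ms": 100.0, "error": False},
    {"tags": {"db.type": "cassandra", "region": "region-b"}, "latency_ms": 140.0, "error": False},
    {"tags": {"db.type": "mysql", "region": "us-west-1"}, "latency_ms": 50.0, "error": False},
]


def group_stats(
    traces: List[dict], filter_tag: str, filter_value: str, group_by: str
) -> Dict[str, Dict[str, float]]:
    """Filter traces on one tag, then aggregate per value of another tag."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for trace in traces:
        if trace["tags"].get(filter_tag) == filter_value:
            groups[trace["tags"].get(group_by, "(untagged)")].append(trace)

    stats: Dict[str, Dict[str, float]] = {}
    for key, members in groups.items():
        count = len(members)
        stats[key] = {
            "count": count,
            "error_pct": 100.0 * sum(1 for m in members if m["error"]) / count,
            "avg_latency_ms": sum(m["latency_ms"] for m in members) / count,
        }
    return stats


# Roughly: SELECT region, ... WHERE db.type = 'cassandra' GROUP BY region
stats = group_stats(traces, "db.type", "cassandra", "region")
for region, row in sorted(stats.items()):
    print(region, row)
```

With this toy data, the us-west-1 group shows a much higher error percentage and average latency than the other group, which is the kind of gap that points to a region-specific problem rather than a database-wide one.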
Snapshot Analyzer provides the ability to view additional contextual information across traces. The Add Column feature allows the user to view the value for any tag in the Trace Analysis table. Having this information next to the individual spans helps identify patterns and anomalies in your system. It can expedite hypothesis generation when performing root cause analysis by helping you narrow down the trace search space.
Go to lightstep.com/play and use Snapshot Analyzer to resolve a performance regression in under 10 minutes.