Finding new investigative routes with Change Intelligence
by Rakesh Patel
- Traffic spike > Investigate
- Rule out 1st potential cause
- Rule out 2nd potential cause
- Run out of ideas > grab a snack
- Ask Lightstep
- Lightstep suggests biggest changes in traces
It’s a familiar scenario: an alert triggers or a spike shows up on a dashboard, and you click into your observability (or monitoring) solution of choice to begin investigating. You have a bunch of questions: What’s the blast radius? What part of the system is causing the issue? Can I take a quick remediation action? Hopefully, there are some intelligent workflows that let you ask and answer these questions, so you can triage and remediate quickly. But once an issue is resolved, how do you ensure it doesn’t happen again? Your remediation fixed the issue in the short term and triage identified the likely area, but you haven’t begun investigating the root cause because you’ve spent the last hour (or longer) getting the system back to health.
In 2021, Lightstep announced Change Intelligence. Change Intelligence automatically opens up new investigative pathways for experts and novices alike. When a spike occurred (or an alert triggered), you had an option within Lightstep to quickly and confidently narrow down where to look. Is it upstream or downstream? Is it your service, a service you depend on, or a service owned by another team? Is it a particular customer who doubled their usual workload? Change Intelligence gave teams a way to quickly find differences between a baseline set of traces and a set of traces from a deviation. Today, we’re announcing that we’ve made it even better (obviously) and more accessible to every user than ever before.
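To make the baseline-versus-deviation idea concrete, here is a minimal sketch of one way such a comparison could work. It is an illustration only, not Lightstep’s actual algorithm: the span shape (a plain dict of attributes) and the function names `attribute_shares` and `rank_differences` are assumptions made up for this example.

```python
from collections import Counter


def attribute_shares(spans):
    """Return the fraction of spans that carry each (key, value) attribute pair.

    `spans` is a list of dicts; this flat shape is an assumption for the sketch.
    """
    counts = Counter()
    for span in spans:
        for key, value in span.items():
            counts[(key, value)] += 1
    total = len(spans)
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def rank_differences(baseline, deviation):
    """Rank attribute pairs by how much their share shifted in the deviation.

    Each row is (pair, baseline_share, deviation_share, change), sorted with
    the largest absolute change first.
    """
    base = attribute_shares(baseline)
    dev = attribute_shares(deviation)
    rows = []
    for pair in set(base) | set(dev):
        b = base.get(pair, 0.0)
        d = dev.get(pair, 0.0)
        rows.append((pair, b, d, d - b))
    # Tie-break on the pair's string form so the order is deterministic.
    rows.sort(key=lambda row: (-abs(row[3]), str(row[0])))
    return rows


if __name__ == "__main__":
    # Hypothetical data: customer "a" dominates traffic during the spike.
    baseline = [{"service": "api", "customer": c} for c in "abcd"]
    deviation = [{"service": "api", "customer": c} for c in "aaab"]
    for pair, b, d, change in rank_differences(baseline, deviation):
        print(pair, f"baseline={b:.2f}", f"deviation={d:.2f}", f"change={change:+.2f}")
```

In this toy data, the attribute whose share grows most during the deviation rises to the top of the list, which is the kind of signal that answers “is it a particular customer who doubled their usual workload?”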
A fundamental question asked of Observability platforms is “What happened and why?”
A common scenario we’ve encountered, internally and with customers, is hitting a brick wall in an investigation: a dead end. With today’s announcement, you can access Change Intelligence within Lightstep Notebooks by clicking the “Analyze Deviation” button, which instantly generates trace-based throughput correlations across your system to help move your investigation forward when you get stuck. These correlations help you understand changes in your services’ health and, most importantly, what might have caused those changes.
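As a rough illustration of what a throughput correlation ranked by strength could look like, the sketch below scores how closely each group of spans tracks a metric over the same time buckets. It uses a plain Pearson correlation as a stand-in; the source does not say how Lightstep computes its correlations, and the names `pearson`, `rank_throughput_correlations`, and the sample series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_throughput_correlations(metric, throughput_by_group):
    """Score each group's span throughput against the metric, strongest first.

    `throughput_by_group` maps a label (an assumed naming scheme) to spans per
    time bucket, aligned with the buckets of `metric`.
    """
    scored = [
        (name, pearson(metric, series))
        for name, series in throughput_by_group.items()
    ]
    scored.sort(key=lambda item: (-abs(item[1]), item[0]))
    return scored


if __name__ == "__main__":
    # Hypothetical per-minute values: the metric spikes in the last two buckets.
    metric = [10, 10, 10, 50, 60]
    throughput = {
        "customer=a": [1, 1, 1, 5, 6],
        "customer=b": [3, 3, 3, 3, 3],
        "operation=search": [5, 4, 5, 4, 5],
    }
    for name, score in rank_throughput_correlations(metric, throughput):
        print(f"{name}: {score:+.2f}")
```

Ranking by absolute value keeps strong negative correlations near the top as well, since a group whose traffic drops while the metric spikes can be just as useful a lead.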
Observability is all about asking questions of your data, and Notebooks is a powerful ally for any observability practitioner. Notebooks lets you form strong hypotheses across all your metrics and traces. It also lets you query the most interesting and useful data to understand customer experience (whether that’s debugging p50 performance regressions or tracking down the root causes of a once-in-a-blue-moon error), and to do so collaboratively. One challenge with debugging massively distributed systems remains: what happens when you’re not the expert on the service you’re investigating? Enter Change Intelligence (again, obviously!).
Modern distributed systems are complex, and sometimes it’s not obvious how different parts of the system are connected. With Change Intelligence, any developer, operator, or SRE can reason about the system as if they were an expert. Moreover, you can bring any chart or insight generated by Change Intelligence back into your Notebook to complete the narrative for your postmortem. This is just one way that Notebooks addresses the needs that arise over the course of a team’s troubleshooting journey: it provides granular, context-specific data and makes it easy to collaborate on resolving issues in real time, without breaking your investigative flow. Analysis in context reduces mean time to resolution (MTTR) and drives proactive performance improvements.
Lightstep Notebooks will quickly become your favorite tool for troubleshooting, debugging, investigating, and optimizing anything. It brings together throughput correlations on span charts, a simplified way to select a deviation on a chart and access Change Intelligence, and a comprehensive list of correlations across the system, ranked by strength, with the underlying data a single click away. With all of that in one place, you’ll never run into a dead end in your investigation again.