Announcing Notebooks: collaborate and troubleshoot issues with metrics and traces together
by Rakesh Patel
Think back to the last time your team experienced a high-severity production incident. It may not have been obvious where the issue was, and it may have taken several tries to figure out how to successfully mitigate it.
Meanwhile, the folks who needed to weigh in on the incident might have been distributed across several time zones, requiring you to share your findings across multiple teams and over the course of hours or even days. This can bring your investigation, and your mitigation, to a screeching halt and expose a wide variety of problems.
Problem 1: Your observability tool’s full-data-retention window may have elapsed before your investigation finished, and you might not have set it up to save the data you needed – data that might have contained crucial insights into the source of the problem.
Problem 2: The valuable insights that helped move your investigation forward may have been in both metric and trace data, so you may have needed to switch between tabs of the same product, or even across different products, to search for and analyze this data.
Problem 3: Sometimes you might just have been stumped along the way and needed some help figuring out your next step or your next hypothesis to test, but your existing tools’ correlation and insight features were just turning up noise.
Lastly, once you finished your investigation, you may have wanted to share your investigative steps and insights in a postmortem, but found it hard to tell that narrative.
We’re excited to announce a new feature in Lightstep Observability called Notebooks, which addresses the needs that arise throughout the course of a team’s troubleshooting journey. For example:
- Make educated guesses - An anomaly occurs, and you have a hunch about what part of the system to investigate first, but limited data to support that hunch. With Notebooks, you can form hypotheses by creating and visualizing ad-hoc charts about your system, whether the underlying data is metrics or traces. Truly unified observability!
- An hour was never enough - A few minutes into the issue, you realize it’s going to take a bit longer to investigate. With Notebooks, you can look back at the system with 100% trace data retention for 3 days.
- Solve it together - You can share a link to your Notebook with your colleagues so they can immediately understand your investigative steps so far and suggest next steps to take.
- Modern distributed systems are complex - Sometimes it’s not obvious where the connections are between different parts of the system. With Notebooks, you can generate powerful trace-based correlations across your system that can help you move your investigation forward if you get stuck.
- Crisis averted - The anomaly has now been mitigated. With Notebooks, you can take the charts and insights you generated throughout your investigation into a postmortem days or weeks later so your team can discuss what happened and take long-term steps to prevent it from happening again.
Improve your postmortems with granular, context-specific data
With Notebooks, here’s what an investigation can look like after you spot a spike or a change in your data on a dashboard:
1. Expand the anomalous Dashboard chart, and open it in a Notebook by clicking “Add to notebook” at the top of the chart.
2. Welcome to Notebooks, where it’s a lot easier to run ad-hoc queries across both metric and trace data to test your hypotheses and learn more about the anomaly. The last 3 days of trace data are available to query, as are the last 13 months of metric data.
3. In Notebooks, you can leverage Change Intelligence on Notebook charts to generate hypotheses about where the issue might be, and how to mitigate it.
4. Once you’ve narrowed down the list of possible causes, you can save and share your Notebook with your colleagues so they can follow your investigative steps and help you out with your investigation.
5. Finally, you can save your Notebook as a snapshot to memorialize your investigative steps for use in a postmortem (and don’t worry if you forget to do this – we do it automatically too!).
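To make the retention windows from step 2 concrete, here is a minimal Python sketch that checks whether an event is still covered by the last 3 days of trace data and the last 13 months of metric data. It is not part of Lightstep or its API: the helper name `queryable_data`, the 30-day month approximation, and the example timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the post; treating a month as 30 days is a
# simplifying assumption for this sketch.
TRACE_RETENTION = timedelta(days=3)
METRIC_RETENTION = timedelta(days=13 * 30)


def queryable_data(event_time: datetime, now: datetime) -> dict:
    """Report which data types still cover event_time (hypothetical helper)."""
    age = now - event_time
    return {
        "traces": timedelta(0) <= age <= TRACE_RETENTION,
        "metrics": timedelta(0) <= age <= METRIC_RETENTION,
    }


# Arbitrary example timestamps, chosen only to exercise both cases.
now = datetime(2022, 6, 10, 12, 0, tzinfo=timezone.utc)
recent = queryable_data(now - timedelta(hours=30), now)
older = queryable_data(now - timedelta(days=5), now)
print(recent)
print(older)
```

In this sketch, an anomaly from 30 hours ago is covered by both traces and metrics, while one from 5 days ago is covered by metrics only, which is the kind of case where a saved Notebook snapshot would matter for the postmortem.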
Collaborate to resolve issues in real time
We believe Notebooks are an indispensable tool for engineering teams working together to investigate, mitigate, and resolve incidents. We’re constantly working on even more ways for you to use Notebooks during the course of an investigation!
Lightstep Notebooks enable DevOps and SRE teams to run ad-hoc queries while investigating an incident or proactively optimizing their application. Notebooks, which leverage Change Intelligence, allow any developer, operator, or SRE to instantly understand changes in their service’s health and – most importantly – what caused those changes. Teams can quickly share and collaborate on findings for faster incident resolution and proactive optimization. This is critical when investigating an incident, collaborating across teams, and quickly documenting learnings via Notebooks to share across the organization.