Figure out what happened, and why - with postmortems
by Darius Koohmarey
When an issue occurs within an environment, resolving it usually moves through several stages, and each stage generally comes with many competing priorities:
- Detect - What exactly is happening
- Investigate - Where is it happening
- Triage - Why is it happening
- Mitigate - How can we restore service
- Resolve / root cause - What caused it to happen, and how can we stop it from happening again
The last thing you want to worry about while an issue is still unfolding is the reporting you will have to do after you resolve it.
Postmortems are a great way to ensure all of your stakeholders know what happened, why it happened, how it was resolved, and what the plans are for preventing it from happening again. Lightstep recently announced several new features for our Notebooks offering, which can serve teams both as an investigative tool and as a postmortem tool: a record of the investigative steps you took to identify the root cause of an issue. Within Lightstep Incident Response, we are excited to announce a new feature that automatically generates a postmortem describing how your team responded to an incident.
Through the Lightstep platform, your teams can combine the investigative flow of an issue (the Notebook postmortem) with the steps your team took to resolve it. This gives the business a complete end-to-end view of how its teams responded during an incident or outage, and presents the findings in an easy-to-consume format that includes images, text, and charts.
Within Lightstep Incident Response, once an incident is resolved, the focus of the team turns to the postmortem to figure out what happened, why it happened, and what they are going to do to prevent it from happening again. The postmortem is organized into the following sections:
- Autopopulated exec summary - A pre-populated section of the postmortem that summarizes what went wrong, when it was identified, and when (and how) it was resolved.
- Customer service impact - Who was affected and why. Lightstep offers suggestions for what information to include, such as whether the issue affected all customers on the apps or only specific geographies like Europe or North America. Clear service definitions on incidents make it clear who is impacted.
- Incident timeline - This section is pre-populated by Lightstep, and you can add information to it or update the timeline entries populated by the system.
- Detailed summary - In contrast to the executive summary, the detailed summary is designed to be shared with your technical colleagues. Here you can share screenshots, code snippets, or additional troubleshooting steps that you used to resolve the issue.
- Action items (actions to keep the root cause from recurring) - This is arguably one of the most important sections, because the fastest way to resolve an issue is to prevent it from happening in the first place.
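To make the section layout above concrete, here is a minimal Python sketch of a postmortem record with those five sections. It is purely illustrative and is not Lightstep's data model or API: the `Postmortem` and `TimelineEntry` classes, their field names, and the `missing_sections` and `to_text` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One event in the incident timeline (hypothetical structure)."""
    timestamp: datetime
    description: str


@dataclass
class Postmortem:
    """Hypothetical record mirroring the five sections listed above."""
    executive_summary: str
    customer_impact: str
    detailed_summary: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def add_timeline_entry(self, timestamp: datetime, description: str) -> None:
        # Responders may add events after the fact, so keep entries in time order.
        self.timeline.append(TimelineEntry(timestamp, description))
        self.timeline.sort(key=lambda entry: entry.timestamp)

    def missing_sections(self) -> List[str]:
        # Report which sections are still empty before the review meeting.
        sections = {
            "executive_summary": self.executive_summary.strip(),
            "customer_impact": self.customer_impact.strip(),
            "timeline": self.timeline,
            "detailed_summary": self.detailed_summary.strip(),
            "action_items": self.action_items,
        }
        return [name for name, value in sections.items() if not value]

    def to_text(self) -> str:
        # Render a plain-text version suitable for sharing or archiving.
        lines = ["Executive summary", self.executive_summary, "",
                 "Customer service impact", self.customer_impact, "",
                 "Incident timeline"]
        lines += [f"{e.timestamp.isoformat()} {e.description}" for e in self.timeline]
        lines += ["", "Detailed summary", self.detailed_summary, "", "Action items"]
        lines += [f"- {item}" for item in self.action_items]
        return "\n".join(lines)
```

A check like `missing_sections()` reflects the point made in the last bullet: a postmortem with no action items is incomplete, even when every other section is filled in.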
Lightstep’s latest update to its Incident Response platform lets teams share information in the context of an incident and kick off live collaboration channels where all of the responders on that incident can discuss what happened. Live meetings to complete the postmortem are key to gathering every responder's feedback on the diagnosis and remediation activities, and to reaching agreement on the best action items for prevention. This helps your teams improve, so that when an incident happens again (spoiler: it will), you can be prepared for it.
Finally, with postmortems, you can build a library of knowledge that shares findings with the broader team for review and speeds up the incident response process for similar issues in the future. Postmortems can be downloaded as a flat file or shared as a link for review. They also remain available on historical incidents for later insight.
As every organization works to improve the availability and performance of its services, the reality is that outages and human error are inevitable, even for the largest, most robust organizations. While everyone acts to the best of their knowledge, unforeseen effects of planned and unplanned changes are common. Postmortem processes ensure you are continually strengthening and improving the resilience of your services and systems. A culture of continual improvement that allows these open discussions to happen in a blameless fashion is key. After all, the fastest way to resolve an incident is to prevent it from happening entirely. So the next time an outage strikes, ensure your team is just as concerned with completing a postmortem as they are with resolving the issue.