The Anatomy of Three Incidents: Randy Shoup on Outages and Postmortems
by Lindsay Neeson
- three system-wide outages from his time at Google, Stitch Fix, and WeWork
- the first question your team should ask when an outage happens
- how to properly run a postmortem
Having worked through system-wide outages at large companies, Randy has refined his postmortem process into a repeatable system. Let’s look at the individual steps first:
- Schedule a postmortem with all team leads.
- Set aside a few hours in a room (or on Zoom) to whiteboard reliability issues.
- Bucket each issue brought to the table into a theme.
- Assign each theme to a senior engineer to investigate. There should be a one-week timeline here to gather data, detail the issues, and recommend the next steps.
- Once the week is complete, bring the team back together to review the issues, prioritize them by ROI, and flag those that need to be remediated immediately.
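The bucketing and prioritization steps above can be sketched in a few lines of Python. This is only an illustration, not Randy’s tooling: the `Issue` fields, the example issues, and the definition of ROI as impact divided by effort are all assumptions made for the sketch, and the scores would in practice come from the week of investigation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    theme: str            # bucket assigned during the whiteboard session
    impact: float         # hypothetical score: expected reliability benefit
    effort: float         # hypothetical score: engineer-weeks to fix (must be > 0)
    urgent: bool = False  # True if it needs to be remediated immediately

    @property
    def roi(self) -> float:
        # Assumed ROI definition: benefit per unit of effort.
        return self.impact / self.effort


def bucket_by_theme(issues):
    """Group issues by theme so each theme can go to one senior engineer."""
    themes = defaultdict(list)
    for issue in issues:
        themes[issue.theme].append(issue)
    return dict(themes)


def prioritize(issues):
    """Put urgent issues first, then order the rest by descending ROI."""
    return sorted(issues, key=lambda i: (not i.urgent, -i.roi))


# Made-up issues for illustration only.
issues = [
    Issue("No alert on queue depth", "monitoring", impact=8, effort=2),
    Issue("Single database primary", "redundancy", impact=9, effort=6, urgent=True),
    Issue("Dashboards missing error rates", "monitoring", impact=5, effort=1),
]

themes = bucket_by_theme(issues)
ranked = prioritize(issues)
for issue in ranked:
    print(f"{issue.title}: theme={issue.theme}, roi={issue.roi:.1f}, urgent={issue.urgent}")
```

Sorting on the pair `(not urgent, -roi)` keeps the "remediate immediately" items at the top regardless of their ROI, which mirrors the two criteria in the final review step.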
When moving through the postmortem process above, the most important thing to remember is to keep team interactions healthy. Your team members should be able to bring up areas where they could have done better, or where a system wasn’t working as intended, without feeling judged or blamed. This is what psychological safety looks like in practice.
As Randy mentions, “The most important thing beyond anything else is psychological safety, which is the idea that team members feel safe to take risks and be vulnerable in front of each other.” The two main components here are creating a safe environment and increasing inclusiveness and diversity. An inclusive, safe environment has been shown to increase efficiency and enhance decision-making.
Looking at the postmortem as a whole, it’s important to identify not only the root cause but also the factors that contributed to the incident and the factors that made it difficult to detect, diagnose, mitigate, or remediate. Teams should keep in mind that there’s typically not one answer: multiple factors may have added up to cause the issue. It’s important to stay open-minded and, at times, to think outside the box.
After looking at the issues and the factor(s) that led to the incident, make sure your team leaves with concrete action items and takeaways. Outage postmortems are only helpful if your team lead or executive sponsor (or you) ensures that the next steps are properly prioritized, clearly actionable, and followed through. Team leads should rank the issues found in the outage not only against each other but also against ongoing work.
When you and your team are able to put these key components together, you’ll be able to effectively and efficiently diagnose an outage, create an actionable plan, and improve team interactions. Although we would never wish an outage on anyone, we hope you find this blog and Randy’s full interview helpful.
If you’re interested in the 99 Percent Visible series, keep an eye out for upcoming sessions!