As part of our 99 Percent Visible series, we had a conversation with Randy Shoup, VP of Engineering and Chief Architect at eBay. During this session, Randy discussed:
three system-wide outages from his time at Google, Stitch Fix, and WeWork
the first question your team should ask when an outage happens
how to properly run a postmortem
Having worked through system-wide outages at several large companies, Randy has refined his postmortem process. Let’s look at the individual steps first:
Schedule a postmortem with all team leads.
Set aside a few hours in a room (or on Zoom) to whiteboard reliability issues.
Bucket each issue brought to the table into individual themes.
Assign each theme to a senior engineer to investigate. There should be a one-week timeline here to gather data, detail the issues, and recommend the next steps.
Once the week is complete, bring the team back together to review the issues and prioritize them by ROI, flagging any that need to be remediated immediately.
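To make the bucketing, assignment, and prioritization steps above concrete, here is a minimal Python sketch. It is an illustration only, not Randy's tooling: the Issue fields, the benefit and cost scores, the round-robin owner assignment, and the engineer names are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    theme: str               # bucket chosen during the whiteboarding session
    est_benefit: float       # hypothetical score, e.g. incident-hours avoided
    est_cost: float          # hypothetical score, e.g. engineer-weeks (must be > 0)
    immediate: bool = False  # True if it must be remediated right away

    @property
    def roi(self) -> float:
        return self.est_benefit / self.est_cost


def bucket_by_theme(issues):
    """Group the issues raised in the session into themes."""
    themes = defaultdict(list)
    for issue in issues:
        themes[issue.theme].append(issue)
    return dict(themes)


def assign_owners(themes, senior_engineers):
    """Give each theme one owner, cycling through the available engineers."""
    return {
        theme: senior_engineers[n % len(senior_engineers)]
        for n, theme in enumerate(sorted(themes))
    }


def prioritize(issues):
    """Immediate fixes first, then everything else by descending ROI."""
    return sorted(issues, key=lambda issue: (not issue.immediate, -issue.roi))


# Hypothetical issues from a whiteboarding session.
issues = [
    Issue("No alert on replica lag", "monitoring", est_benefit=8, est_cost=2),
    Issue("Failover runbook is stale", "runbooks", est_benefit=6, est_cost=1,
          immediate=True),
    Issue("No dashboard for queue depth", "monitoring", est_benefit=3, est_cost=3),
]

themes = bucket_by_theme(issues)
owners = assign_owners(themes, ["alice", "bob"])
ranked = prioritize(issues)

for issue in ranked:
    print(f"{issue.title} (theme: {issue.theme}, owner: {owners[issue.theme]})")
```

In practice the scores would come out of each owner's one-week investigation, and the ranking is a starting point for the review discussion rather than a final answer.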
Remove all blame
When moving through the postmortem process above, the most important thing to remember is to keep team interactions healthy. Your team members should be able to bring up areas where they could have done better, or cases where a system wasn’t working as intended, without feeling as though they are being judged or blamed. This is what psychological safety is about.
As Randy mentions, “The most important thing beyond anything else is psychological safety, which is the idea that team members feel safe to take risks and be vulnerable in front of each other.” The two main components here are creating a safe environment and increasing inclusiveness and diversity. When team members are in an inclusive, safe environment, it’s shown to increase efficiency and enhance decision-making.
There could be more than one answer
Looking at the postmortem as a whole, it’s important to identify not only the root cause but also the factors that contributed to the incident and the factors that made it difficult to detect, diagnose, mitigate, or remediate. Teams should keep in mind that there’s typically not one answer: multiple factors may have added up to the issue. It’s important to stay open-minded and, at times, to think outside the box.
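One way to keep a postmortem from collapsing into a single root cause is to record factors per category. The sketch below is a hypothetical data structure, not something described in the interview: the Phase names mirror the categories in the paragraph above, and the Postmortem class and example incident are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    CAUSE = "contributed to the incident"
    DETECT = "made it hard to detect"
    DIAGNOSE = "made it hard to diagnose"
    MITIGATE = "made it hard to mitigate"
    REMEDIATE = "made it hard to remediate"


@dataclass
class Postmortem:
    summary: str
    # One list of factors per phase, so several causes can sit side by side.
    factors: dict = field(default_factory=lambda: {p: [] for p in Phase})

    def add_factor(self, phase, description):
        self.factors[phase].append(description)

    def unexamined_phases(self):
        """Phases nobody has recorded a factor for yet."""
        return [p for p in Phase if not self.factors[p]]


# Hypothetical incident used only to exercise the structure.
pm = Postmortem("Checkout unavailable")
pm.add_factor(Phase.CAUSE, "Config change shipped without canary")
pm.add_factor(Phase.CAUSE, "Connection pool sized for old traffic levels")
pm.add_factor(Phase.DETECT, "Alert threshold set above the failure rate")

for phase in pm.unexamined_phases():
    print(f"Still to discuss: what {phase.value}?")
```

Listing the phases with no recorded factors gives the facilitator a prompt for what the group has not yet discussed.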
After reviewing the issues and the factor(s) that led to the outage, make sure your team leaves with concrete action items and takeaways. Outage postmortems are only helpful if your team lead or executive sponsor (or you) ensures that the next steps are properly prioritized, clearly actionable, and followed through. When prioritizing, team leads should weigh the issues found in the outage not only against each other but also against ongoing work.
Recipe for team success
When you and your team are able to put these key components together, you’ll be able to effectively and efficiently diagnose an outage, create an actionable plan, and improve team interactions. Although we would never wish an outage on anyone, we hope you find this blog and Randy’s full interview helpful.
If you’re interested in the 99 Percent Visible series, keep an eye out for upcoming sessions!
June 24, 2021
About the author
Lindsay Neeson