Getting Started with Incident Response
by Michelle Ho
There’s been a lot of chatter around incident response these last few years, so much so that it can be overwhelming. Mature teams can now choose from a whole ecosystem of tools, dashboards, alerting systems, paging systems, root cause detection systems, etc. But what if you’re a young team that’s just starting to standardize your incident response? What does an MVP incident response process look like? We talked to Kevin Riggle, an incident response expert who has worked with companies such as Akamai and Stripe. He gave us seven tips for getting started:
While this isn’t always feasible, Kevin is a big believer that everyone involved in building the product should have experience being on call. Having a more expansive on-call rotation not only lightens the load for everyone and makes the company more resilient, it also aligns incentives with the builders. You’re going to be a lot more willing to invest in code review and integration testing if you’re the one being paged when the code breaks. What you want to establish, he says, is a sense that “we’re all responsible here, we’re building big things that have a lot of people depending on them, and we have to act like it.”
As your company grows, you’ll draw from experience and wind up with a more formal process. Riggle says incident management is not something you can teach off a slide deck, or even in a classroom setting. He had to learn, mobilize, and adapt on the fly, in what was part training process and part hazing ritual. In addition to that kind of apprenticeship, he recommends listening to seasoned incident managers tell “war stories” about their past incidents.
There are a number of good options here, but for simplicity Kevin recommends starting with an email list for internal incident response communications. You can complement this with an incident document that captures the tl;dr information needed to get participants up to speed, and a Slack channel for side conversations.
While you can get by with the janky email-and-Google-Docs setup, most companies do end up building or running some sort of webform that stores templates for the initial email and takes care of some of the paging mechanics. It might also have added features like prepopulating a Google Doc and Slack channel. For a good all-in-one cloud solution, check out the LIR product.
If possible, avoid hosting response channels on your own infrastructure: if the incident has taken your systems down, you still need somewhere to coordinate.
There are nearly a half dozen ostensible incident roles, but of them, __only the incident commander is strictly necessary__.
The incident commander is responsible for guiding an incident to its resolution, managing the plan, communication, and people involved. While the role actually requires very little technical knowledge and should NOT be hands-on-keyboard, the incident commander is usually a senior individual contributor who knows who is who in the org, and can command respect and trust.
The initial incident email should be distributed as widely as possible. In a decade in the industry, Kevin has seen many incidents fail to receive appropriate attention due to lack of an email declaring the incident. On the other hand, he’s rarely seen incidents made worse due to such an email. If you’re concerned about security, you can always leave sensitive details out of the message.
The initial email should communicate:
- what the issue is
- the severity of the issue
- how far along you are in response to the issue
- who is managing the response coordination
- where the team is coordinating
- who else is involved and in what capacity.
This may seem very formulaic and prescribed, but good communication habits allow for flexibility during the response, even before developing a formal process. Every recipient should understand what the severities mean, from a mild “temporary internal outage,” to a severe “the future of the company is at stake” crisis.
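As a minimal sketch of putting that list into practice, the declaration can be rendered from a stored template so that no field is forgotten. The field names, severity labels, and people below are illustrative assumptions, not part of any standard:

```python
from string import Template

# Illustrative template covering the six points above; the wording
# and the severity scale are assumptions, not a standard.
INITIAL_EMAIL = Template("""\
INCIDENT DECLARED: $summary
Severity: $severity ($severity_meaning)
Response phase: $phase
Incident commander: $commander
Coordinating in: $channel
Also involved: $participants
""")

def declare_incident(**fields):
    """Render the initial incident email; raises KeyError if a field is missing."""
    return INITIAL_EMAIL.substitute(fields)

email = declare_incident(
    summary="Checkout API returning 500s for ~20% of requests",
    severity="SEV2",
    severity_meaning="major customer-facing degradation",
    phase="investigating",
    commander="Priya N.",
    channel="#inc-checkout-500s (Slack) / incident-response@ list",
    participants="Jim (payments SME), Ana (customer liaison)",
)
print(email)
```

Using `Template.substitute` (rather than `safe_substitute`) makes the helper fail loudly when any of the six fields is left out, which is exactly the failure mode you want at declaration time.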
Every follow-up email should contain:
- the incident’s current severity level
- the incident’s phase
- the incident manager and the members involved
- any changes or updates.
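A small helper can keep those follow-ups uniform. The subject-line format and field layout here are an assumed convention for illustration, not a standard:

```python
def followup_email(incident_id, severity, phase, manager, members, update):
    """Format a follow-up update covering the four items above."""
    # Encoding severity and phase in the subject lets recipients triage at a glance.
    subject = f"[INC-{incident_id}][{severity}][{phase}] {update}"
    body = "\n".join([
        f"Severity: {severity}",
        f"Phase: {phase}",
        f"Incident manager: {manager}",
        f"Responders: {', '.join(members)}",
        "",
        f"Update: {update}",
    ])
    return subject, body

subject, body = followup_email(
    42, "SEV2", "mitigating", "Priya N.",
    ["Jim (payments)", "Ana (customer liaison)"],
    "Failing over checkout to the replica database.",
)
print(subject)
```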
If the incident affects customers, Kevin suggests adding a customer liaison, who should send out a statement acknowledging the incident even if a plan for resolution has not been worked out yet. For customers, the uncertainty is often more disconcerting than the incident itself, and keeping them informed preempts frustration and allows them to mitigate their own potential downtime and losses.
Tech companies these days often employ a microservice architecture, with up to thousands of services, each handling a single task – pricing, routing, deep learning – talking to each other over the network. The benefits of microservices are manifold, but they can add to the confusion during an incident. An upstream dependency might go down, bringing your service along with it, and you have no idea what the service is, what it does, or who is working on it.
One trick that Kevin says worked really well at Akamai was keeping a static HTML page listing who was on call for every single system. During incidents, the incident commander could call on these engineers as subject-matter experts. Instead of having to search through internal documentation or design docs for their names, they were aggregated in one place, making it much easier to assemble the incident response team.
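A page like that can be generated from a simple service-to-owner mapping. The services, names, and layout below are made up for illustration; in practice the mapping might be pulled from your scheduling tool or a checked-in config file:

```python
from html import escape

# Hypothetical service -> on-call mapping for illustration only.
ON_CALL = {
    "pricing": "Dana (dana@example.com)",
    "routing": "Luis (luis@example.com)",
    "deep-learning": "Mei (mei@example.com)",
}

def render_oncall_page(on_call):
    """Render a single static HTML table of who is on call for each system."""
    rows = "\n".join(
        f"<tr><td>{escape(svc)}</td><td>{escape(person)}</td></tr>"
        for svc, person in sorted(on_call.items())
    )
    return (
        "<html><body><h1>Who is on call</h1>"
        f"<table>\n{rows}\n</table></body></html>"
    )

page = render_oncall_page(ON_CALL)
# In practice you'd write this to a file served from simple static hosting
# (ideally not your own production infrastructure).
print(page)
```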
In the aftermath of an incident, there will be a postmortem meeting where the incident is discussed, the chronology is reviewed, and action items are divvied out.
Kevin advises pulling engineering leadership into these meetings. By engineering leadership, he means not just the middle managers or senior ICs, but the VPs of engineering and CTOs who determine engineering priorities and control the budget. At Akamai, Kevin recalls, true strides in reliability were made only after these decision makers were made acutely aware of incidents and their impacts, because that’s when engineers were hired or reallocated to refactor buggy systems or take over ownership of orphaned ones.
Finally, the incident commander will assign someone to write the postmortem report. This report is a written record of the incident that should cover:
- The incident’s cause
- The incident’s impact
- Steps taken to resolve the incident
- Steps to take to prevent the incident from happening again
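Those four sections can be stubbed out programmatically so every report starts from the same skeleton. The section headings and layout here are an assumed convention, not any particular product’s template:

```python
# Section names mirror the four items above; comments note the blameless framing.
SECTIONS = [
    "Cause",       # what failed in the system or process, not who erred
    "Impact",      # who was affected, for how long, and how badly
    "Resolution",  # steps taken to resolve the incident
    "Prevention",  # steps to keep it from happening again
]

def postmortem_skeleton(title, date):
    """Return a Markdown skeleton for a blameless postmortem report."""
    lines = [f"# Postmortem: {title} ({date})", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines)

print(postmortem_skeleton("Checkout API outage", "2024-05-01"))
```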
There are a number of postmortem templates available across the internet, and the LIR product provides a library of them, but perhaps the most important characteristic of a postmortem is that it should be blameless.
This means that engineers whose actions contributed to the incident should be able to give a detailed account of what happened without fear of punishment or retribution. In particular, the emphasis in the written report should be on the failures of the systems or processes to catch human error, rather than the human error itself. “Jim pushed bad code” should never be the takeaway of a postmortem.
Which brings us to Kevin’s final suggestion. The postmortem is about improving future performance, and it often includes action items: things like adding checks in the code, setting up new alerts, or larger undertakings like decommissioning a whole system. There’s a temptation, in the immediate aftermath of an incident, when the chaos and impact are still fresh, to take drastic, ambitious action to make sure it never happens again. Over time, though, this urgency fades and most action items are never completed. Kevin advises assigning only action items that must be done in the next few days; everything else should go into quarterly planning meetings.
As you can see from these tips, the core of a good incident response process is actually cultural: blameless postmortems, robust communication, and universal accountability. On this foundation, tooling like LIR’s can make the process more seamless and integrated. But it starts with the basics.