11 Incident Response Best Practices and Tips
by Michelle Ho
There’s been a lot of chatter around incident response these last few years, so much so that it can be overwhelming. Mature teams can now choose from a whole ecosystem of tools, dashboards, alerting systems, paging systems, root cause detection systems, etc. But what if you’re a maturing team that’s just starting to standardize your incident response? What are some incident response best practices? Here are eleven to start with:
1. Set up and tune your automated alerts
Automated alerts give you visibility into your software systems without having to manually refresh dashboards. But consistently noisy alerts (those that fire on false positives) erode trust and annoy on-call engineers. If you have automated alerts, make sure that the thresholds are tuned appropriately and that correlated alerts are grouped together, so an incident doesn’t set off an overwhelming wave of emails or pings. Both are easy to do in Lightstep Incident Response (LIR).
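To make "grouping correlated alerts" concrete, here is a minimal sketch in Python. It is not how LIR or any particular tool does it. It assumes that alerts for the same service arriving within five minutes of each other are correlated, and the `Alert` fields, the `group_alerts` helper, and the window length are all hypothetical choices for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # hypothetical grouping key; real systems may use richer labels
    name: str
    timestamp: float  # seconds since some reference point


def group_alerts(alerts, window_seconds=300.0):
    """Group alerts per service.

    A new group starts whenever the gap between consecutive alerts for the
    same service exceeds window_seconds (an assumed notion of "correlated").
    """
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_service[alert.service].append(alert)

    groups = []
    for service_alerts in by_service.values():
        current = [service_alerts[0]]
        for alert in service_alerts[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                groups.append(current)
                current = [alert]
        groups.append(current)
    return groups


alerts = [
    Alert("checkout", "high_latency", 0.0),
    Alert("checkout", "error_rate", 60.0),
    Alert("search", "high_latency", 90.0),
    Alert("checkout", "high_latency", 1000.0),
]
groups = group_alerts(alerts)
group_sizes = [len(g) for g in groups]
print(f"{len(alerts)} alerts -> {len(groups)} notifications")
```

Sending one notification per group rather than one per alert is the point: the on-call engineer sees a single page for the checkout burst instead of two.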
2. Every engineer should have experience being on-call
Having a more expansive on-call rotation not only lightens the load for everyone and makes the company more resilient, it also aligns incentives for the people who build the software. You’re going to be a lot more willing to invest in code review and integration testing if you’re the one being paged when the code breaks. What’s important to establish is a sense of responsibility and preparation across the entire org. This aligns with the notion of an error budget for the organization.
3. Have junior engineers shadow senior engineers on-call
Incident management is not something you can teach off a slide deck. Apprenticeship, whether that means having junior engineers shadow senior engineers on-call or listen to their “war stories” of past incidents, is far more effective.
4. Separate the role of incident commander from subject matter expert
The incident commander is responsible for guiding an incident to its resolution, managing the plan, communication, and people involved. The incident commander should NOT be hands-on-keyboard; instead, they should call on subject-matter experts as appropriate. Hot tip: Kevin Riggle, a security consultant who ran the incident response team at Akamai, says one trick that worked well for them was keeping a static HTML page of on-call experts for every single system.
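A static page like that can be generated from a simple mapping of systems to contacts. The sketch below is one hypothetical way to do it in Python, not a description of what Akamai used; the `render_oncall_page` helper and the sample names are made up for illustration.

```python
from html import escape


def render_oncall_page(experts):
    """Render a static HTML table from a mapping of system name -> contact.

    Values are HTML-escaped so contact strings such as
    "Alice <alice@example.com>" display correctly.
    """
    rows = "\n".join(
        f"    <tr><td>{escape(system)}</td><td>{escape(contact)}</td></tr>"
        for system, contact in sorted(experts.items())
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head><title>On-call experts</title></head>\n<body>\n"
        "  <table>\n"
        "    <tr><th>System</th><th>On-call expert</th></tr>\n"
        f"{rows}\n"
        "  </table>\n</body>\n</html>\n"
    )


page = render_oncall_page({
    "search": "Bob",
    "payments": "Alice <alice@example.com>",
})
print(page)
```

Because the output is a plain file with no backend, it can be served from somewhere that stays up even when your main systems are down.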
5. Avoid hosting incident response channels on your own infrastructure
There’s nothing more embarrassing and frustrating than being in the middle of an incident and being unable to remediate or communicate about it because the software used for your observability and incident response management is also down. Hosting your incident response channels on a third-party platform like LIR helps mitigate this risk.
6. Communicate early, often, and broadly
The initial incident declaration, whether sent via email, status page, or stakeholder notifications through a product like LIR, should be distributed as widely as possible. If you’re concerned about security, you can always leave sensitive details out of the message.
The initial declaration should communicate:
- What the issue is
- Severity of the issue
- How far along you are in response to the issue
- Who is managing the response coordination
- Where the team is coordinating
- Who else is involved and in what capacity
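One way to make sure no item on that list gets forgotten under pressure is to treat the declaration as a small structured record. The following Python sketch is only an illustration: the `IncidentDeclaration` class, its field names, and the sample values are assumptions, not a format prescribed by LIR or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentDeclaration:
    summary: str     # what the issue is
    severity: str    # severity of the issue
    status: str      # how far along the response is
    commander: str   # who is managing the response coordination
    channel: str     # where the team is coordinating
    responders: dict = field(default_factory=dict)  # name -> capacity

    def render(self):
        """Return the declaration as a plain-text message."""
        lines = [
            f"Issue: {self.summary}",
            f"Severity: {self.severity}",
            f"Status: {self.status}",
            f"Incident commander: {self.commander}",
            f"Coordination channel: {self.channel}",
        ]
        if self.responders:
            lines.append("Also involved:")
            lines.extend(
                f"  - {name} ({role})" for name, role in self.responders.items()
            )
        return "\n".join(lines)


declaration = IncidentDeclaration(
    summary="Checkout requests are failing intermittently",
    severity="SEV2",
    status="Investigating; mitigation not yet identified",
    commander="Priya",
    channel="#incident-checkout",
    responders={"Sam": "payments subject-matter expert", "Lee": "customer liaison"},
)
message = declaration.render()
print(message)
```

Because every field except `responders` is required, constructing the record fails loudly if, say, the coordination channel is left out.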
7. Don’t keep your customers in the dark
If the incident affects customers, have a customer liaison or communication lead send out a statement acknowledging the incident even if a plan for resolution has not been worked out yet. For customers, the uncertainty is often more disconcerting than the incident itself, and keeping them informed preempts frustration and allows them to mitigate their own potential downtime and losses.
8. Mitigate, then identify and fix the root cause
An active incident is not the time to hunt for the root cause. Always mitigate the effects of the incident first. For instance, if the incident was caused by a bad code change, roll back to the last “good” version of the code before pushing a fix for the new change. Once the mitigation is in place, the post-mortem discussed below is where you work out how to prevent the issue from recurring.
9. Pull engineering leadership into incident review meetings
In the aftermath of an incident, there will be a post-mortem meeting where the incident is discussed, the chronology is reviewed, and action items are divvied up. Engineering leadership (the VPs of engineering and CTOs who set engineering priorities and control the budget) should be pulled into these meetings. True strides in reliability happen only once these decision makers are acutely aware of incidents and their impact, because that’s when engineers are hired or reallocated to reduce tech debt.
10. Have a blameless, feasible, post-mortem culture
Once the incident is resolved, the incident commander will assign someone to write the postmortem report. This report is a written record of the incident that should cover:
- The incident’s cause
- The incident’s impact
- Steps taken to resolve the incident
- Steps to take to prevent the incident from happening again
There are a number of postmortem templates available across the internet (and LIR provides a best-practice template that automatically fills in the summary and timeline from incident details), but perhaps the most important characteristic of a postmortem is that it should be blameless.
This means that engineers whose actions contributed to the incident should be able to give a detailed account of what happened without fear of punishment or retribution. In particular, the emphasis in the written report should be on the failures of the systems or processes to catch human error, rather than the human error itself. “Jim pushed bad code” should never be the takeaway of a postmortem.
11. Be realistic about post-mortem action items
The postmortem is about improving future performance, and it often includes action items: things like adding checks in the code, setting up new alerts, or larger undertakings like decommissioning a whole system. There’s a temptation, in the immediate aftermath of an incident, when the chaos and impact are still fresh, to take drastic, ambitious action to make sure it never occurs again. Over time, though, this urgency fades and most action items are never completed. Assign only the action items that must be done in the next few days. Everything else should go into quarterly planning meetings.
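As a rough illustration of that split, the Python sketch below sorts action items into "assign now" and "send to quarterly planning" based on a due date. The `ActionItem` class, the `triage` helper, and the three-day horizon are all assumptions made for the example; "the next few days" is a judgment call for your team.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActionItem:
    title: str
    due: date  # when the team believes the item must be done


def triage(items, today, horizon_days=3):
    """Split items into (assign_now, quarterly_planning).

    Items due within horizon_days of today are assigned immediately;
    the rest are deferred to planning. The horizon is an assumed default.
    """
    cutoff = today + timedelta(days=horizon_days)
    assign_now = [item for item in items if item.due <= cutoff]
    quarterly_planning = [item for item in items if item.due > cutoff]
    return assign_now, quarterly_planning


today = date(2024, 1, 1)
items = [
    ActionItem("Add alert on checkout error rate", date(2024, 1, 2)),
    ActionItem("Decommission legacy billing system", date(2024, 3, 1)),
]
assign_now, quarterly_planning = triage(items, today)
print([item.title for item in assign_now])
print([item.title for item in quarterly_planning])
```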
As you can see from these best practices, the core of a good incident response process is actually cultural – blameless post-mortems, robust communication, and universal accountability. On this foundation, tooling like Lightstep Incident Response can make the process more seamless and integrated. But it starts with the basics.
Learn more about the roles and responsibilities of an incident response management team and how to build an incident response playbook.