Incident Response Communication
by Michelle Ho
When incidents occur, your team relies on efficient communication to restore service functionality and retain customer confidence. Internally, you need to quickly mobilize the right people and resources and keep important stakeholders updated. Externally, you need to stave off a potential customer service nightmare.
While different incidents require different degrees of communication, the following things must always be communicated as broadly as possible:
- Symptoms of the incident and systems/services impacted
- Severity of the incident
- How far along in the response you are
- Incident commander in charge of coordinating the response [Internal Communication]
- Where the coordination is taking place [Internal Communication]
- Who else is involved and in what capacity
If it’s a sensitive incident, you can mitigate security-related concerns by leaving details out of the initial communication.
After the initial email has been sent, all subsequent emails should include at least:
- The incident’s current severity
- The incident’s current phase
- The incident commander, including any changes
- Where the coordination is taking place (Zoom link, Slack channel, etc.)
A sample communication might look something like this:
From: firstname.lastname@example.org
To: email@example.com
Subject: ap-southeast-3 Aurora database has stopped responding to requests

Severity 2, phase 1. I am incident commander. Coordinating via the #2022-02-01-database-policies-table-failure channel on Slack.

The ap-southeast-3 Aurora database cluster stopped responding to requests at 0:04 UTC (17:04 PT). Users are not able to access their policies; new policies being created are not being saved.

Customer liaison: Dan M.
Business liaison: Alice S.
Subject-matter experts: Jamie N., Eric B.
This is a short email, but it contains all the relevant information.
- The title gives a quick summary of the symptoms of the incident.
- The severity gives readers a sense of how bad the incident is. While there is no objective definition for Severity 1 vs. Severity 2, in general, incidents that have non-trivial customer-facing impact are automatically Severity 2. Do not spend more than a few minutes choosing the severity of an incident; you can always upgrade or downgrade it later. What matters more is that everyone at the company roughly understands the significance of each severity level.
- The phase describes how far along you are in resolving the incident. Phase 1 is the discovery phase, phase 2 is the acute remediation phase to mitigate impact, phase 3 is the recovery phase, and phase 4 is the postmortem. Each time you move into a new phase, send a follow-up email.
- The incident commander is the person in charge of the incident response and makes the final decisions.
- Where coordination is taking place: a Slack channel, a Zoom room, or another virtual space. Lightstep Incident Response provides a workspace where incident documents can be stored and links to other communication channels collected.
- The list of subject-matter experts, liaisons, and others involved in the incident response, and in what capacity.
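These structured fields (severity, phase, commander, coordination channel, responders) lend themselves to a simple template. Here is a minimal sketch in Python; the `Incident` dataclass and `Phase` enum are illustrative names for this article, not part of any particular tool:

```python
from dataclasses import dataclass
from enum import IntEnum

class Phase(IntEnum):
    """The four incident phases described above."""
    DISCOVERY = 1
    ACUTE_REMEDIATION = 2
    RECOVERY = 3
    POSTMORTEM = 4

@dataclass
class Incident:
    summary: str
    severity: int
    phase: Phase
    commander: str
    channel: str
    responders: dict[str, list[str]]  # role -> names

    def render_update(self) -> str:
        """Render the body of an internal update email."""
        lines = [
            f"Subject: {self.summary}",
            f"Severity {self.severity}, phase {self.phase.value} ({self.phase.name.lower()}).",
            f"Incident commander: {self.commander}.",
            f"Coordinating via {self.channel}.",
        ]
        for role, names in self.responders.items():
            lines.append(f"{role}: {', '.join(names)}")
        return "\n".join(lines)

incident = Incident(
    summary="ap-southeast-3 Aurora database has stopped responding to requests",
    severity=2,
    phase=Phase.DISCOVERY,
    commander="Jane D.",  # hypothetical name for illustration
    channel="#2022-02-01-database-policies-table-failure on Slack",
    responders={"Subject-matter experts": ["Jamie N.", "Eric B."]},
)
print(incident.render_update())
```

Because every update is rendered from the same fields, a follow-up email automatically restates the current severity, phase, commander, and channel, which is exactly the list required above.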
External communications in announcing the incident should be less technical and detailed than the internal communications, but it should still include:
- What the problem is
- How might the customer/user be affected
- How far along in the response you are
- When the customer can expect the next update
An example message for an initial external communication might be something like:
We are investigating an emerging issue where users in India and Indonesia are not able to access their existing policies or create new policies. We are actively investigating the issue and will provide another update within the next 60 minutes.
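Committing to a next-update time is easy to forget under pressure, so it can help to bake the deadline into the template. A small sketch of that idea; the `external_update` helper and its wording are assumptions for illustration, not a standard API:

```python
from datetime import datetime, timedelta, timezone

def external_update(problem: str, next_update_minutes: int) -> str:
    """Render a customer-facing update that always names a next-update time."""
    deadline = datetime.now(timezone.utc) + timedelta(minutes=next_update_minutes)
    return (
        f"We are investigating an emerging issue where {problem}. "
        f"We are actively investigating the issue and will provide "
        f"another update by {deadline:%H:%M} UTC."
    )

print(external_update(
    "users in India and Indonesia are not able to access their existing policies",
    next_update_minutes=60,
))
```

Computing the deadline rather than typing it keeps the promise concrete, and the same helper can be reused for each subsequent update.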
Now that you have crafted the incident communication, you need to distribute it. Your team will designate different channels for internal and external communications.
- Incident dashboard: an incident dashboard that is accessible to anyone in the company, hosted at an easy-to-remember subdomain, is a high-leverage way of broadcasting incidents. If anything is not working, any employee can first check the dashboard to see whether a relevant incident has been reported.
- Email: incident response email updates and threads are one of the best ways to keep everyone on the same page, as information can easily get lost in chat tools when everyone is typing simultaneously.
- Chat tools: chat tools reduce context switching and work well for synchronous conversations. Although too dynamic to serve as the main incident channel, they are good for backchannel discussions.
- Video and phone conferences: video conferencing and phone calls enable real-time conversations. Team members can discuss options and set action plans over Zoom calls and phone bridges. One member should be responsible for taking minutes and sending out an email summary.
- Status page: a dedicated status page is the most centralized way to convey real-time status information. Customers can check it to see whether the issue they are experiencing has already been reported and, if so, find any relevant information or a timeframe for resolution. A status page lets affected parties proactively find information, eliminating the flood of duplicate support requests, and can also be configured to send email, SMS, or app notifications. Your team can focus on fixing the problem instead of handling support tickets.
- Social media and SMS: channels like Twitter can be employed as part of your incident communication strategy, but should not be the main channel. The immediacy of SMS may be preferred for critical inbound alerts like a downtime announcement; however, recipients rapidly experience message fatigue and will unsubscribe after receiving too many messages they don't find relevant.
After the incident’s resolution, your team will want to hold an incident post-mortem meeting to discuss the incident, review the chronology and actions taken, and assign follow-up tasks to prevent future incidents. A written post-mortem document should cover:
- The incident’s cause
- The incident’s impact
- The steps taken to resolve the incident
- The steps taken to prevent the incident from happening again
This document is about improving future performance and service resilience.
After the incident is resolved, your team should provide a simple and direct incident resolution message. This message should:
- Acknowledge the problem
- Empathize with those affected and apologize
Depending on the incident type and audience sophistication, it should also:
- Explain what went wrong
- Explain what was done to fix the incident
- Explain what was done or will be done to prevent repeat incidents
For example, here’s a public postmortem that Robinhood published after its March 2020 outage:
When it comes to your money, we know how important it is for you to have answers. The outages you have experienced over the last two days are not acceptable and we want to share an update on the current situation.

Our team has spent the last two days evaluating and addressing this issue. We worked as quickly as possible to restore service, but it took us a while. Too long.

We now understand the cause of the outage was stress on our infrastructure—which struggled with unprecedented load. That in turn led to a “thundering herd” effect—triggering a failure of our DNS system. Multiple factors contributed to the unprecedented load that ultimately led to the outages. The factors included, among others, highly volatile and historic market conditions; record volume; and record account sign-ups.

Our team is continuing to work to improve the resilience of our infrastructure to meet the heightened load we have been experiencing. We’re simultaneously working to reduce the interdependencies in our overall infrastructure. We’re also investing in additional redundancies in our infrastructure.
As you can see, the postmortem acknowledges the outages and that they are unacceptable. It then explains what happened – “unprecedented load”, “failure of our DNS system” – and what is being done to prevent it from happening again – “improve resilience of infrastructure” and “reduce interdependencies.” Note that the public postmortem is written at a much higher level of abstraction than the internal postmortem would be, but it still captures the technical essence of the incident: a thundering herd effect that resulted in a failure of the DNS system.
Clear, effective communication practices, both internal and external, are the foundation of good incident response. On top of this foundation, you can leverage tooling like Lightstep Incident Response to automate and amplify those practices. But it starts with the basics.
Learn more about the roles and responsibilities of an incident response management team.