Incident Communication Best Practices
by Michelle Ho
When incidents occur, your team relies on efficient communication to restore service functionality and retain customer confidence. Internally, you need to quickly mobilize the right people and resources and keep important stakeholders updated. Externally, you need to stave off a potential customer service nightmare.
Formalizing a communication plan is an essential part of the preparation stage of the incident response lifecycle. Even if the nature of future incidents is unpredictable, you can set protocols for your team like:
- What constitutes an incident? This is important both internally and for communicating with users/customers.
- What communication channels does your team use?
- What is the plan for a secure communication method if the incident is impacting an organization’s own infrastructure?
- Who takes point on communication inside your team, and who is the designated customer liaison?
- Which social media channels will be used during an incident?
Below, we share some best practices for incident response communication.
Prior to an incident, the organization should establish which communication channels will be used internally and externally.
Incident dashboard - an incident dashboard that is accessible to anyone in the company, hosted at an easy-to-remember subdomain, is a high-leverage way of broadcasting incidents. If anything is not working, any employee can first check the incident dashboard to see whether a relevant incident has been reported.
Email - incident response email updates and threads are one of the best ways to keep everyone on the same page, as information can easily get lost in chat tools when everyone is typing simultaneously.
Chat tools - chat tools work well to reduce context switching and sync conversations. Although too dynamic for a main incident channel, they are good for backchannel discussions.
Video and phone conferences - video conferencing and phone calls enable real-time conversations. Team members can discuss options and set action plans on Zoom calls and phone bridges. One member should be responsible for taking minutes and sending out an email summary.
A dedicated status page is the most centralized way to convey real-time status information. Customers benefit from being able to look at the status page to see if the issue they are experiencing has been reported, and if so, any relevant information or timeframe for resolution. Having a status page allows affected parties to proactively find information, reducing the flood of duplicate support requests. It can also be configured to send email, SMS, or app notifications. Your team can focus on fixing the problem instead of handling support tickets.
Social media channels like Twitter can be employed as part of your incident communication strategy, but should not be the main channel.
SMS messaging may be preferred when it comes to critical inbound alerts like a downtime announcement. However, it’s also a channel where people rapidly experience message fatigue and will unsubscribe upon receiving too many messages they don’t find relevant.
The organization should establish templates for both internal and external communications. Having pre-written templates ready to go reduces the likelihood that important information will be left out of messages sent in the heat of the incident.
A sample internal email communication might look something like this:
Subject: ap-southeast-3 Aurora database has stopped responding to requests
Severity 2, phase 1. I am incident commander. Coordinating via #2022-02-01-database-policies-table-failure channel on Slack.
The ap-southeast-3 Aurora database cluster stopped responding to requests at 00:04 UTC (16:04 PT). Users are not able to access their policies; new policies being created are not being saved.
Customer liaison: Dan M.
Business liaison: Alice S.
Subject-matter experts: Jamie N., Eric B.
This is a short email, but it contains all the relevant information.
- The title gives a quick summary of the symptoms of the incident
- The severity gives readers a sense of how bad the incident is
- The phase describes how far along you are in resolving the incident:
  - Phase 1 is the discovery phase.
  - Phase 2 is the acute remediation phase to mitigate impact.
  - Phase 3 is the recovery phase.
  - Phase 4 is the postmortem.
Each time you move into a new phase, you should send out a followup email.
- The incident commander is the person in charge of the incident response, who makes all decisions.
- The coordination location tells readers where the response is being coordinated. This might be a Slack channel, a Zoom room, or another virtual space. Lightstep Incident Response provides a workspace where incident documents can be stored and links to other communication channels provided.
- The remaining lines list the subject-matter experts, liaisons, and others involved in the incident response.
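As an illustration only, the fields described above could be captured in a small template so that none of them is forgotten under pressure. The following Python sketch is an assumption about how a team might structure such a template; the class name, field names, and phase labels are hypothetical and are not part of any Lightstep tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Phase labels follow the four phases described in this article.
PHASES = {1: "discovery", 2: "acute remediation", 3: "recovery", 4: "postmortem"}


@dataclass
class InternalIncidentEmail:
    """Hypothetical template for the initial internal incident email."""
    summary: str        # short symptom summary, used as the subject line
    severity: int       # how bad the incident is
    phase: int          # 1-4, how far along the response is
    commander: str      # incident commander
    coordination: str   # where coordination happens, e.g. a Slack channel
    description: str    # what is happening and who is affected
    roles: Dict[str, List[str]] = field(default_factory=dict)

    def render(self) -> str:
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase}")
        lines = [
            f"Subject: {self.summary}",
            f"Severity {self.severity}, phase {self.phase} ({PHASES[self.phase]}). "
            f"Incident commander: {self.commander}. "
            f"Coordinating via {self.coordination}.",
            self.description,
        ]
        # One line per role, e.g. "Customer liaison: Dan M."
        for role, people in self.roles.items():
            lines.append(f"{role}: {', '.join(people)}")
        return "\n".join(lines)
```

Calling `render()` on a filled-in instance produces a plain-text message in the same shape as the sample email, which can then be pasted into whatever mail or chat tool the team uses.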
External communications announcing the incident should be less technical and detailed than internal communications, but they should still include:
- What the problem is
- How the customer/user might be affected
- How far along in the response you are
- When the customer can expect the next update
An example message for an initial external communication might be something like:
We are investigating an emerging issue where users in India and Indonesia are not able to access their existing policies or create new policies. We are actively investigating the issue and will provide another update within the next 60 minutes.
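One possible way to keep the "next update" promise concrete is to compute the deadline when the message is written. The Python sketch below is a hypothetical helper, not an existing API: the function name, its parameters, and the 60-minute default are assumptions chosen to mirror the example message.

```python
from datetime import datetime, timedelta, timezone


def external_update(problem: str, impact: str, status: str,
                    now: datetime, next_update_minutes: int = 60) -> str:
    """Build a customer-facing update covering the four items listed above."""
    if now.tzinfo is None:
        # Refuse naive datetimes so the published time is unambiguous.
        raise ValueError("now must be timezone-aware")
    next_update = now.astimezone(timezone.utc) + timedelta(minutes=next_update_minutes)
    return (
        f"{problem} {impact} {status} "
        f"We will provide another update by {next_update:%H:%M} UTC."
    )
```

Stating an absolute time rather than "within 60 minutes" is a design choice in this sketch: it stays accurate even if the message is read some time after it was posted.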
If an incident is sufficiently serious, appoint a business liaison to communicate with business and executives about what is happening and potential business consequences. This avoids situations like the CEO having to badger individual engineers for updates, which can be distracting and intimidating.
If the incident affects customers, have a customer liaison or communication lead send out a statement acknowledging the incident even if a plan for resolution has not been worked out yet. For customers, the uncertainty is often more disconcerting than the incident itself, and keeping them informed preempts frustration and allows them to mitigate their own potential downtime and losses.
While different incidents require different degrees of communication, the following things must always be communicated as broadly as possible in an initial communication:
- Symptoms of the incident and systems/services impacted
- Severity of the incident
- How far along in the response you are
- Incident commander in charge of coordinating the response [Internal Communication]
- Where the coordination is taking place [Internal Communication]
- Who else is involved and in what capacity
If it’s a sensitive incident, you can mitigate security-related concerns by leaving details out of the initial communication.
After the initial email has been sent, all subsequent emails should include at least:
- The incident’s current severity
- The incident’s current phase
- The incident commander, including any changes.
- Where the coordination is taking place, including Zoom, Slack links, etc.
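A lightweight check before sending can catch a follow-up email that omits one of these items. The Python sketch below assumes updates are represented as plain dictionaries; the field names and the function are hypothetical and only illustrate the idea.

```python
# Minimum fields every follow-up email should carry, per the list above.
REQUIRED_FIELDS = ("severity", "phase", "commander", "coordination")


def missing_fields(update: dict) -> list:
    """Return the required fields that are absent or empty in an update."""
    return [
        name for name in REQUIRED_FIELDS
        if update.get(name) is None or update.get(name) == ""
    ]
```

An empty list means the update is complete; anything else names what still needs to be filled in before the email goes out.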
Communication After the Incident
After the incident’s resolution, your team will want to hold an incident postmortem meeting to discuss the incident, review the chronology and actions taken, and assign follow-up tasks to prevent future incidents. A written postmortem document should cover:
- The incident’s cause
- The incident’s impact
- The steps taken to resolve the incident
- The steps taken to prevent the incident from happening again.
This document is about improving future performance and service resilience.
After the incident is resolved, your team should provide a simple and direct incident resolution message. This message should:
- Acknowledge the problem
- Empathize with those affected and apologize
Depending on the incident type and audience sophistication, it should also:
- Explain what went wrong
- Explain what was done to fix the incident
- Explain what was done or will be done to prevent repeat incidents.
For example, here’s a public postmortem that Robinhood published after its March 2020 outage:
When it comes to your money, we know how important it is for you to have answers. The outages you have experienced over the last two days are not acceptable and we want to share an update on the current situation.
Our team has spent the last two days evaluating and addressing this issue. We worked as quickly as possible to restore service, but it took us a while. Too long. We now understand the cause of the outage was stress on our infrastructure—which struggled with unprecedented load. That in turn led to a “thundering herd” effect—triggering a failure of our DNS system.
Multiple factors contributed to the unprecedented load that ultimately led to the outages. The factors included, among others, highly volatile and historic market conditions; record volume; and record account sign-ups.
Our team is continuing to work to improve the resilience of our infrastructure to meet the heightened load we have been experiencing. We’re simultaneously working to reduce the interdependencies in our overall infrastructure. We’re also investing in additional redundancies in our infrastructure.
As you can see, the postmortem acknowledges the outages and that they are unacceptable. It then explains what happened (“unprecedented load”, “failure of our DNS system”) and what is being done to prevent this from happening again in the future (“improve resilience of infrastructure” and “reduce interdependencies”).
Note that the public postmortem is written at a much higher level of abstraction than the internal postmortem would be, but it still captures the technical essence of the incident – a thundering herd effect that resulted in a failure of the DNS system.
Clear and effective communication practices, both internally and externally, are the foundation of good incident response.
Lightstep Incident Response provides a consolidated platform for all your incident communication needs, from declaring an initial incident to alerting the on-call team to providing a status page that can be viewed by stakeholders. It integrates nicely with your existing communication channels like Zoom and Slack, stores playbooks and templates, and reduces the manual ops work involved in incident response communication.