Incident Management Best Practices
by Keanan Koppenhaver
No matter how well architected your application is or how robust the infrastructure supporting your code is, there will be times when things go wrong. Someone might ship a bug, or a big component of Amazon Web Services (AWS) might go down and take your entire application with it. When something like this happens, it's stressful and can make it difficult for you and your team to think clearly in order to resolve the issue. That's why incident management is so important.
Creating an incident management plan and policies around what to do when things go wrong means that in the heat of the moment, your team won't have to think about what to do. Instead, everyone can follow practices they're already familiar with and begin resolving the incident immediately.
In this article, you'll take a look at some incident management best practices.
You’ll learn how to define an incident, team roles, and responsibilities, as well as how to ensure resource access, key performance indicator (KPI) monitoring, and incident management resolution. You’ll also learn about some tools that can help you better manage incidents when they do arise so that you and your team can efficiently solve whatever problem you’re experiencing.
The first step in incorporating incident management into your organization is acknowledging that despite all your best efforts, there will be times when things don't go as planned. And when things happen that are outside your control, you need to know how to handle the issues rather than trying to create a plan while the incident is occurring.
That's not to say that you'll be able to plan for every contingency or anticipate every sort of bug or outage you might have, but creating an incident management plan is about building a framework that helps everyone know what their responsibilities are and gives procedures to follow that will help you handle an incident quickly and efficiently.
When incident management best practices are followed, downtime is reduced and overall site reliability improves. An effective incident management process that reviews incidents and establishes quality control measures also helps incidents occur less frequently over time.
With that in mind, what are some of the things you can do to improve the incident management processes within your organization? Teams need to be aligned and work towards the same goal. You can do this by following effective incident management best practices and ensuring that incident communication is in place.
IT Infrastructure Library (ITIL) incident management ensures that there is a standard process to follow in order to resolve each incident, no matter how different each individual incident may be. This helps ensure that every incident reaches a satisfactory resolution and can be reviewed in the future if more information is needed. Most incident management processes take the following steps:
Incident logging: discovering the issue and identifying it as a problem that needs to be dealt with.
Incident categorization and prioritization: deciding as a team how serious the incident is and whether it needs to be dealt with right away or can be reprioritized later.
Incident assignment: deciding which department or team needs to work on the incident.
Task creation and management: breaking down the tasks that are needed to resolve the incident.
Escalation: bringing in other departments to assist with incident resolution (if needed).
Incident resolution: resolving the incident and communicating to all the relevant parties that the incident has been resolved.
These steps standardize the process for each incident lifecycle and help the entire organization resolve any incident more efficiently. In addition, following the ITIL incident management process means that no necessary steps are excluded, which can happen when an incident management process is more impromptu.
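The steps above can be sketched as a small state machine that refuses to skip a stage. This is only an illustration: the `Stage` names, the `TRANSITIONS` table, and the `Incident` class are hypothetical and are not part of ITIL or of any particular tool.

```python
from enum import Enum


class Stage(Enum):
    LOGGED = "logged"
    CATEGORIZED = "categorized"    # categorization and prioritization
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"    # task creation and management
    ESCALATED = "escalated"
    RESOLVED = "resolved"


# Allowed transitions. Escalation is optional, so work in progress
# may move straight to resolution.
TRANSITIONS = {
    Stage.LOGGED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.ASSIGNED},
    Stage.ASSIGNED: {Stage.IN_PROGRESS},
    Stage.IN_PROGRESS: {Stage.ESCALATED, Stage.RESOLVED},
    Stage.ESCALATED: {Stage.RESOLVED},
    Stage.RESOLVED: set(),
}


class Incident:
    def __init__(self, title):
        self.title = title
        self.stage = Stage.LOGGED
        self.history = [Stage.LOGGED]

    def advance(self, new_stage):
        """Move to new_stage, rejecting any move that skips a step."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.value} to {new_stage.value}"
            )
        self.stage = new_stage
        self.history.append(new_stage)
```

Encoding the order this way means an impromptu shortcut, such as marking a freshly logged incident as resolved, fails loudly instead of silently dropping a step.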
Especially when multiple people are working to resolve an incident, it's important that the incident is well defined. Is the website down, so that the incident is resolved once it's back up? Is customer data being lost, making your product unusable? Whatever it is, if everyone in your organization is on the same page with regard to the incident, you'll be able to resolve issues more quickly.
One efficient way to define an incident is to categorize it by its impact and urgency, which together determine its severity. For example, an incident that impacts a large segment of your customers and needs to be fixed urgently should be prioritized higher and given more resources because it’s a higher severity than an incident that only occasionally affects a small subset of customers.
For example, an incident’s priority might be determined by the following matrix:
In the table above, P1 incidents (or major incidents) are the highest priority and should be handled immediately, whereas P4 and P5 incidents can usually be logged to investigate later as they haven’t progressed to a level of severe impact or urgency.
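One common way to derive a P1 to P5 label is to combine an impact rating with an urgency rating. The sketch below assumes three levels for each and the rule "impact rank plus urgency rank minus one". Both the levels and the rule are illustrative assumptions, not the exact matrix this article refers to.

```python
# Hypothetical ranks: 1 is the most severe, 3 the least.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Return a label from P1 (handle immediately) to P5 (log for later)."""
    return f"P{LEVELS[impact] + LEVELS[urgency] - 1}"


# Example: an outage hitting most customers that must be fixed now.
print(priority("high", "high"))   # P1
print(priority("low", "low"))     # P5
```

Your own matrix may weight impact and urgency differently. The point is that the mapping is written down before an incident happens rather than debated during one.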
In addition to clearly defining the incident, it's important that everyone on the team knows what their responsibilities are. If this is a public-facing incident, someone needs to handle customer communication, making sure users get frequent updates on how the resolution is going and when they can expect the bug to be fixed. This responsibility often falls to a different person or team from those actually working on resolving the incident, so the two need to communicate frequently with each other in order to ensure customer satisfaction.
The roles on the incident response team might break down as follows, although some teams may require additional roles:
The Incident Commander, or Incident Manager, leads the incident resolution effort and coordinates all the teams. This is generally the first person that is paged by default.
The Investigative Lead works directly with those who are researching the problem and coming up with potential solutions.
With a customer-facing incident, the Communication Lead is in charge of communicating updates as the process progresses and ensuring customer satisfaction.
The Documentation Lead keeps track of everything as the entire team moves through the process. The information they gather and document is especially important for generating a final report for the entire incident lifecycle.
If there are multiple people or teams working on investigating the incident, it's important that each of them knows their role so that multiple conflicting fixes don't get deployed and make the incident worse.
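A quick check that every role has an owner can catch gaps before work starts. The following is a minimal sketch: the role keys and the `missing_roles` helper are hypothetical, and treating the Communication Lead as optional for internal-only incidents is an assumption drawn from the description above.

```python
REQUIRED_ROLES = (
    "incident_commander",
    "investigative_lead",
    "communication_lead",
    "documentation_lead",
)


def missing_roles(assignments, customer_facing=True):
    """List required roles that nobody has been assigned to yet."""
    required = [
        role for role in REQUIRED_ROLES
        # Assumption: internal-only incidents can skip the Communication Lead.
        if customer_facing or role != "communication_lead"
    ]
    return [role for role in required if not assignments.get(role)]


# "Ada" is a placeholder name for illustration.
print(missing_roles({"incident_commander": "Ada"}))
```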
To fully resolve an incident, the incident management team needs to ensure that the metrics and KPIs that were affected by the incident have returned to normal parameters and that all the affected parties have been informed that the incident is resolved. A formal report about the entire incident lifecycle will be provided with more detail at a later date, but there needs to be a high degree of confidence that the issue is fully solved with enough information captured to generate the report.
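The two conditions described here, KPIs back within normal parameters and all affected parties informed, can be expressed as a single check. This is a sketch under stated assumptions: the function name, the KPI names, and the ranges in the example are all hypothetical.

```python
def is_resolved(kpis, normal_ranges, notified, affected_parties):
    """True only when every KPI is back in range and everyone is informed."""
    kpis_ok = all(
        name in kpis and low <= kpis[name] <= high
        for name, (low, high) in normal_ranges.items()
    )
    everyone_told = set(affected_parties) <= set(notified)
    return kpis_ok and everyone_told


# Hypothetical normal ranges as (minimum, maximum) pairs.
ranges = {"uptime_pct": (99.9, 100.0), "error_rate_pct": (0.0, 1.0)}
current = {"uptime_pct": 99.95, "error_rate_pct": 0.4}
print(is_resolved(current, ranges, {"customers", "support"}, ["customers", "support"]))
```

A KPI that is missing from the current readings counts as not yet normal, so an incident is never closed on incomplete data.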
It’s important that a resolution is found quickly and efficiently in order to continue to provide the best possible experience to your users. If issues continue to linger or happen more frequently, it can undermine confidence in your system and can create bigger issues in the future. To assist with the full resolution of an issue, runbooks are often created ahead of time that detail known processes and tasks to follow in order to diagnose and resolve specific incidents. These are often updated as new incidents are encountered and fixed so that lessons learned during the incident management process can be used to resolve future incidents quickly.
Incidents can range from something very small-scale all the way up to problems that affect the entire company (major incidents). For this reason, it's important to know when to get other departments in the company involved and when to contact other prominent individuals in the organization.
If a particular incident is focused on just the IT department and is not customer-facing, it's probably not important to get the CEO or the marketing department involved in real time, especially if the incident occurs outside business hours. However, if customers’ credit cards are not being processed or some other mission-critical type of incident is occurring, more individuals need to be involved, no matter what time it is.
Having clear policies and best practices as to what defines an incident as needing to be escalated prevents the incident response team from wasting time trying to make these decisions in the moment.
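An escalation policy can be as simple as a function that maps facts about the incident to the groups that should be contacted. The group names and rules below are hypothetical and only illustrate the idea of deciding these questions ahead of time.

```python
def escalation_targets(customer_facing, mission_critical):
    """Return the groups to involve, starting with the on-call IT team."""
    targets = ["it_on_call"]
    if customer_facing or mission_critical:
        # Someone has to keep affected users or stakeholders updated.
        targets.append("communication_lead")
    if mission_critical:
        # For example, payments failing: involve leadership at any hour.
        targets.append("executive_team")
    return targets


print(escalation_targets(customer_facing=False, mission_critical=False))
print(escalation_targets(customer_facing=True, mission_critical=True))
```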
One of the most frustrating aspects of incident management is when the team assigned to investigate and resolve the incident doesn't have the required access to do their jobs. This might take the form of not having the necessary server credentials, access to the correct databases or business records, or any number of things. When this happens, the incident response team can find themselves in the unfortunate situation of knowing how to resolve an incident but having to wait for other stakeholders to give them the access they need. This can happen even when those stakeholders wouldn't otherwise have needed to be involved according to the escalation policies and the roles and responsibilities previously defined.
Not every incident can be predicted fully, but a process that lets key team members request and receive elevated access quickly is crucial to a well-functioning organization. When teams can collaborate with each other clearly and efficiently, the incident management team can escalate as required to whichever team has access to the relevant layer of the app.
Part of defining the incident is knowing what KPIs to measure to alert everyone to the fact that the incident is resolved. In many cases, this is website uptime: if the website is back up and stable, then the incident is resolved. In other cases, key indicators might be something like a certain amount of data being processed in a given time period (if an internal data system was affected) or even something more binary like whether internal company users can submit IT support tickets. These metrics, once defined, are often made publicly available via something like a status page. This means that even those not directly involved in monitoring and responding to potential incidents have access to these KPIs and can know more quickly when changes need to be made in response to changes in the metrics.
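As a worked example of the uptime KPI, uptime over a period is the share of that period during which the service was available. The helper names and the 99.9% default target below are assumptions chosen for illustration.

```python
def uptime_percent(total_minutes, downtime_minutes):
    """Share of the period during which the service was up, in percent."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def status_label(uptime_pct, target_pct=99.9):
    """Label for a status page: does uptime meet the target or not?"""
    return "operational" if uptime_pct >= target_pct else "degraded"


# 10 minutes of downtime in a 1,000-minute window is 99% uptime.
print(status_label(uptime_percent(1000, 10)))
```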
Monitoring and reporting are a crucial part of the incident management process for two reasons. First, they show whether the incident is truly resolved. Second, they provide more detailed information that can be shared with others in the organization about why the incident occurred and what was done to fix it, allowing others to work to avoid similar incidents in the future.
An incident retrospective might look something like the following:
It clearly defines who was involved, what the issue was, and how it was resolved before going into more detail for anyone who wants more information. Producing retrospectives like these helps educate people about past incidents and helps prevent future ones.
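The structure described here, a short header followed by optional detail, could be captured in a small record type. The field names and the `summary` method are hypothetical and are not taken from any specific retrospective template.

```python
from dataclasses import dataclass, field


@dataclass
class Retrospective:
    title: str
    responders: list
    issue: str
    resolution: str
    details: list = field(default_factory=list)  # longer notes for curious readers

    def summary(self):
        """Short header: who was involved, what broke, how it was fixed."""
        return "\n".join([
            f"Incident: {self.title}",
            f"Responders: {', '.join(self.responders)}",
            f"Issue: {self.issue}",
            f"Resolution: {self.resolution}",
        ])
```

Keeping the summary separate from the details lets most readers stop after the first few lines while the full record remains available.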
In any incident, it's important that all teams are on the same page and working in unison so that the process is as smooth as possible. And while it’s possible to manage this process manually, using an incident management platform like Lightstep makes it much easier.
Lightstep allows teams to manage their whole incident process from collaboration and incident roles to stakeholder updates and retrospectives. If you’re looking for a better way to manage incidents within your organization, sign up for a free trial today.
No matter how you manage incidents within your organization, it's important to have a plan ahead of time so that team members know their respective roles and responsibilities and communication stays clear throughout the incident lifecycle. These best practices help guide your entire organization to effectively execute the incident management process.