Understanding the Major Incident Management Process
by Keanan Koppenhaver
The major incident management process (also referred to as just incident management in some organizations) is a series of steps and procedures that organizations put in place to help them handle major incidents. The goal of the major incident management (MIM) process is to resolve outstanding incidents as quickly as possible, while minimizing the impact on the business.
It's important to have a plan and process in place before disaster strikes. When you're in the middle of an incident, you don't want to be deciding who will lead communications or the investigation of the incident or any of the other necessary roles, responsibilities, and actions. Having a well-defined process for managing major incidents is critical for ensuring that these incidents are resolved as quickly and effectively as possible.
A major incident is typically defined as an incident that has a significant impact on the business, such as a major service outage or data breach. Major incidents can also be defined based on the severity of the incident and the amount of resources that are required to resolve it.
Major incidents are in a class of their own because they affect large portions of your users, critical portions of your infrastructure, or take your application down entirely. Major incidents should be addressed at all costs and must be handled quickly and efficiently to ensure that the associated fallout is minimal.
This article will look at how handling major incidents differs from any other type of incident management; it will also offer some steps you can take to ensure that a major incident occurs as infrequently as possible within your organization.
A major incident can be declared by a variety of people, depending on the organization. In some cases, only senior executives can declare a major incident. In other cases, the responsibility for declaring a major incident may fall to the IT or operations team.
It's important to have a clear process in place for who can declare a major incident and what criteria must be met for an incident to be classified as major. This will help ensure that major incidents are declared quickly and appropriately, and that resources are mobilized as soon as possible.
Once a major incident has been declared, it's important to inform all relevant parties as soon as possible. This includes everyone who will be involved in the incident response, such as the major incident manager, the technical team, and the communications team.
In addition, major incidents are most often noticed by customers and stakeholders, so managing them well not only ensures that fewer incidents will occur in the future, it also increases your site’s reliability. This is important for keeping your users happy and ensuring that your staff is not overloaded with the constant maintenance of your existing infrastructure.
When major incidents occur, time is always of the essence. Incidents are labeled “major” when they are customer-facing or have a large impact on your product or infrastructure. For this reason, it’s important to be sure that there is always someone on call to begin investigating the issue, regardless of when it happens, and that there is an incident management plan in place to do so.
A well-established and consistent process for handling major incidents ensures that your team will be ready to leap into action and find a resolution quickly and efficiently.
Having a standard framework for major incident management in place is important, but it doesn’t always make sense for a company to create its own. That’s why using something like the **Major Incident Process Flow** can be helpful. According to this process, there are four phases in a major incident management process:
This stage includes all the early steps, such as initial identification of the incident, deciding who needs to be involved in the eventual resolution, and assembling the incident management team. Here, it’s important you have the roles of incident manager, lead investigator, and lead communicator covered. The incident manager coordinates everyone on the team working to manage the incident, the lead investigator coordinates the actual investigation and resolution of the issue, while the lead communicator ensures that stakeholders are informed on the progress of the incident resolution.
In this phase, the incident management team works to find out more about the issue at hand and to determine its severity. Escalation policies also come into play, as once the incident is categorized it should be clear whether other teams are needed to resolve the incident. The incident management plan is put in place, and the true resolution process starts. Communication with stakeholders is also very important during this phase.
The incident will be largely resolved by the recovery phase. The incident management team may start to see the relevant key performance indicators (KPIs) return to normal. However, monitoring should continue to ensure that the incident and any fallout that was caused by it are truly resolved.
Finally, after the root cause of the issue has been found and resolved and the key metrics verified, the incident can be considered closed. This stage involves communicating the resolution to stakeholders, as well as making sure the incident report is complete in terms of data. An incident report is a written report that contains all the information collected about the incident from when it was identified and what was initially done to confirm the issue, all the way through to the eventual resolution and how it was communicated. Because the report requires detailed information from all the phases, data collection throughout the incident management process is very important.
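The four phases described above can be sketched as a small state machine that also collects the timestamped notes an incident report needs. This is a minimal illustration, not part of any specific framework: the phase names `IDENTIFICATION`, `INVESTIGATION`, and `CLOSURE`, as well as the `Incident` class and its methods, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Phase(Enum):
    # Phase names are illustrative labels for the four stages described above.
    IDENTIFICATION = 1
    INVESTIGATION = 2
    RECOVERY = 3
    CLOSURE = 4


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.IDENTIFICATION
    # Each entry is (UTC timestamp, phase name, note); this feeds the incident report.
    timeline: list = field(default_factory=list)

    def log(self, note: str) -> None:
        """Record a timestamped note under the current phase."""
        self.timeline.append((datetime.now(timezone.utc), self.phase.name, note))

    def advance(self, note: str) -> Phase:
        """Move to the next phase in order and log why; phases cannot be skipped."""
        if self.phase is Phase.CLOSURE:
            raise ValueError("incident is already closed")
        self.phase = Phase(self.phase.value + 1)
        self.log(note)
        return self.phase


incident = Incident("checkout outage")
incident.log("alert fired, incident team assembled")
incident.advance("severity confirmed, escalated to database team")
incident.advance("KPIs returning to normal, monitoring continues")
incident.advance("stakeholders informed, report complete")
```

Because every phase change is logged as it happens, the data needed for the final report accumulates during the response instead of being reconstructed afterward.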
If you follow this guidance and ensure each step is taken as part of any incident management response, your organization will be in the best possible position to deal with whatever happens.
First off, it’s very important to define what a "major incident" entails. After all, if every incident is classified as major, your team might start to suffer from alert fatigue; this, in turn, might lead them to not respond to incidents with the required level of urgency. One way to draw the line: other incidents can be recorded and scheduled to be dealt with at a later time, while major incidents are almost always an "all hands on deck" situation.
A major incident could be anything with a large-scale customer impact, especially if it leads to complete downtime of a given service or portion of your application, or affects the ability of your organization to collect revenue from customers. Simply put, it's an incident that threatens the very viability of the business.
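One way to keep this definition consistent is to write the criteria down as code that on-call staff or tooling can apply. The sketch below encodes the three signals mentioned above (complete downtime, blocked revenue, large-scale customer impact). The `Impact` type, the `is_major` function, and the 25% threshold are all hypothetical and should be replaced with your organization's own criteria.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    service_down: bool          # complete downtime of a service or part of the application
    revenue_blocked: bool       # the organization cannot collect revenue from customers
    affected_user_share: float  # fraction of users affected, from 0.0 to 1.0


# Hypothetical threshold for "large-scale customer impact"; tune it for your business.
LARGE_SCALE_USER_SHARE = 0.25


def is_major(impact: Impact) -> bool:
    """Return True if any one of the major-incident criteria is met."""
    return (
        impact.service_down
        or impact.revenue_blocked
        or impact.affected_user_share >= LARGE_SCALE_USER_SHARE
    )


# A slow report page affecting 1% of users is an ordinary incident.
print(is_major(Impact(service_down=False, revenue_blocked=False, affected_user_share=0.01)))
# A payment outage is major even if few users have noticed yet.
print(is_major(Impact(service_down=False, revenue_blocked=True, affected_user_share=0.02)))
```

Keeping the rule explicit makes it harder for every incident to drift into the "major" category, which is what leads to alert fatigue.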
When it comes to major incidents, having the right major incident team in place is essential for a successful resolution. The exact roles and responsibilities will vary with your organization's size. The common roles and responsibilities for major incident management are as follows:
- Service Desk Technicians
As the first line of defense in incident response, service desk technicians are responsible for handling all incoming tickets and escalating them as necessary. They also need to have a good understanding of the organization's major incident management plan so they can properly categorize and route tickets.
- Major Incident Manager
The Major Incident Manager is responsible for coordinating the major incident response team and ensuring that all necessary steps are taken to resolve the incident. They also liaise with other departments, such as customer support, to ensure that customers are kept up to date on the status of the incident.
- Major Incident Team (MIT, also known as the Incident Response Team)
The major incident team (MIT) is a group of individuals with specific expertise who are assigned to resolve the major incident. The team will typically include people from different departments, such as customer support, engineering, and operations.
- Change Manager
The Change Manager is responsible for ensuring that all changes that are made during the major incident are tracked and reverified once the incident is resolved. This is to ensure that no changes were made in error and caused the incident in the first place. In addition, the Change Manager helps to track any lessons learned during the major incident and documents them so
- Problem Manager (Root Cause Analyst)
The Problem Manager is responsible for any issues that arise during an incident. They're in charge of tracking down what went wrong and making sure the organization is prepared when another incident occurs. The Problem Manager's role is extensive; this article covers only the responsibilities tied to the major incident management process.
Now that you understand the roles and responsibilities of major incident management, it's important to know what not to do during a major incident. The following are some common mistakes that organizations make:
- Manual Communication
One of the most common mistakes made during major incidents is relying on manual communication. This can lead to delays in information being shared, which can hamper the incident response.
- Utilizing Ineffective Channels for Reporting
In major incidents, it's important to use the most effective channels for reporting. This might include setting up a dedicated chat communication channel or using an incident management tool. This ensures that all information is centralized and can be easily accessed by the team members who need it.
- Duplicating Efforts
Another common mistake is duplicating efforts during major incidents. This can lead to confusion and wasted time, which can be critical in major incidents. It's important to have a clear understanding of who is responsible for each task and to assign tasks accordingly.
- Poor Documentation
Failing to document major incidents can lead to a number of problems down the line. First, it makes it difficult to track lessons learned and make improvements. Second, it can lead to major incidents being repeated in the future. Finally, it hampers the team's ability to communicate effectively and share information.
- Root Cause Analysis Failure
Root cause analysis is the process of identifying the underlying cause of an incident. It is a key part of major incident management, but it's often overlooked. This can lead to major incidents being repeated in the future and can hinder the team's ability to learn from their mistakes.
Now that you understand the roles and responsibilities of major incident management and what not to do during a major incident, it's time to learn about best practices. The following are some best practices that organizations can implement for managing major incidents:
- Establish Channels For Reporting Major Incidents
This might include setting up a dedicated chat communication channel or using an incident management tool. This ensures that all information is centralized and can be easily accessed by the team members who need it.
- Automate Service Desk Processes
Utilizing automation for service desk processes can help to speed up the major incident management process and reduce the chances of human error. Automation can help to streamline the process and ensure that all tasks are completed in a timely manner.
- Build a Major Incident Management Playbook
In order to manage major incidents effectively, it's important to have an incident response playbook that outlines all of the steps that need to be taken. This will help to ensure that the team is prepared and knows what needs to be done.
- Aim For Quick, Relevant Communication
Quick communication ensures that all the major incident management team members are on the same page. This helps to avoid duplication of effort and ensures that everyone is aware of the latest developments.
- Establish Clear Documentation Standards
This will help to ensure that major incidents are properly documented and that lessons learned can be easily tracked. Proper documentation also helps to communicate information effectively and share it with other team members.
- Utilize an Incident Management Tool
There are a number of incident management tools available that can help to automate and streamline the major incident management process. Utilizing one of these tools can help to improve efficiency and ensure that all tasks are completed in a timely manner.
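As a rough illustration of the playbook practice above, a playbook can start as plain data: an ordered list of steps, each owned by a role. The sketch below uses the incident manager, lead investigator, and lead communicator roles described earlier; the step wording, the `PLAYBOOK` structure, the `unstaffed_roles` helper, and the roster names are assumptions made for this example.

```python
# Each entry is (step, owning role). The steps are illustrative, not a complete playbook.
PLAYBOOK = [
    ("Declare the incident and open a dedicated channel", "incident manager"),
    ("Investigate the issue and determine its severity", "lead investigator"),
    ("Send the first stakeholder update", "lead communicator"),
    ("Verify that key metrics have returned to normal", "lead investigator"),
    ("Complete the incident report", "incident manager"),
]


def unstaffed_roles(playbook, roster):
    """Return the playbook roles that nobody on the roster currently holds."""
    needed = {role for _, role in playbook}
    return sorted(needed - set(roster))


# Hypothetical on-call roster mapping role -> person.
roster = {"incident manager": "Ana"}
print(unstaffed_roles(PLAYBOOK, roster))
# → ['lead communicator', 'lead investigator']
```

Running a check like this before an incident, for example when the on-call schedule changes, surfaces missing role coverage while there is still time to fix it.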
Major incident management is a critical process for ensuring that incidents are rare and handled appropriately. By following the best practices listed above, organizations can improve their major incident management process and avoid common mistakes.
- MTTR (Mean Time To Resolution)
This is one of the most important major incident management KPIs because it's a measure of how quickly your team is resolving major incidents. This number can be used in post-mortems to set goals for future major incident responses.
- MTTA (Mean Time To Acknowledgement)
This metric is a measure of how quickly your team is acknowledging major incidents. Acknowledging an incident means that someone on the team has taken responsibility for it and is working on a resolution. This number should be as low as possible, ideally below five minutes, so that work on a resolution starts quickly.
- MTBF (Mean Time Between Failures)
This metric is a measure of how often major incidents are occurring. A high MTBF means that major incidents are rare, which is the goal. This number can be used to set goals for future major incident responses.
- MTTD (Mean Time To Detect)
This metric is a measure of how quickly your team is detecting major incidents. A low MTTD means that major incidents are being detected quickly, which is the goal. This number can be used to set goals for future major incident responses.
- Increase/Decrease of Major Incidents
This metric is a measure of how major incidents are trending over time. A decrease in major incidents over time is the goal. This number can be used to set goals for future major incident responses.
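The time-based KPIs above are all averages over incident timestamps, so they can be computed from a simple log. The following is a minimal sketch under stated assumptions: the `IncidentRecord` fields and function names are hypothetical, MTTR is measured here from the start of the failure to resolution (some teams measure from detection instead), and MTBF is taken as the average gap between one incident's resolution and the next incident's start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    started: datetime       # when the failure actually began
    detected: datetime      # when monitoring or a person noticed it
    acknowledged: datetime  # when a responder took ownership
    resolved: datetime      # when service was restored


def _mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


def mttd_minutes(records):
    return _mean_minutes([r.detected - r.started for r in records])


def mtta_minutes(records):
    return _mean_minutes([r.acknowledged - r.detected for r in records])


def mttr_minutes(records):
    return _mean_minutes([r.resolved - r.started for r in records])


def mtbf_hours(records):
    """Average time between one incident's resolution and the next one's start."""
    ordered = sorted(records, key=lambda r: r.started)
    gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    return mean([g.total_seconds() for g in gaps]) / 3600


t0 = datetime(2024, 1, 1)
m, h = timedelta(minutes=1), timedelta(hours=1)
records = [
    IncidentRecord(t0, t0 + 4 * m, t0 + 6 * m, t0 + 60 * m),
    IncidentRecord(t0 + 25 * h, t0 + 25 * h + 6 * m, t0 + 25 * h + 10 * m, t0 + 25 * h + 30 * m),
]
print(mttd_minutes(records), mtta_minutes(records), mttr_minutes(records), mtbf_hours(records))
# → 5.0 3.0 45.0 24.0
```

Note that `mtbf_hours` needs at least two incidents, since a single record has no gap to measure. For MTTD, MTTA, and MTTR lower is better, while for MTBF higher is better.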
Achieving these goals will result in fewer major incidents and quicker resolutions when they do occur, making everyone on your team more efficient and effective.
Fortunately for anyone trying to learn from existing incident response protocols before creating their own, there are many public examples of large companies that experienced major incidents and the steps they took in the aftermath.
On January 31, 2017, GitLab suffered a major service outage when data was accidentally removed from their primary database server. In a detailed writeup of the data loss incident, they dissected the root cause of what happened, the steps they were taking to make sure it didn't happen again, and the lasting impact of the incident.
On an even larger scale, on December 7, 2021, Amazon Web Services experienced an outage that impacted many of their customers, including Netflix, Disney+, and Delta Air Lines. Because Amazon's infrastructure powers a large portion of the internet, it was essential for them to have an appropriate incident response plan and to publish their findings after the outage. Such actions let their customers know that they could still depend on Amazon systems.
If major incidents are not handled properly, you might be looking at upset customers, lost revenue, or even a completely broken product that impacts everyone in the organization. Because time is of the essence in these situations, you don’t want to be crafting a plan while the incident is happening. Having a major incident management plan already in place and knowing ahead of time how you’re going to respond means everyone on the team can work together to resolve the incident efficiently and get things back to normal as quickly as possible.
In this article, you’ve learned how to work through a major incident, from detecting it to assembling the incident management team, all the way through to making sure the issue is resolved.
As with any technical problem, having the right tools at your fingertips is important. When you're faced with a major incident, having a tool like Lightstep at your disposal will give you every advantage in handling the incident and getting back to normal as quickly as possible. Lightstep helps you handle schedules and escalations and provides dashboards for understanding and recovering from incidents quickly and reliably. Combining a robust incident management plan with Lightstep means that your team will be better equipped to deal with whatever incidents come their way.
By keeping everyone on the same page, following your incident management plan, and ensuring all the necessary stakeholders are informed and kept up-to-date, you can minimize the impact of even major incidents and keep your application running as smoothly as possible.