Driving Effective Incident Management with Escalation Policies
In any organization, it’s crucial to have a plan in place for unexpected problems. A proper incident management plan explains how to determine that an incident is taking place, who in the organization to contact for help, and what steps are needed to quickly investigate and resolve the problem.
What is an Escalation Policy?
An escalation policy is a set of rules that define how and when an incident should be escalated. The goal of an escalation policy is to ensure that incidents are resolved in a timely manner by the appropriate personnel.
The escalation process, meaning when other people or teams need to get involved and why, is often overlooked. For instance, if your software has developed a bug or suffered a security breach but the on-call team members aren’t able to fix the problem, escalating it to the right people ensures that it will be resolved as smoothly as possible. If you create an escalation matrix, your team members will have a guide to follow when problems occur.
This article will explain the escalation policies, paths, and levels you need to have in place to handle your organization’s software emergencies.
Why Have an Escalation Policy?
When an incident is serious enough to impact key performance indicators (KPIs) for your organization, other teams or individuals need to be notified of the situation or brought in to help address it. KPIs for your escalation matrix might include uptime (your website being down), revenue (your application not being able to process payments), or processing speed (functions of your application not running as quickly as they normally would).
When a KPI that’s important to the business has been affected, you need to escalate the incident. In this case, escalation accomplishes a few things:
Accessing Critical Resources: As soon as your incident expands to involve other departments, you might not have the necessary access or permissions to fully resolve it. You need to get team members involved who do have that access, especially if the issue is affecting their department.
Keeping Everyone Aware: Even if an incident doesn’t involve others in the company directly, it might have an impact on external communication or other job duties. This means they need to be informed about what’s going on, especially with more serious issues.
Updating Necessary Documentation: Once an incident is resolved, team members need to be able to update any relevant documentation or procedures to reflect lessons learned. To do this properly, they need to have all necessary information. This can wait until after the incident management process has been completed.
Best Practices for Developing Escalation Policies
In order to properly handle incident escalations, you must implement clearly defined policies and escalation paths based on how high the escalation needs to go. When creating your escalation policy, consider the following factors:
KPI Monitoring: What KPIs decide when an incident needs to get escalated? Is it when customer data becomes affected, for instance, or when payments can’t be processed? Keep in mind that your criteria for an internal system might differ from those of an external system.
Process Automation: Is any part of the incident detection and escalation process automated? Sometimes an automated system detects a problem with your code and launches the incident management process. In this case, you must define how severe an alert from that system must be to trigger an incident. You’ll also need to define the steps in the escalation process, as well as whether it requires manual approval or can proceed automatically.
Incident Recording: How are escalation incidents recorded? You’ll need to define some metrics around this. For example, if every incident that is escalated requires an incident report written by the team member who first responded, all team members must know this so that the proper data can be collected from the beginning. You’ll also need to determine how the post-incident report is shared. This might involve anything from a dedicated Slack channel to a shared file store; the incident management team might even have to share their report in a post-mortem meeting so everyone can learn from the incident response.
Types of Escalation Policies
There are several types of escalation policies. Which one you use will depend on the needs of your organization. The following are the most common:
Hierarchical Escalation Policy: A hierarchical escalation policy is defined by team members’ seniority. Junior members of the team, who are generally handling the on-call or monitoring duties, follow guidelines about when to escalate an issue to more senior team members for additional triage or investigation.
Functional Escalation Policy: A functional escalation policy is defined by escalation to the specific team members who are best suited to deal with the problem. For example, if an incident is identified by a frontend developer but is revealed to be an infrastructure issue, the incident could be escalated to the infrastructure team, even if those on call are of similar seniority.
Automatic Escalation Policy: An automatic escalation policy is defined by thresholds that determine whether an incident should be escalated. For example, some automatic escalation policies determine that an incident should be escalated if it hasn’t been resolved after a certain amount of time or if it’s more severely affecting other KPIs.
The right model is the one that makes the most sense for your organization based on its structure and the makeup of your teams. You can also combine models as needed.
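To make the automatic model more concrete, here is a minimal Python sketch of a threshold check. It is only an illustration: the `Incident` fields, the 30-minute limit, and the 5 percent error-rate limit are all hypothetical values chosen for the example, not settings from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened_at: datetime
    resolved: bool
    error_rate: float  # hypothetical KPI: fraction of failing requests, 0.0 to 1.0


# Hypothetical thresholds; real values depend on your own KPIs.
MAX_UNRESOLVED = timedelta(minutes=30)
MAX_ERROR_RATE = 0.05


def should_escalate(incident: Incident, now: datetime) -> bool:
    """Return True when either automatic threshold has been crossed."""
    if incident.resolved:
        return False
    open_too_long = now - incident.opened_at >= MAX_UNRESOLVED
    kpi_too_degraded = incident.error_rate >= MAX_ERROR_RATE
    return open_too_long or kpi_too_degraded
```

A check like this could run on a schedule against every open incident, with a hierarchical or functional policy then deciding who actually receives the escalation.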
Escalation Paths
An escalation path is a part of the escalation policy that identifies who an incident needs to be escalated to under various circumstances. Once the decision to escalate has been made, it needs to be implemented. That’s where the concept of an escalation path comes in.
In many cases, an escalation path involves a call tree, or a specific list of people who need to be notified in order to escalate an incident. If a junior engineer is on call, for instance, a more senior engineer might be next up on the call tree. If further escalation is needed, the next person might be an engineering manager or VP of engineering. The call tree extends as far up the company as possible, so even the most severe problems can be addressed.
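A call tree like the one described above can be sketched as an ordered list. The roles below mirror the example in the text, while the `CALL_TREE` name and the `next_contact` helper are hypothetical and exist only for illustration.

```python
from typing import Optional

# Hypothetical call tree, ordered from the first responder upward.
CALL_TREE = [
    "on-call junior engineer",
    "senior engineer",
    "engineering manager",
    "VP of engineering",
]


def next_contact(current: Optional[str]) -> Optional[str]:
    """Return who to notify next, or None when the call tree is exhausted."""
    if current is None:
        return CALL_TREE[0]
    position = CALL_TREE.index(current)  # raises ValueError for unknown contacts
    if position + 1 < len(CALL_TREE):
        return CALL_TREE[position + 1]
    return None
```

Returning `None` at the top of the tree keeps the end of the path explicit, so the caller has to decide what happens when nobody is left to notify.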
The other important factor is knowing when to escalate. Your escalation procedure, whether handled automatically or at a team member’s discretion, should specify thresholds for this based on KPIs and other metrics. This ensures that only important incidents get escalated and no one suffers from “alert fatigue” caused by constantly dealing with escalations that could have been handled another way.
Escalation Levels
Depending on the severity of the incident, it may need to be escalated to a different level of the company. Different escalation triggers might force an incident to be promoted to a different escalation level.
For example, if one of your escalation triggers is an impact on production-level site traffic, the site going down completely is a high-priority or major incident, while a bug that only slows down ten percent of site traffic is less of a problem. In this case, the full outage would likely be escalated to the highest level, while the lesser bug would get a lower priority inside the organization.
As part of your escalation policy, it’s important to add context around the escalation triggers you’ve defined for your organization. Doing so makes it clear when an incident needs to be escalated even higher, which helps ensure that incidents are handled as efficiently as possible.
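The site traffic example can be written down as a small mapping from trigger to level. This is a sketch under assumptions: the three-level scale and the 50 percent cutoff are hypothetical, and your own policy would define its own levels and thresholds.

```python
def escalation_level(site_down: bool, traffic_slowed_pct: float) -> int:
    """Map impact on production traffic to an escalation level (3 is highest).

    The levels and the 50 percent cutoff are hypothetical examples.
    """
    if site_down:
        return 3  # full outage: escalate to the highest level
    if traffic_slowed_pct >= 50.0:
        return 2  # most traffic degraded: escalate one level up
    return 1  # limited impact: stays with the first responders
```

Under these assumed thresholds, a full outage maps to level 3, while a bug that slows ten percent of traffic stays at level 1.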
Establishing a Clear Process
Organizations must implement an incident escalation policy so that software problems or emergencies can be managed properly. When creating your own escalation matrix, you need to define the types of incidents for escalation, the team members who should be involved, and the procedures needed to solve the problem. With these policies and procedures in place, the incident management team can work quickly and efficiently while keeping any relevant stakeholders informed.
Incident responses can always be handled manually; however, you could save time and effort by leaving these tasks to a platform like Lightstep. Lightstep provides a speedy, comprehensive incident response solution, including on-call, collaboration, and automation. It’s cloud-agnostic and integrates with a variety of languages, libraries, and CI/CD workflows.
Sign up for a free Lightstep trial to find out how the platform can help you manage and fix software problems.