How to Define and Track Incident Management KPIs

In order to effectively manage and respond to incidents, it is important to track key performance indicators (KPIs). Without proper KPIs, incident management can become difficult and time-consuming. In this guide, we will discuss some of the most important KPIs in incident management and how to report on them. By understanding and tracking these KPIs, you can improve your incident response times and ensure that your organization is running smoothly.

The Value of Incident Management KPIs and Metrics

Without clearly defined incident management metrics, the incident management process can feel like constant firefighting, with little improvement to the underlying systems between incidents. However, once KPIs are being measured and everyone on the team can see and track them over time, it’s much easier to get team buy-in to tackle longer-term improvements, as they can now see the tangible effect on the incident management processes.

As an example, let's say the business's goal is to resolve all incidents within 1 hour, but the team's current average is about 1 hour and 30 minutes. Without KPIs, it can be difficult to know where that extra half hour is going.

Add in some metrics, however, and the picture becomes much clearer (a small sketch of this kind of breakdown follows the list):

  • If you already know how long the incident management system's alerts take, you can rule that time out.

  • If diagnostics take up more than 50% of the time, you know to focus your troubleshooting improvements there.

  • If Team D is taking 25% more time than the other teams, you can begin to discover why that is.
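
As a rough illustration, here is a minimal Python sketch that splits a single incident's duration into phases to show where the time went; the phase names and durations are hypothetical:

    # Break one incident's duration into phases to see where the time goes.
    # Phase durations (in minutes) are hypothetical.
    phases = {"alerting": 5, "diagnosis": 50, "fix and verification": 35}

    total_minutes = sum(phases.values())
    for phase, minutes in phases.items():
        print(f"{phase}: {minutes} min ({100 * minutes / total_minutes:.0f}% of the incident)")

Run across many incidents, a breakdown like this makes it obvious when diagnosis dominates and is the place to invest in better troubleshooting.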

Visibility into the Pain Points of a System

Measuring KPIs helps teams examine which areas of their system are truly pain points and where improvements need to be made. For example, if there’s a long delay between an incident alert and the start of your team’s response, you might need to look at the effectiveness of the alerting system or whether the incident management team has the right tools to do their job.

Incident Response Effectiveness

Incident response is most effective when an incident is handled quickly and efficiently and involves all stakeholders needed for resolution. By tracking and measuring KPIs over time, the incident response team can start to understand which issues have effective responses associated with them and which types of issues require an improved response.

Better Decision-Making

Running an incident management process is all about decisions. Who needs to be involved? Should you do whatever it takes to resolve the incident quickly and leave the longer-term fix for the postmortem? How much time and money will each potential solution cost? All of these decisions can feel arbitrary, and making the right choice can be challenging.

By tracking KPIs in the incident management process, you can get these questions answered, as you’ll have insight into which choice will provide a more favorable outcome.

Stability and Reliability

Tracking KPIs also helps improve the overall stability and reliability of the system. Decisions made using KPIs like the mean time between failures are more likely to lead to changes to the overall system to ensure it fails less often. As these stability challenges are tackled, your KPIs will improve, contributing to the overall cycle of improving the reliability of the system.

Types of Incident KPIs

There are endless different metrics that could be tracked, but ideally, organizations should focus on four to ten KPIs. While tracking more might be tempting, you could end up with too much information, making it difficult to communicate results and narrow down what’s relevant.

Limiting your number of KPIs allows you to focus on what’s important and see real movement in your chosen metrics. With that in mind, the following sections look at some KPIs your incident management team might want to track.

Mean Time Between Failures (MTBF)

MTBF is a metric that quantifies how often a system is failing unexpectedly or having problems that escalate to the level of an outage. Notably, it does not include scheduled downtime or maintenance. Often, MTBF is used as a metric that reflects the overall stability of a system or application: a more stable system will have fewer outages, and thus, a higher MTBF.

To calculate MTBF, you first need to define the period you’re measuring. For example, if you focus on the last 24 hours, and you had three incidents with 2 hours of unexpected downtime each within that period, your total uptime out of the 24 hours was 18 hours (not great!).

Written out as an equation, that’s:

(Total time – total amount of downtime) / total number of incidents = MTBF

Or, as in the example, (24 hours – 6 hours) / 3 = 6 hours MTBF

This means that our incident management team is going to be very busy and time should be spent investigating why the system is experiencing unexpected downtime that frequently.
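
If you want to script this calculation rather than work it out by hand, a minimal Python sketch might look like the following; the observation window and per-incident downtime are the hypothetical values from the example above:

    # MTBF over a fixed observation window, using the figures from the example.
    observation_window_hours = 24
    incident_downtime_hours = [2, 2, 2]  # three incidents, 2 hours of unexpected downtime each

    total_downtime = sum(incident_downtime_hours)
    uptime_hours = observation_window_hours - total_downtime
    mtbf_hours = uptime_hours / len(incident_downtime_hours)

    print(f"MTBF: {mtbf_hours:.1f} hours")  # (24 - 6) / 3 = 6.0 hours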

Mean Time to Detect (MTTD)

MTTD is the amount of time between when an incident or problem occurs in the system and when it's detected. This is a very important value to keep as low as possible because the longer bugs or issues exist in a system, the more problems they could cause.

MTTD is a good metric for measuring your overall operations tooling as well. If you have good observability, monitoring, and alerting tools in place, your MTTD will naturally be lower, as you’ll be able to detect issues and respond to incidents faster.

Time to detect is calculated as a straightforward subtraction: take the time at which the incident was discovered and subtract the time at which the incident occurred (the latter may need to be determined after the incident management process has been completed).

If an incident was discovered at 4:45 p.m. and, after a retrospective, was determined to have occurred at 4:00 p.m., then the time to detect that incident was 45 minutes. For MTTD, you total the amount of time it took to detect each incident in a defined period and then divide that result by the number of incidents.

For example, if you have three incidents in your selected period that took 45 minutes, 22 minutes, and 35 minutes to detect, your total time to detect is 102 minutes. You then divide this by the total number of incidents, three, giving you an MTTD of 34 minutes.

In other words, that’s:

Total time to detect incidents / total number of incidents = MTTD

Or, as in the example, (45 minutes + 22 minutes + 35 minutes) / 3 = 34 minutes MTTD
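
In practice, you would usually compute the per-incident detection times from timestamps rather than by hand. A minimal Python sketch, using hypothetical occurred/detected timestamps that reproduce the 45-, 22-, and 35-minute delays above:

    from datetime import datetime

    # (occurred, detected) timestamps per incident; the values are hypothetical.
    incidents = [
        ("2022-10-11 16:00", "2022-10-11 16:45"),  # 45 minutes to detect
        ("2022-10-11 09:10", "2022-10-11 09:32"),  # 22 minutes to detect
        ("2022-10-11 13:05", "2022-10-11 13:40"),  # 35 minutes to detect
    ]

    fmt = "%Y-%m-%d %H:%M"
    detect_minutes = [
        (datetime.strptime(detected, fmt) - datetime.strptime(occurred, fmt)).total_seconds() / 60
        for occurred, detected in incidents
    ]

    mttd_minutes = sum(detect_minutes) / len(detect_minutes)
    print(f"MTTD: {mttd_minutes:.0f} minutes")  # (45 + 22 + 35) / 3 = 34 minutes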

Mean Time to Acknowledge (MTTA)

MTTA tracks how much time elapses between the triggering of an alert and the incident management team acknowledging they’re working on a fix. MTTA is useful for determining how responsive your incident management team is to new issues and how well your alert systems are working.

For example, your system will have a high MTTA if your alerts aren’t sent to the right people because the alert will have to be escalated or re-routed internally before work on the issue can begin.

To calculate MTTA, you again select a time period that's relatively representative of your overall system. For example, if you know you just went through a period of extremely high traffic that had everyone stretched thin and working below their typical capacity, this might not be a great period to select. However, if all your other metrics indicate that the period you’re observing was relatively typical for your system, then it’s probably a good period to observe.

Within your selected period, you total the time between alerts being triggered and your team acknowledging they’re working on a solution, which you then divide by the total number of incidents. For example, if you had 3 incidents during your observation period with 23 minutes, 30 minutes, and 10 minutes between alerts and the start of work respectively, that would total 63 minutes. These 63 minutes, divided by 3 (the number of incidents), give you an MTTA of 21 minutes.

As an equation, that’s:

Total time between alerts being triggered and work starting / total number of incidents = MTTA

Or, as in the example, (23 minutes + 30 minutes + 10 minutes) / 3 = 21 minutes MTTA
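
The same pattern works in code. A minimal Python sketch, using the hypothetical alert-to-acknowledgement times from the example above:

    # Minutes between each alert firing and the team starting work; values from the example.
    ack_minutes = [23, 30, 10]

    mtta_minutes = sum(ack_minutes) / len(ack_minutes)
    print(f"MTTA: {mtta_minutes:.0f} minutes")  # (23 + 30 + 10) / 3 = 21 minutes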

Mean Time to Recovery (or Repair, Respond, or Resolve) (MTTR)

MTTR can be a confusing metric because it actually measures different things depending on what your team takes the “R” in MTTR to mean. For the purposes of this article, this section will focus on mean time to recovery, as it’s one of the more common metrics among teams.

The mean time to recovery is the average time it takes to recover from an incident or outage, measured from when the incident first occurs to when normal operation is restored. It’s a useful metric to measure the effectiveness of incident management teams because it gives an overview of the entire incident.

As incidents are resolved more quickly, MTTR will decrease, thus showing that incidents are being handled more efficiently.

To calculate MTTR, you first need to add up the total amount of downtime over a given period. Let's say we had 3 incidents where the system was down for 15, 45, and 90 minutes. If we add this time, we get a total of 150 minutes of downtime over the period we observed. When we divide that by the 3 incidents that occurred, we get an MTTR of 50 minutes, meaning it took on average 50 minutes to recover from an incident.

As an equation, that’s:

Total downtime in the period / total number of incidents = MTTR

Or, as in the example, (15 minutes + 45 minutes + 90 minutes) / 3 = 50 minutes MTTR
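
And again as a minimal Python sketch, using the hypothetical downtime figures from the example above:

    # Minutes of downtime per incident; values from the example.
    downtime_minutes = [15, 45, 90]

    mttr_minutes = sum(downtime_minutes) / len(downtime_minutes)
    print(f"MTTR: {mttr_minutes:.0f} minutes")  # (15 + 45 + 90) / 3 = 50 minutes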

On-call Time

On-call time is one of the most important incident management KPIs. It measures how long responders are on call and available to address incidents. The longer responders are on call, the more stretched they become and the less time they have for other work. This can lead to responders being less effective when they are actually needed, which can in turn lead to longer MTTRs and more downtime.

During an on-call rotation, it can be useful to track how much time team members spend on call. This helps ensure no single employee or team is overworked.
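
One lightweight way to do this is to total on-call hours per responder across a rotation and look for imbalances. A minimal Python sketch, with hypothetical shift data:

    # (responder, hours on call) per shift; the data is hypothetical.
    shifts = [("alice", 12), ("bob", 24), ("alice", 12), ("carol", 8), ("bob", 24)]

    hours_by_responder = {}
    for responder, hours in shifts:
        hours_by_responder[responder] = hours_by_responder.get(responder, 0) + hours

    for responder, hours in sorted(hours_by_responder.items(), key=lambda item: -item[1]):
        print(f"{responder}: {hours} hours on call")

If one name consistently tops this list, that's a signal to rebalance the rotation before fatigue starts to affect response quality.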

Service Level Standards

Measuring your incident management KPIs is important, but to ensure that your team can work towards improving performance, you need to set appropriate and reasonable standards for their responses.

Service level agreements, objectives, and indicators define how you and your team promise to deliver services to your stakeholders, the specific steps needed to keep those promises, and how to measure whether you’re fulfilling those promises as you manage incidents.

Service Level Agreements (SLA)

A service level agreement is an agreement between the incident management team and other stakeholders on what level of response can be expected. These agreements usually cover response SLAs (the maximum time before the incident management team responds) and resolution SLAs (the maximum time before an incident is resolved).

This is important information to agree on because it keeps everyone on the same page with regard to the proposed incident management timeline.

Different levels of incidents often come with different SLAs. For example, a Priority 1 incident might have a response SLA of less than 30 minutes, while a lesser issue might have a bit more leeway in terms of response and resolution time.

Service Level Objectives (SLO)

SLOs are a subset of SLAs in that they’re specific objectives that your team agrees on in order to deliver on the promise of your SLAs.

It's important that SLOs are simple and clear so that everyone on the team can follow them and not have them distract from actually managing the incident at hand.

At the same time, don't go overboard with SLOs, as too many will confuse the team and make the incident management process less efficient. They should be reserved for client-facing issues that are specifically mentioned in SLAs.

Service Level Indicators (SLI)

While SLAs are an agreement with your customers and SLOs are the specific ways you deliver on that agreement, an SLI is an actual measurement over a given time period that shows whether your team is meeting the agreement set out in the SLA.

For example, if you have an SLA for overall system uptime of 99% and your SLOs are in line with this, then your SLI will be an actual measure of the uptime your system had over a given period. If it's less than 99%, then changes need to be made to bring your system back in line with the agreed-upon SLA.
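
As a rough sketch of what that measurement might look like in code (the window length and downtime figure are hypothetical), a Python version could be:

    # An uptime SLI over a 30-day window, compared against a 99% target.
    period_minutes = 30 * 24 * 60   # observation window
    downtime_minutes = 520          # measured downtime in that window (hypothetical)

    sli_uptime_pct = 100 * (period_minutes - downtime_minutes) / period_minutes
    slo_target_pct = 99.0

    print(f"Uptime SLI: {sli_uptime_pct:.2f}% (target: {slo_target_pct}%)")
    if sli_uptime_pct < slo_target_pct:
        print("Below target: changes are needed to get back in line with the SLA.")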

Incident Management KPIs Report

Measuring these KPIs and sharing the results is important for communicating performance improvements and degradations alike to key stakeholders. While the exact format varies by organization, a straightforward report showing all KPIs should be able to convey the current goals and priorities of your organization. In this effort, including historical figures for context on your baseline is particularly important.

For example, without a baseline, it's impossible to know if a 30-minute MTTD is good. If your MTTD last month was 60 minutes, then it's a demonstrably great improvement; however, if your organization's MTTD is usually under 10 minutes, then something has gone seriously wrong.
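
A report can make this comparison explicit by always showing the change against the baseline. A minimal Python sketch, with hypothetical figures matching the example above:

    # Compare this period's MTTD against a historical baseline.
    baseline_mttd_minutes = 60   # last month's MTTD
    current_mttd_minutes = 30    # this month's MTTD

    change_pct = 100 * (current_mttd_minutes - baseline_mttd_minutes) / baseline_mttd_minutes
    print(f"MTTD: {current_mttd_minutes} minutes ({change_pct:+.0f}% vs. baseline)")  # -50% is a big improvement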

Final Thoughts

Measuring how your incident management process is working is very important, especially over time, as you need to know whether it's improving and becoming more efficient or getting worse. This article looked at some of the key metrics that can be used as part of the incident management process to track these changes.

Ensuring that data is collected consistently and accurately can be challenging. Using a platform like Lightstep can keep your team on the same page when it comes to incident response. It can also help them track all the necessary metrics and recover from incidents faster and more efficiently.

Using Lightstep’s integrations and customizable notifications, you can make sure your entire technology stack works together for seamless incident management, allowing you to get your service back to normal even more quickly.

October 11, 2022
12 min read
DevOps Best Practices

About the author

Keanan Koppenhaver
