What is Alert Fatigue and How to Reduce It
by Dan Woods
Alert fatigue occurs when monitoring systems send so many alerts that responders either ignore them or are too overwhelmed by the volume to respond. Anyone familiar with the “Boy Who Cried Wolf” parable can understand alert fatigue: too many false alarms lead people to ignore real problems, because they’re so used to the alarms being misleading or too numerous to manage.
Alert fatigue, also called alarm fatigue, is alive and well in many aspects of our lives: the car whose “check engine” light comes on no matter how many times the car is serviced, or the fire alarm that eventually gets shut off because everyday home cooking sets it off so often. Alerts only help us if we’re paying attention and their meaning is clear.
However, alert fatigue in the IT realm is more complicated and involves more than alarms simply being ignored. Excessive alarms can also confuse the very people who are supposed to take action based on the alerts.
The question for IT departments and companies is how to deal with alert fatigue. Enterprises want to minimize it because once it sets in, the alerts are clearly no longer doing their job. The challenge is to minimize alert fatigue without removing important notifications that responders actually need. Note the distinction: a company with so many alerts, all of them important, that it cannot respond to them all is facing a crisis, not fatigue. Alert fatigue is when the alerts are so incessant, and so often unimportant, that they end up being ignored. Alert fatigue is a sign that the alerting system has failed: there are so many alerts that users can’t tell what’s actually wrong.
In every realm, medicine and automobiles included, alert fatigue is caused by the overproduction of alerts. But in IT incident response, the problem is often more acute. In many cases, alert fatigue stems from poor monitoring and incident response design, with thresholds set improperly or no feedback loops in place to constantly reassess and improve the way alerts are defined and processed.
When companies experience alert fatigue, the annoyance can be profound, and they may opt to turn off alerts entirely. This is dangerous, however: the alerts were there for a reason, even if they were not functioning optimally, and with alerts turned off, it becomes impossible for companies to understand what’s going on with their systems.
What’s worse is that automation doesn’t always alleviate or prevent alert fatigue. In fact, it can actually contribute to it. If thresholds are set incorrectly, automated alert monitoring can generate even more alerts. Automated systems can also widen the circle of people who experience alert fatigue, since they increase the number of users who receive alerts and can deliver those alerts to every device on which those users are active.
The consequences of alert fatigue, both to individuals and organizations, can be severe. One of the major issues with alert fatigue is that it leads to three types of behavior that result in alerts being ignored, overlooked, misunderstood, or left without a proper response. These include:
- Normalization: When previously unusual or atypical behavior comes to be seen as the standard. In incident response, this occurs when it becomes the norm for copious alerts to go ignored or unanswered.
- Desensitization: When people become insensitive to something that should elicit a response. With alert fatigue, this means accepting as standard operating procedure that there are more alerts than anyone can act on.
- Habituation: When people develop a decreased response to situations or problems that should not be considered normal. In the case of alert fatigue, this means accepting that the number of alerts a company has will be overwhelming rather than acting to change it.
Normalization, desensitization, and habituation are likely familiar to everyone — and not just in alert fatigue situations. For example, binge drinking on college campuses is commonly normalized despite it leading to significant negative consequences for those who drink, as well as those they interact with.
Essentially, all three of these concepts boil down to the same idea: with alert fatigue, companies, and the people who work in them, come to tolerate, normalize, and ignore alerts. Or they miss important alerts because they are inundated with so many that they can’t separate the vital from the less essential. Either way, the alert system has failed, because alerts are going unheeded.
On the individual level, alert fatigue can mean increased burnout for those in incident response and reliability roles; the job is stressful enough as it is, and alert fatigue makes it worse. If an employee logging into a system receives hundreds of alerts that must be manually sorted through to understand what’s going on, they will either be unable to do their real job or come to dread it. Elsewhere in the enterprise, other users may stop paying attention to possible hacks or cyberattacks if they’re experiencing alert fatigue.
Ultimately, the goal of alerting is to have a system that helps teams do their jobs instead of getting in the way. When alert fatigue sets in, the systems are actively hurting the business. Alert fatigue hides the important information instead of revealing it. Because related alerts are not properly grouped together, companies end up unable to surface the information that really matters, namely, the fundamental problems that caused all the alerts to be issued.
Operations teams should strive to trim their alerts to the ones that really matter — i.e., the vital signs — so that they are alerted about these, and not others that just muddy the waters. The goal is an alert system that properly analyzes and groups alerts, highlighting root causes, rather than showing so much duplicated evidence of symptoms that operations teams end up ignoring or missing alerts they should be paying attention to.
So how should operations teams combat alert fatigue? They need processes that give them a full handle on their monitoring and alerting systems, so that they understand the vital signs of their business and have defined normal states and thresholds accurately. They also must understand the dependencies between alerts to fully grasp how the entire system is affected and why an alert was generated in the first place. Most importantly, operations teams must put a continuous improvement process in place, because no enterprise can know the proper thresholds and vital signs right away. It’s a process that involves time and reflection, in which every cycle leads to improvement.
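As one illustration of accurate thresholds and continuous refinement, the sketch below (with hypothetical values and a hypothetical `AlertRule` class, not any particular product’s API) derives a baseline from recent observations and uses a hysteresis band, so a metric hovering near the threshold doesn’t flap between alerting and recovering:

```python
from collections import deque

class AlertRule:
    """Threshold alert with a hysteresis band to prevent flapping.

    Fires when the metric exceeds `trigger`; only clears once the
    metric drops below `clear` (clear < trigger). The numbers used
    below are illustrative, not recommendations.
    """

    def __init__(self, trigger: float, clear: float, window: int = 60):
        assert clear < trigger, "clear threshold must sit below trigger"
        self.trigger = trigger
        self.clear = clear
        self.firing = False
        self.history = deque(maxlen=window)  # recent samples for baseline review

    def observe(self, value: float) -> bool:
        """Record a sample; return True only when a new alert fires."""
        self.history.append(value)
        if not self.firing and value > self.trigger:
            self.firing = True
            return True          # one alert at the transition, not one per sample
        if self.firing and value < self.clear:
            self.firing = False  # recovered; no alert on the way down
        return False

    def baseline(self) -> float:
        """Average of recent samples -- input to periodic threshold review."""
        return sum(self.history) / len(self.history) if self.history else 0.0


cpu = AlertRule(trigger=90.0, clear=75.0)
samples = [50, 92, 91, 89, 88, 74, 93]
fired = [s for s in samples if cpu.observe(s)]
print(fired)  # [92, 93] -- 91, 89, 88 stay inside the band, no repeat alerts
```

The `baseline()` average is the hook for the feedback loop described above: reviewing it periodically tells the team whether the configured thresholds still match what “normal” actually looks like.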
Ultimately, alert fatigue comes not just from the alerts themselves, but also from responding to them. Better preparation and additional contextual information about alerts can help users respond more precisely when alerts do arise. Improved automation of alerts can also help, as users will know their efforts will be rewarded and that their work isn’t in vain.
To achieve this, operations teams should follow a process such as this:
- Preparing for alert management
- Connecting to sources of alerts
- Defining the thresholds that trigger alerts
- Designing the information included in an alert
- Monitoring those sources for alerts
- Defining how an alert will be enhanced with additional information
- Categorizing alerts based on where they are generated and what they indicate about what is occurring
- Associating alerts to various systems
- Processing alerts once they occur:
  - Routing the alert to the right team
  - Enhancing the alert with additional information and context
  - Grouping alerts that appear to be related or associated with specific systems, and deduplicating alerts that describe the same issue
  - Creating tickets that assign response activities to individuals or teams
  - Associating alerts with incidents so that multiple alerts related to a single incident or issue are brought together
  - Associating alerts with runbooks, as runbooks can help companies resolve specific issues
  - Notifying responders about specific alerts so that they can respond
  - Presenting responders with alert analytics so that, when they receive a ticket, they understand how to respond
  - Executing automated responses
  - Updating and interpreting analytics
  - Letting the entire company know about an outage with a mass notification
- Responding after an alert happens:
  - Assembling data for a post-mortem
  - Ensuring alerts are entered into an alert database so that learning and the identification of trends can happen over the long term
  - Creating suggestions for improvement: whether thresholds need adjusting, runbooks need updating, or alerts were routed to the right people and teams
Companies that are ready to take charge of their alerting systems and avoid alert fatigue should learn more about Lightstep Incident Response.