What Is Alert Fatigue And How To Reduce It
by Dan Woods
Alert fatigue occurs when monitoring systems send so many alerts that those alerts are ignored or become so overwhelming in volume that they cannot be responded to. Anyone familiar with the “Boy Who Cried Wolf” parable can understand alert fatigue — too many false alarms can lead to people ignoring problems because they’re so used to the alarms being misleading or too abundant to manage.
Alert fatigue, also called alarm fatigue, is alive and well in many aspects of our lives: the "check engine" light that stays on no matter how many times the car is serviced, or the fire alarm that eventually gets disconnected because everyday home cooking sets it off so often. Alerts only help us if we're paying attention and their meaning is clear.
However, alert fatigue in the IT realm is more complicated; it involves more than alarms simply being ignored. Excessive alarms can also confuse the very people who are supposed to take action based on the alerts.
The question for IT departments and companies is how to reduce alert fatigue, or avoid it altogether. Enterprises want to minimize alert fatigue because once it sets in, the alert management system is no longer doing its job. The challenge is to minimize alert fatigue without removing the important notifications that responders actually need.
Alert fatigue is not when a company has so many alerts, all of which are important, that it just can't respond to all of them — that's a crisis, not fatigue. Instead, alert fatigue is when the alerts are so incessant, and so often unimportant, that they end up being ignored. Alert fatigue is a sign of failure of the alerting system: companies have so many alerts that users can't tell what's actually wrong.
In all realms, including medicine and automobiles, alert fatigue is caused by an overproduction of alerts. But in IT incident response, the problem is often more acute. In many cases, alert fatigue stems from poor alert monitoring and incident response design, with thresholds set improperly or no feedback loops in place to constantly reassess and improve the way alerts are set and processed.
When companies experience alert fatigue, the annoyance can be profound, and they may opt to turn off alerts entirely. This can become dangerous, however, as the alerts were there for a reason, even if they were not functioning optimally. Consequently, with alerts turned off, it becomes impossible for companies to understand what’s going on with their systems.
Automation Can Help, or Hurt, Alert Fatigue
What's worse is that automation doesn't always alleviate or prevent alert fatigue; it can actually make it worse. If thresholds are set incorrectly, automated alert monitoring can generate even more alerts. Automated systems can also expand the number of people who experience alert fatigue, since they increase the number of people who receive alerts and can deliver those alerts to every device on which those users are active.
The psychological consequences of alert fatigue — both to individuals and organizations — can be severe. One of the major issues with alert fatigue is that it leads to three types of behavior that result in alerts being ignored, overlooked, misunderstood, or left without a proper response: normalization, desensitization, and habituation.
Normalization: Accepting the Worst of Alert Fatigue
When previously unusual or atypical behavior comes to be seen as the standard. In the case of incident response and alert fatigue, this occurs when copious alerts going ignored or unanswered becomes the norm.
Desensitization: Ignoring Important Alerts
When people become insensitive to something that should elicit a response. With alert fatigue, this means accepting as standard operating procedure that there are so many alerts no one can act on them.
Habituation: Accepting Bad Alert Management Habits as Normal
When people develop a decreased response to situations or problems that should not be considered normal. In the case of alert fatigue, this means accepting that the number of alerts a company has will be overwhelming rather than acting to change it.
Normalization, desensitization, and habituation are likely familiar to everyone — and not just in alert fatigue situations. For example, binge drinking on college campuses is commonly normalized despite it leading to significant negative consequences for those who drink, as well as those they interact with.
Essentially, all three of these concepts boil down to the same idea: with alert fatigue, companies, and the people who work in them, come to tolerate, normalize, and ignore alerts. Or they end up missing important alerts because they are inundated with so many that they can't separate the vital from the less essential. Either way, the alert system has failed, because alerts are going unheeded.
On the individual level, the consequences of alert fatigue include increased burnout for those in site reliability and incident response. The job is already stressful, and alert fatigue makes it worse. If an employee logging into a system faces hundreds of alerts that must be manually sorted through to understand what's going on, they end up unable to do their real job, or dreading it. Elsewhere in enterprises, users experiencing alert fatigue may fail to notice possible hacks or cyberattacks.
Ultimately, the goal of alert management is to have a system that helps teams do their jobs instead of getting in the way. When alert fatigue sets in, the systems are actively hurting the business. Alert fatigue hides the important information instead of revealing it. Because related alerts are not properly grouped together, companies end up not being able to surface the information that really matters, namely, the fundamental problems that have caused all the alerts being issued.
Operations teams should strive to trim their alerts to the ones that really matter — i.e., the vital signs — so that they will be alerted about these, and not others that just muddy the water. The goal is an alert system that properly analyzes and groups alerts, highlighting root causes, rather than showing so much duplicated evidence of symptoms that operations teams end up ignoring or missing alerts they should be paying attention to.
Setting Alert Thresholds
Alert thresholds determine when an alert is created. The job of an alert is to let the operations team know that something important happened. Sometimes alerts simply report that something abnormal occurred that may or may not be a problem. Other times alerts signal a critical failure that must be addressed immediately. The right threshold for an alert should be set based on an understanding of both the historical behavior and the operational architecture of a system. Using judgment is the first step in determining alert thresholds. The second step is adjusting them in response to experience, increasing or decreasing thresholds as needed. Finally, alerts must keep up with changes to the architecture of the system.
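The "historical analysis first, then adjust with judgment" approach above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical CPU-utilization metric and a mean-plus-standard-deviations starting point; the `sigmas` multiplier stands in for the judgment call that experience later refines.

```python
import statistics

def suggest_threshold(history, sigmas=3.0):
    """Suggest a starting alert threshold from historical samples.

    Uses mean + `sigmas` standard deviations as a first guess;
    the multiplier is a judgment call to be tuned over time.
    """
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return mean + sigmas * stdev

# Hypothetical CPU-utilization samples (percent) from past monitoring.
cpu_history = [42, 45, 40, 47, 44, 43, 46, 41, 45, 44]
threshold = suggest_threshold(cpu_history)
```

From here, the feedback loop is the important part: if the threshold fires too often on non-problems, raise `sigmas`; if real incidents slip under it, lower it.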
Alerts can be informational, indications of a potential problem, or the equivalent of a fire alarm, with many shades of gray in between. Ideally, the alerts and the system for automated analysis and response are designed and evolved together. Informational alerts can be cataloged for future analysis. A cluster of alerts about a potential problem can be analyzed and routed to the right team to get ahead of it. Fire-alarm alerts can let everyone at all levels of the business know so action can be taken. The right tiers for alerts are based on the sophistication of the response: the more people and specialties involved, the more tiers there should be.
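The tiers described above can be made concrete as a small policy table. This is a sketch only; the tier names and routing strings are illustrative, not drawn from any particular product.

```python
from enum import Enum

class Tier(Enum):
    INFO = 1      # cataloged for future analysis
    WARNING = 2   # potential problem, routed to the owning team
    CRITICAL = 3  # the fire alarm: everyone at all levels is told

# Hypothetical routing policy: each tier maps to an escalation action.
ROUTING = {
    Tier.INFO: "log to alert database",
    Tier.WARNING: "notify on-call engineer",
    Tier.CRITICAL: "page team and send mass notification",
}

def route(tier):
    """Return the escalation action for an alert of the given tier."""
    return ROUTING[tier]
```

As the response organization grows more sophisticated, more tiers and more nuanced routing actions can be added to the table without changing the surrounding code.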
Alerts should be tagged so they can be grouped in as many ways as possible. Alerts usually have a time stamp, an indicator of the system and component being tracked, an indicator of geography, and possibly other tags that show key factors determined by the alert designer. Alerts should also have relevant detail about the context of the system producing the alert, including key vital signs. With all this information in place, an incident response system can group alerts so that the SREs and operations teams can quickly understand what’s happening.
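The tagging-and-grouping idea above can be sketched with a few lines of Python. The alert field names here (`ts`, `system`, `component`, `region`) are hypothetical examples of the tags the paragraph describes, not a real product schema.

```python
from collections import defaultdict

# Hypothetical alert records carrying a timestamp and several tags.
alerts = [
    {"ts": 1000, "system": "checkout", "component": "db", "region": "us-east"},
    {"ts": 1002, "system": "checkout", "component": "db", "region": "us-east"},
    {"ts": 1005, "system": "search", "component": "cache", "region": "eu-west"},
]

def group_alerts(alerts, *tags):
    """Group alerts by any combination of tag fields."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert[t] for t in tags)
        groups[key].append(alert)
    return dict(groups)

# Group by system and component to see which subsystems are noisy.
by_system = group_alerts(alerts, "system", "component")
```

Because any combination of tags can serve as the grouping key, the same alerts can also be regrouped by region or time window when a different view of the incident is needed.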
Automating Alert Responses
The automation of responses to alerts is where victory over alert fatigue is achieved. When lots of alerts first arrive, the task is to automate their grouping and analysis. Then, by tracking responses, it becomes possible to spot repeated actions and automate them. At first the automations are simple, just a single task like restarting a server. But then patterns of those simple tasks emerge, and several steps can be executed at once. In this way a large number of alerts can be understood and handled, and only the really weird cases end up being analyzed manually. Victory over alert fatigue comes not from shutting down the production of alerts but from creating an assembly line that can process most of them.
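The assembly-line idea above can be sketched as a dispatch table: known alert signatures run an automated action, and anything unrecognized is escalated for manual analysis. The signatures and action functions are hypothetical examples, not a real runbook library.

```python
def restart_service(alert):
    """Simplest automation: a single task, like restarting a server."""
    return f"restarted {alert['component']}"

def clear_cache(alert):
    return f"cleared cache on {alert['component']}"

# Hypothetical mapping from (component, alert kind) to an automated action,
# built up over time by observing which manual responses repeat.
AUTOMATIONS = {
    ("db", "connection_pool_exhausted"): restart_service,
    ("cache", "hit_rate_low"): clear_cache,
}

def handle(alert):
    """Run the automated response if one exists; otherwise escalate."""
    action = AUTOMATIONS.get((alert["component"], alert["kind"]))
    if action is None:
        # The really weird cases still get analyzed by a human.
        return "escalate: manual analysis required"
    return action(alert)
```

As patterns of simple tasks emerge, entries in the table can be replaced with multi-step functions, growing the assembly line without touching the dispatch logic.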
Alert fatigue is a real problem for many organizations. Too many alerts can lead to operations teams ignoring or missing alerts that they should be paying attention to. Here are some ways to fight alert fatigue and to manage it properly:
Using Alerts to Understand Vital Signs
SRE and DevOps teams need processes that give them a full handle on their monitoring and alerting systems, allowing them to understand the vital signs of their business and to define normal states and thresholds accurately.
Understand and Document Dependencies in Systems and Alerts
SRE and operations teams also must understand the dependencies between alerts to fully grasp how the entire system is being impacted and why each alert was generated in the first place.
Consistent Alert Management
Most importantly, SRE and operations teams must put in place a continuous improvement process because no enterprise can know the proper thresholds and vital signs right away. It’s a process that involves time and reflection in which every cycle leads to improvement.
Ultimately, alert fatigue comes not just from the alerts themselves, but also from responding to them. Better preparation and additional contextual information can help users respond more precisely to the alerts that do arise. Improved automation of alerts can also help, as users will know their efforts are going to be rewarded and that their suffering isn't going to be in vain.
To achieve this, operations teams should better prepare for alerts, process them precisely, and learn from them. Here are some tips for all three areas:
- Connecting to sources of alerts
- Defining the thresholds that trigger alerts
- Designing the information included in an alert
- Monitoring those sources for alerts
- Defining how an alert will be enhanced with additional information
- Categorizing alerts based on where they are generated and what they indicate about what is occurring
- Associating alerts with various systems
- Routing the alert to the right team
- Enhancing the alert with additional information and context
- Grouping alerts that seem to be related or associated with specific systems, or deduplicating alerts that are about the same issue
- Creating tickets which are used to assign activities to individuals or teams to respond to issues
- Associating alerts with incidents so that multiple alerts related to a single incident or issue can all be brought together
- Associating alerts with runbooks, as runbooks can help companies resolve specific issues
- Notifying responders about specific alerts so that they can respond
- Presenting responders with alert analytics so when they receive a ticket they can then understand how to respond
- Executing automated responses
- Updating and interpreting analytics
- Letting the entire company know about an outage with a mass notification
- Assembling data for a post mortem
- Ensuring alerts are entered into an alert database so learning and the identification of trends can happen over the long term
- Creating suggestions for improvements, such as threshold changes, whether runbooks need to be updated, or whether notifications were directed to the right people and teams
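One step from the list above — deduplicating alerts that are about the same issue — can be sketched as follows. This is a minimal illustration: the fingerprint fields (`system`, `component`, `kind`) and the suppression window are assumptions, and real systems usually let the alert designer choose both.

```python
def deduplicate(alerts, window=60):
    """Drop alerts repeating the same fingerprint within `window` seconds.

    Updating last_seen even for dropped alerts means a continuous
    stream of duplicates stays suppressed until it goes quiet.
    """
    last_seen = {}
    unique = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        fp = (alert["system"], alert["component"], alert["kind"])
        if fp not in last_seen or alert["ts"] - last_seen[fp] > window:
            unique.append(alert)
        last_seen[fp] = alert["ts"]
    return unique

# Three copies of the same hypothetical alert; the middle one is noise.
raw = [
    {"ts": 0, "system": "checkout", "component": "db", "kind": "timeout"},
    {"ts": 30, "system": "checkout", "component": "db", "kind": "timeout"},
    {"ts": 200, "system": "checkout", "component": "db", "kind": "timeout"},
]
unique = deduplicate(raw)
```

The same fingerprint mechanism can feed the grouping, routing, and incident-association steps in the list: once duplicates collapse to one alert, the remaining alert carries the context responders actually need.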
Companies that are ready to take charge of their alerting systems and avoid alert fatigue should learn more about Lightstep Incident Response.