by Dan Woods
Alert management is a key part of managing modern, complex systems that are composed of many different APIs and microservices. Alerts are qualified, informational records used to monitor systems and assess current or upcoming problems.
Alert management generally occurs within IT or operations departments and lets them be notified of current or potential upcoming issues with how their services and systems are functioning. Another name for this is incident alert management, which allows companies to respond to major incidents based on critical alerts that notify teams of potential problems.
By monitoring alerts, companies are not just monitoring everything for its own sake; rather, they are discovering the vital signs for their enterprise systems so that if anything is going wrong with the most important processes, they can be made aware of it immediately.
The goal of an alert system is to define what’s normal and what’s not. Then, when something is not normal, alert management systems of various kinds should be able to turn that into a message, perhaps enhance it with additional information, route that message to the right person or team, group multiple alerts together when appropriate, and give that team the tools it needs to address the issue. Those tools include incident response management systems, ticket systems, and automation systems, while the alert itself is often driven by various observability or monitoring systems. Alert management is the process of defining and executing this entire scope of work, while also being able to generate some workflow automation.
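The flow described above (decide what is abnormal, turn it into a message, enhance it, and route it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service name, the metric, the 5% threshold, and the team names are all invented, and a real alert management product would have its own data model and APIs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    service: str
    metric: str
    value: float
    threshold: float
    context: dict = field(default_factory=dict)
    team: Optional[str] = None


# Hypothetical thresholds and routing table, invented for illustration.
THRESHOLDS = {("checkout-api", "error_rate"): 0.05}
ROUTES = {"checkout-api": "payments-oncall"}


def evaluate(service: str, metric: str, value: float) -> Optional[Alert]:
    """Return an Alert when the value exceeds the 'normal' threshold, else None."""
    threshold = THRESHOLDS.get((service, metric))
    if threshold is None or value <= threshold:
        return None
    return Alert(service, metric, value, threshold)


def enrich(alert: Alert) -> Alert:
    """Attach a human-readable summary as additional context."""
    alert.context["summary"] = (
        f"{alert.metric} on {alert.service} is {alert.value:.2%} "
        f"(threshold {alert.threshold:.2%})"
    )
    return alert


def route(alert: Alert) -> Alert:
    """Assign the owning team, falling back to a default triage queue."""
    alert.team = ROUTES.get(alert.service, "ops-triage")
    return alert


def process(service: str, metric: str, value: float) -> Optional[Alert]:
    """Run one reading through evaluate -> enrich -> route."""
    alert = evaluate(service, metric, value)
    return route(enrich(alert)) if alert else None
```

Calling `process("checkout-api", "error_rate", 0.12)` yields an alert routed to the assumed `payments-oncall` team, while a reading of `0.01` produces no alert at all. Keeping each stage as a separate function mirrors the way the stages are usually handled by separate tools.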
Alert management is important because it helps to ensure systems are running properly, which avoids disruptions to revenue and the vital operations of the business. The more revenue that can be lost when a particular system fails or experiences an outage, the more important it is to keep that system running fully. The cost of downtime can be huge, depending on the system that is affected: Gartner estimates that the average loss of revenue caused by downtime is $5,600 per minute. Alert management is a key part of finding problems as early as possible and then being able to respond to issues to avoid outages. Companies need an alert management workflow to ensure they can respond adequately to problems and that they have the right incident management team in place.
Efficient alert management allows companies to increase productivity, reduce downtime, reduce performance degradation, and resolve the issues that do arise more quickly.
The alert management process consists of monitoring systems, defining what’s normal, and notifying the right people when something abnormal occurs so that the right teams can be made aware of the issue and resolve it. It’s useful to prioritize adding instrumentation and metric thresholds for the most used and valuable services before spending extra time and effort bolstering monitoring and alerts for lower-impact services. Over the long term, companies should plan to build full coverage and visibility into the health of their systems.
What are the activities involved in alert management? The process often operates as follows:
- Connecting to sources of alerts
- Defining the thresholds that trigger alerts
- Designing the information included in an alert
- Monitoring those sources for alerts
- Defining how an alert will be enhanced with additional information
- Defining the on-call schedule to decide who will receive the alerts
- Categorizing alerts based on where they are generated and what they indicate about what is occurring
- Associating alerts with various systems
- Routing the alert to the right team
- Enhancing the alert with additional information and context
- Grouping alerts that seem to be related or associated with specific systems, or deduplicating alerts that are about the same issue
- Creating tickets that are used to assign activities to individuals or teams to respond to issues
- Associating alerts with incidents so that multiple alerts related to a single incident or issue can all be brought together
- Associating alerts with runbooks, as runbooks can help companies resolve specific issues
- Notifying responders about specific alerts so that they can respond
- Presenting responders with alert analytics so when they receive a ticket they can then understand how to respond
- Executing automated responses
- Updating analytics dashboards in real-time
- Communicating an outage to the entire company with a mass notification
- Assembling data for a postmortem if the alert has been promoted to an incident
- Ensuring alerts are entered into an alert database so learning and the identification of trends can happen over the long term
- Generating suggestions for improvements, such as whether thresholds need adjusting, whether runbooks need to be updated, or whether the alerts were directed to the right people and teams
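Two of the activities in the list, deduplicating alerts about the same issue and grouping related alerts by system, lend themselves to a small sketch. The version below assumes that two alerts are "the same issue" when they share a service and a check name; that rule and the field names are assumptions for illustration, and real systems typically use richer fingerprints and time windows.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    message: str


def deduplicate(alerts):
    """Keep the first alert for each (service, check) pair, preserving order."""
    seen = set()
    unique = []
    for alert in alerts:
        key = (alert.service, alert.check)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def group_by_service(alerts):
    """Group alerts by the service that generated them."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.service].append(alert)
    return dict(groups)


incoming = [
    Alert("db", "disk", "90% full"),
    Alert("db", "disk", "91% full"),  # same issue as the alert above
    Alert("db", "latency", "slow queries"),
    Alert("web", "5xx", "error spike"),
]
grouped = group_by_service(deduplicate(incoming))
```

Here the four incoming alerts collapse to three, and `grouped` holds two of them under `"db"` and one under `"web"`, which is the shape a responder or ticketing step would then work from.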
Determining what is normal and what’s not is an experimental process. Companies have to learn proper alert thresholds over time by refining and reflecting based on previous incidents, as setting thresholds too low can lead to excessive alerts and alert fatigue, while setting them too high can mean critical problems are overlooked. Using the ITIL alert categorization system can be helpful to determine the severity of the event the alert is bringing forward.
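The threshold trade-off can be made concrete with a toy example. The CPU readings below are made up, with the single reading of 97 standing in for a real problem; the point is only to show how the alert count changes as the threshold moves.

```python
def count_alerts(samples, threshold):
    """Count how many samples would fire an alert at the given threshold."""
    return sum(1 for value in samples if value > threshold)


# Made-up CPU utilization readings (percent); 97 represents a real incident.
cpu_samples = [42, 55, 61, 58, 73, 66, 97, 70, 64, 59]

too_low = count_alerts(cpu_samples, 50)    # 9 alerts, mostly noise
balanced = count_alerts(cpu_samples, 80)   # 1 alert, the real incident
too_high = count_alerts(cpu_samples, 99)   # 0 alerts, the incident is missed
```

Replaying historical data against candidate thresholds in this way is one simple approach to the refinement described above, though it assumes past behavior is representative of future load.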
Each alert should also bring with it a set of expected responses, such as manual remediation, recognizing whether the alert requires attention, creating an incident or security incident or case, closing the alert once it is resolved, and the ability to reopen an alert if needed. Companies should also have a way to turn off alerting when a system is undergoing planned maintenance changes.
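As a rough sketch of these two ideas, the code below models an alert lifecycle that supports closing and reopening, plus a check that suppresses alerting during planned maintenance. The status names, the allowed transitions, and the window format are assumptions chosen for illustration rather than any particular product's model.

```python
from datetime import datetime, timedelta

# Hypothetical lifecycle: which status changes are allowed.
TRANSITIONS = {
    "open": {"acknowledged", "closed"},
    "acknowledged": {"closed"},
    "closed": {"open"},  # reopening a resolved alert
}


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the change is not allowed."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move alert from {current} to {new}")
    return new


def is_suppressed(service, at, windows):
    """True when `at` falls inside a maintenance window for the service."""
    return any(
        w_service == service and start <= at < end
        for w_service, start, end in windows
    )


# Assumed two-hour maintenance window for an invented service name.
window_start = datetime(2024, 1, 1, 2, 0)
windows = [("billing-db", window_start, window_start + timedelta(hours=2))]
```

With this setup, an alert for `billing-db` at 03:00 on that date would be suppressed, while one at 04:00 would be delivered, since the window end is treated as exclusive.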
An alert management system is a crucial component of being able to gain transparency into, understand, respond to, and act proactively on issues that could impact the functioning of major IT software and systems within an enterprise. The more finely tuned a company’s alert management system is, the better the business will become at getting ahead of issues that can lead to downtime and outages. Learn more about IT alerting and why it is important.
For more information on Lightstep’s alert management solution, visit the Lightstep website.