What is IT Alerting?


by Dan Woods

Introduction

What Is IT Alerting?

At the highest level, IT alerting is about understanding whether complex systems are running properly. In a complex system, there are many inputs and outputs and lots of dependencies. To know if everything is working means monitoring hundreds or thousands of metrics. To discover and resolve problems, alerts are generated when metrics reveal something that is not normal, which generally occurs when a value is above or below a threshold, or when a value enters a certain percentile. Then, the operations staff receives those alerts and can decide how to address those issues.
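The two trigger conditions described above, a fixed threshold and a percentile, can be sketched in a few lines. This is a minimal illustration, not any particular product's logic: the function names, the sample latencies, and the 500 ms limit are all assumptions made for the example.

```python
from statistics import quantiles


def breaches_threshold(value, lower=None, upper=None):
    """Return True when value falls outside the allowed [lower, upper] band."""
    if lower is not None and value < lower:
        return True
    if upper is not None and value > upper:
        return True
    return False


def percentile(samples, pct):
    """Return the pct-th percentile (1-99) of samples, interpolating between points."""
    cut_points = quantiles(samples, n=100, method="inclusive")
    return cut_points[pct - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 900, 127, 131]

# Static threshold on a single reading.
print(breaches_threshold(latencies_ms[-1], upper=500))

# Percentile rule: alert when the 95th percentile exceeds the same limit.
p95 = percentile(latencies_ms, 95)
print(breaches_threshold(p95, upper=500))
```

The single reading of 131 ms passes the static check, while the 95th percentile is pulled above the limit by the outlier, which is why percentile rules are often used for latency-style metrics.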

The easiest way to understand incident alerting in the IT realm is to think of other systems where alerts are commonly used. For instance, in an automobile, dashboard alerts notify drivers about everything from general issues, such as “maintenance needed,” to more specific problems, like a low tire or a door or trunk left open. Or, to use another analogy, facilities and buildings often have security guards whose job is to alert more senior law enforcement personnel, such as police, if a major problem is occurring.

In the IT realm, alerts flag the misbehaving parts of complex systems. A whole range of alerting software has been created to generate and field alerts. Sometimes, so many alerts fire at once that the result is an alert storm, which makes it difficult to determine what is really wrong. This can cause what is often called “alert fatigue,” where there are so many alerts that users stop paying attention to them, much as in the classic children’s parable “The Boy Who Cried Wolf.” As a result, teams should routinely review their alerts and adjust threshold definitions so that alerts are neither too noisy nor too quiet.
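One common way to make a threshold less noisy is to require that it be breached several times in a row before an alert fires. The sketch below assumes this "consecutive breaches" approach; the function name, the CPU readings, and the default of three samples are illustrative choices, not something the article prescribes.

```python
def sustained_breach(samples, threshold, min_consecutive=3):
    """Return True once `min_consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it on a healthy reading.
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical CPU utilisation readings (percent).
spiky = [95, 40, 96, 41, 97]      # brief spikes that recover on their own
sustained = [50, 91, 92, 93]      # a genuine, lasting problem

print(sustained_breach(spiky, threshold=90))      # False
print(sustained_breach(sustained, threshold=90))  # True
```

Raising `min_consecutive` makes the alert quieter but slower to fire, which is the trade-off a periodic threshold review is meant to tune.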

What Is Alert Management?

In IT, an alert is a system reporting that something abnormal is occurring that users should know about and may need to act on. Alert management is the process of deciding what should be done about those alerts. This includes figuring out which alerts should go to which users and how automation can make the process as efficient as possible. Alert management is useful for detecting impending issues and for bringing awareness to existing ones. Prioritizing alerts by the severity of the breached metric can determine how “loud” the notification will be, from a phone call to a simple email.
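The idea that severity decides how loud a notification is can be expressed as a simple lookup. In this sketch the phone call and email endpoints come from the text above; the intermediate channels (`sms`, `chat_message`) and all names are assumptions added for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels, ordered from quietest to loudest."""
    INFO = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


# Hypothetical mapping: louder channels for more severe alerts.
CHANNEL_BY_SEVERITY = {
    Severity.CRITICAL: "phone_call",
    Severity.ERROR: "sms",
    Severity.WARNING: "chat_message",
    Severity.INFO: "email",
}


def pick_channel(severity):
    """Return the notification channel configured for a severity level."""
    return CHANNEL_BY_SEVERITY[severity]


print(pick_channel(Severity.CRITICAL))  # phone_call
print(pick_channel(Severity.INFO))      # email
```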

Alert management also involves enhancing alerts and recommending responses so alerts are addressed in a timely and efficient manner. Alert management thus has many practices and alerting tools associated with it, such as:

  • Incident response, which involves taking an incident and resolving it;
  • Observability, which involves the ability to observe what is going on in complex systems;
  • IT service management, which covers how the operation of a large and complex IT landscape is managed; and
  • Even more advanced practices that include AI and ML categorization of alerts and capacity planning.

Why are Monitoring and Alerting Important?

Monitoring and alerting are important because IT systems are so complicated today that it is hard to understand whether everything is running correctly. Companies need to be able to tell whether problems are occurring. For instance, a change may have introduced an unexpected error. Or there could be a cybersecurity attack underway, a part of the system may be failing, or there could be a sudden and unexpected spike in usage. Given the complexity of modern systems, it’s impossible for humans to do this type of monitoring and alerting manually. Companies must have automated systems performing the analysis.

What Is the Difference between Monitoring and Alerting?

Monitoring is simply capturing data to understand what is occurring with a specific metric, such as the number of users or requests. Alerting, by comparison, happens when a monitored metric crosses a threshold, indicating that something abnormal is happening. An alert therefore builds on monitoring: it evaluates the monitored data and flags issues that those in charge of the system must be aware of. The whole domain of observability has been created because monitoring and alerting have become so important.

What Are Alerting Notifications?

Alerting notifications are essentially messages that are sent out when an abnormality is detected during the monitoring process. The first question, though, is to whom and where that message should be sent. Understanding ownership of the failing component or service is key to routing the alert to the right team and the right on-call user. The second question is what additional information the person who receives the notification needs in order to make an informed decision about how to respond. Another way to ask this is: what is an alert in service operations?
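Both questions, who receives the message and what context it carries, can be answered in one small function. The following is a sketch under stated assumptions: the ownership map, team names, service names, and the runbook URL are all hypothetical placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical ownership map: which on-call group owns which service.
SERVICE_OWNERS = {
    "checkout-api": "payments-oncall",
    "search": "search-oncall",
}
DEFAULT_OWNER = "platform-oncall"  # used when no owner is registered


@dataclass
class Notification:
    recipient: str
    summary: str
    context: dict = field(default_factory=dict)


def build_notification(service, metric, value, threshold, runbook_url=None):
    """Answer 'who gets it' via ownership and 'what do they need' via context."""
    recipient = SERVICE_OWNERS.get(service, DEFAULT_OWNER)
    summary = f"{service}: {metric} is {value} (threshold {threshold})"
    context = {
        "service": service,
        "metric": metric,
        "value": value,
        "threshold": threshold,
    }
    if runbook_url:
        context["runbook"] = runbook_url
    return Notification(recipient, summary, context)


note = build_notification(
    "checkout-api", "error_rate", 0.12, 0.05,
    runbook_url="https://example.com/runbooks/checkout-errors",
)
print(note.recipient)
print(note.summary)
```

Falling back to a default owner is a design choice: it guarantees that an alert for an unregistered service still reaches someone rather than being dropped.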

Why Is IT Alerting Important?

IT alerting is done to ensure companies can understand whether their systems are operating correctly. With such complex systems built on dozens of services, as many systems are now, it’s a huge challenge to fully comprehend whether every part is functioning optimally.

Alerting is important because it allows companies to gain this understanding. If a business has a reliable alerting system that does not produce alert fatigue, it can get ahead of problems before they spiral out of control, and understand if there’s a cyber attack or usage spike as soon as possible, rather than waiting to do something until it’s too late. Alerting allows companies to be proactive, which is essential in today’s IT environment. Alerts are usually put in place all over the landscape, including server alerting, network alerting, edge alerting, and alerting on personal devices.

Why Is Alerting Done?

IT alerting is therefore done so companies can get ahead of problems and ensure their systems run more reliably and resiliently. Additionally, alerting can help companies avoid downtime, which can be extremely expensive: according to Gartner, downtime costs companies an average of $5,600 per minute.
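To put the per-minute figure in perspective, it can be scaled to the length of an outage. This is back-of-the-envelope arithmetic only; it uses the Gartner average quoted above as a default, and a real estimate would substitute a company's own cost per minute.

```python
COST_PER_MINUTE_USD = 5_600  # Gartner average quoted in the text


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage lasting `minutes` minutes."""
    return minutes * cost_per_minute


print(downtime_cost(15))  # 84000
print(downtime_cost(60))  # 336000
```

At that rate, detecting an incident even a few minutes earlier is worth tens of thousands of dollars.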

IT Alerting System Requirements

What Is the Purpose of Alerting?

Earlier, this article covered how alerts occur to surface abnormalities in systems. But once alerts are in place, companies can also understand the severity of the alert. The ITIL system categorizes alerts in the following table, which shows that alerts can be categorized based on their severity and what they are telling you about cybersecurity, DevOps, and system failures:

| Alert Severity | Description |
|---|---|
| Critical | A failure in the system's primary application. |
| Error | Any error that is fatal to the operation, but not the service or application. |
| Warning | May indicate that an error will occur if action is not taken. |
| Info | Normal operational messages that require no action. |

Additionally, once alerts start to be generated, they may come from related systems such as observability tools, and they can then feed ITSM ticketing, incident response, or automation operations that resolve them. Incident alert templates are often created to give alerts a common structure. And if an alert is major enough, depending on the number of system alerts triggered, the entire company may need to be notified, which can lead companies to establish mass notification systems, such as a critical event management system, to ensure everyone knows what is happening.

How Does an IT Alerting System Work?

An IT alerting system generally has a number of layers that help users recognize issues and pass them on to the proper departments or individuals so they can be resolved. These layers generally include:

  • Connect: Alerts should connect to as many monitoring systems as possible.
  • Collect: Alerts should be collected as they occur into an inbox or system so that they can be understood.
  • Categorize: Alerts should then be categorized. This categorization is the first step in grasping what is happening.
  • Correlate: Alerts should then be correlated to see how many are related to a single problem. This is where a company may declare an incident, which may have many individual alerts grouped underneath it.
  • Enhance: Companies can then look at that incident and enhance their understanding by adding additional data, whether manually or through automation.
  • Collaborate: Companies can collaborate, bringing together all the experts on a related topic to understand an alert.
  • Analyze: Once that team is together, they can analyze the problems the alert is highlighting.
  • Respond: Once the analysis has been completed, companies can respond to resolve the issue. Runbooks also help with this step.
  • Learn: The final step is to take the alerts from the incident to learn how to prevent such problems in the future or to be more proactive should they arise again. When companies learn from the experience, they may change the way alerts are set based on threshold levels, or fix how problems are responded to in the future. This learning can be formalized in a post-mortem document. Additionally, alert analytics dashboards can show insights on where alerts are originating — and when.
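The Correlate layer above, where many alerts are grouped under one incident, can be sketched with a simple rule: alerts for the same service that arrive close together belong to the same incident. The grouping rule, the five-minute window, and all class and field names are assumptions for illustration; real systems typically use richer signals such as topology or tags.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    message: str
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300):
    """Group alerts per service when each arrives within the window of the last."""
    incidents = []
    latest = {}  # service -> most recently opened Incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.service)
        if (current is not None
                and alert.timestamp - current.alerts[-1].timestamp <= window_seconds):
            current.alerts.append(alert)
        else:
            # Too far from the previous alert (or first one seen): new incident.
            current = Incident(alert.service, [alert])
            incidents.append(current)
            latest[alert.service] = current
    return incidents


alerts = [
    Alert("db", "connection pool exhausted", 0),
    Alert("web", "5xx rate high", 50),
    Alert("db", "replication lag", 100),
    Alert("db", "disk almost full", 1000),
]
for incident in correlate(alerts):
    print(incident.service, len(incident.alerts))
```

With these inputs the first two database alerts collapse into one incident, while the much later disk alert opens a new one.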

Create an Effective IT Alerting Strategy

To create an effective IT alerting strategy, companies need to do the following:

  • Understand the landscape

    • It’s crucial for companies to have a sophisticated model of what they are tracking, as this is a huge help in sorting out which alerts matter during an alert storm. The more sophisticated a company’s model of its landscape, the better it will be able to understand and respond to alerts.
  • Understand the teams

    • Companies should be able to answer the following questions when alerts arise, as the more sophisticated their model for handling alerts, the more deft they will be at solving the issues:
      • Who can help with what kind of alert?
      • Who are the generalists?
      • Who are the specialists?
  • Create and update detailed plans

    • When companies start out with alerts, they may have Runbooks that are relatively high level. But as time goes on, the Runbooks become more detailed and offer greater insights. These Runbooks become the plans that help guide the company’s response to specific alerts. Understanding and improving Runbooks must be part of the alerting process.
  • On-call automation

    • Companies should strive to automate the channeling of an alert to a team. This can be done most effectively when companies have a thorough understanding of their landscapes. This leads to the establishment of rules for who handles specific types of alerts.
  • Automated collaboration

    • Once a response team is assembled, automated collaboration can occur in as many ways as needed. This could involve bringing the team together using Zoom, Teams, Slack, or other tools, often automatically.
  • Apply tech to sort things out

    • Many of the modern systems use ML and AI to analyze alert storms to understand what is happening at a deeper level. These systems can not only automate the processing of the alert, but also the selection of the response team depending on the nature of the alert.
  • Continuous learning and improvement

    • Companies must make the time to do this and automate as much as possible. The goal is blameless post-mortems, so that everyone involved can understand what happened. The idea is to learn as thoroughly as one can from each incident to avoid issues in the future and make improvements.
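The on-call automation step in the strategy above comes down to rules that decide which team handles which kind of alert. A minimal sketch, assuming an ordered list of rules where the first match wins; the team names, alert fields, and the rules themselves are hypothetical.

```python
# Hypothetical (predicate, team) pairs, evaluated top to bottom; first match wins.
ROUTING_RULES = [
    (lambda a: a["severity"] == "critical", "incident-commander"),
    (lambda a: a["service"].startswith("db-"), "database-specialists"),
    (lambda a: a["category"] == "security", "security-specialists"),
]
FALLBACK_TEAM = "platform-generalists"  # generalists take anything unmatched


def route(alert):
    """Return the team that should receive the alert."""
    for matches, team in ROUTING_RULES:
        if matches(alert):
            return team
    return FALLBACK_TEAM


alert = {"severity": "warning", "service": "db-orders", "category": "performance"}
print(route(alert))  # database-specialists
```

Keeping the rules in an ordered list makes the precedence explicit, and the fallback reflects the generalist versus specialist split: specialists get the alerts that match their domain, and generalists triage the rest.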

IT Alerting Best Practices

To achieve efficient and effective alerting, companies should implement the following best practices:

  • Automate responses: The more companies can reduce human toil and human involvement in responding to alerts, the faster issues can be resolved. Therefore, having a sophisticated automation environment for alerts is essential.
  • Learn from experience — Companies have to take their experiences and improve their responses at every level of observability using past alerts/incidents, post-mortems, and analytics.
  • Increase resiliency: Companies can look at what is happening over the longer term and understand whether the design of their systems needs to change or whether particular parts are more brittle than others. Companies can be proactive about this by using techniques like chaos engineering, which injects errors on purpose to expose weak points and show whether systems can handle them.
  • Capacity planning — Finally, companies should try to understand the capacity of each of their systems, and the ability to support growth of each of these systems so that they can get ahead of issues over time.
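For the capacity planning practice above, a rough projection of when a system will run out of headroom is often enough to start with. The sketch below assumes steady compound growth per period, which is a simplification; the function name and the example numbers are illustrative.

```python
import math


def periods_until_full(current_usage, capacity, growth_rate):
    """Whole periods until usage, compounding at growth_rate, reaches capacity.

    Returns 0 if already at capacity and None if usage is not growing.
    """
    if current_usage <= 0:
        raise ValueError("current_usage must be positive")
    if current_usage >= capacity:
        return 0
    if growth_rate <= 0:
        return None
    # Solve current_usage * (1 + growth_rate) ** n >= capacity for n.
    return math.ceil(math.log(capacity / current_usage) / math.log(1 + growth_rate))


# Hypothetical: 600 GB used of 1,000 GB, growing 10% per month.
print(periods_until_full(600, 1000, 0.10))  # 6
```

A result like this can itself become an alert, for example a warning when fewer than three periods of headroom remain.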

