An event is a change or occurrence in the normal operations of a system, network, process, or workflow. Events can be triggered by manual input, such as pressing a button, or they can be generated automatically. They can be pulled (observed) or pushed (logged) to a system for tracking.
When working on-call incident response, it's important to be aware of events that could potentially lead to an incident. You may need to take action to prevent an incident from happening, or to mitigate the effects of one that has already occurred.
The highest volume of data you collect through monitoring consists of individual events: point-in-time, transactional records of how a particular metric of interest is performing. These metrics generally cover saturation, latency, error rate, or traffic.
Examples of events include 55% CPU usage, 75% semaphore usage, 200K concurrent users, a 75°F temperature, 404 responses from a web page, 500 errors on API calls, and other point-in-time snapshots of a system's state. You use this ‘raw’ event data to track system changes over time by plotting it onto visualized metrics, and to trigger logic that escalates an event to an alert if a performance threshold is breached.
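The escalation logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` class, the metric names, and the limits in `THRESHOLDS` are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    """A raw, point-in-time reading of one metric (hypothetical shape)."""
    metric: str
    value: float
    timestamp: datetime


# Hypothetical safe upper limits per metric; real values depend on your service.
THRESHOLDS = {"cpu_percent": 90.0, "latency_seconds": 4.0}


def evaluate(event: Event) -> Optional[str]:
    """Return an alert message if the event breaches its threshold, else None."""
    limit = THRESHOLDS.get(event.metric)
    if limit is not None and event.value > limit:
        return f"ALERT: {event.metric}={event.value} exceeds {limit}"
    # Events within limits (or without a configured limit) stay raw data.
    return None


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(evaluate(Event("cpu_percent", 55.0, now)))  # within limits, no alert
    print(evaluate(Event("cpu_percent", 95.0, now)))  # breaches the limit
```

Most events pass through `evaluate` without producing anything, which mirrors the point that events are raw data and only the ones that breach a threshold are escalated.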
An alert is something that needs your attention but doesn't necessarily require an immediate response. The main purpose of alerts in incident response systems is to get the right person's attention to investigate a potential issue related to a service that they own or support. Alerts need to fire in close to real time because, if the issue is validated, restoration needs to happen as quickly as possible to keep MTTD (mean time to detect/diagnose) and MTTR (mean time to resolve) low. Alerts are primarily informational: they are qualified events where predefined system logic indicates an issue that a human needs to validate before deciding whether an incident is required. Only your technical on-call SME team works on alerts. Alerts are called qualified events because monitoring constantly collects events that track metrics such as latency, errors, and saturation, and those events are simply ‘raw data’, independent of whether the data is good or bad. Alerts are the qualified subset of these events deemed bad, paired with a notification to inform the technical team.
An incident is an event that negatively affects an organization and requires immediate attention. To restore a degraded service to agreed operational levels as quickly as possible, incidents usually involve a large team of responders, sometimes cross-functional, who work together to diagnose and remediate the issue. A big distinction is that incidents are confirmed service degradations, and as a result they are a higher priority to close than alerts. Compared to alerts, the sensitive nature of incidents means there is additional process around defining response roles, stakeholder communication, and postmortems to ensure incidents are resolved smoothly. As a result, the users involved in incident response range from technical on-call SMEs and incident commanders to scribes and customer liaisons/communication leads.
Let’s discuss how events, alerts, and incidents are created to understand the nuances of each.
How Are Events, Alerts, and Incidents Created?
Events and Alerts
Let’s begin with how these different objects are created. Events and alerts are almost always system generated, meaning there is no human in the loop to create the alert record. So, what creates alerts? Alerts are created by your monitoring and observability systems, which look at different metrics related to your environment's health. This could be latency, errors, traffic, saturation, logs, or other areas you are monitoring. These systems should have thresholds set up, either at fixed levels (e.g. 4 seconds of latency) or at quartiles (e.g. the 3rd quartile), and they constantly collect events describing the health of the system at a given point in time. When an event registers a value outside the defined safe thresholds, the monitoring/observability tool sends a qualified alert to your incident response tool. In this manner, alerts are qualified events that indicate some level of predefined danger or degradation. We also see alerts generated from integrations with other systems, such as SIEM or ITSM tools that escalate an alert to get the attention of an on-call resource to investigate something. In all cases, it is usually system-defined logic and an integration that triggers the alert to be opened. The channels include the observability and monitoring tools already mentioned, generic webhooks called by apps or custom scripts, emails, and, as the exception, manually created alerts for testing or to get someone's attention.
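To make the two threshold styles concrete, here is a minimal Python sketch that compares a fixed limit with a limit derived from the third quartile of recent readings. The function names and sample values are assumptions for illustration, and the standard library's `statistics.quantiles` stands in for whatever calculation a real monitoring tool performs.

```python
import statistics
from typing import Sequence


def fixed_threshold_breached(value: float, limit: float = 4.0) -> bool:
    """Fixed-level check, e.g. latency above 4 seconds."""
    return value > limit


def third_quartile(history: Sequence[float]) -> float:
    """Return Q3 of recent readings; quantiles(n=4) yields [Q1, Q2, Q3]."""
    return statistics.quantiles(history, n=4, method="inclusive")[2]


def quartile_threshold_breached(value: float, history: Sequence[float]) -> bool:
    """Relative check: the limit moves with the recent behaviour of the metric."""
    return value > third_quartile(history)


if __name__ == "__main__":
    recent_latency = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical readings in seconds
    print(third_quartile(recent_latency))
    print(fixed_threshold_breached(4.5))
    print(quartile_threshold_breached(4.5, recent_latency))
```

A fixed limit is easy to reason about, while a quartile-based limit adapts to what is normal for a given service; which one fits depends on how stable the metric is.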
Below we showcase an alert detail view within Lightstep Incident Response, showing the alert context, a timeline of alert history, and any active collaboration channels/responders. Related tabs can be seen for child alerts and automation.
Incidents
As we previously established, incidents are validated service degradations commonly impacting your revenue-generating applications, internal employees, or external customers. While not every alert gets promoted to an incident, it is common to manually promote a verified alert that is linked to a degraded service into an incident. You can also define specific policies for when an alert should automatically be escalated into an incident because of its severity (app unreachable, DNS not resolving, etc.). So how else do incidents get created, other than through the promotion of an alert? Manual reports are another common channel for incidents, as sometimes you find gaps in your monitoring coverage and visibility. Customers and users will make noise when there is an issue, and teams will need to create an incident to diagnose and remediate the perceived degradation. You can manually create incidents from a variety of channels, from the mobile app and a Slack action to the desktop web app. The last channel where incidents may arise is integrations with existing SIEM or ITSM tools that are again triaging issues to an on-call incident response team.
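The two promotion paths described above, automatic escalation for severe conditions and manual promotion of a verified alert, can be expressed as a small policy function. This is a hedged sketch: the `Alert` fields, the condition labels, and `should_promote` are hypothetical names, not the configuration model of Lightstep Incident Response or any other tool.

```python
from dataclasses import dataclass

# Hypothetical condition labels severe enough to skip manual review.
AUTO_PROMOTE_CONDITIONS = {"app_unreachable", "dns_not_resolving"}


@dataclass
class Alert:
    """An alert awaiting triage (hypothetical shape)."""
    service: str
    condition: str
    verified_by_human: bool = False


def should_promote(alert: Alert) -> bool:
    """Decide whether an alert becomes an incident."""
    # Policy path: severe conditions are escalated automatically.
    if alert.condition in AUTO_PROMOTE_CONDITIONS:
        return True
    # Manual path: an on-call responder has confirmed a degraded service.
    return alert.verified_by_human


if __name__ == "__main__":
    print(should_promote(Alert("checkout", "dns_not_resolving")))
    print(should_promote(Alert("checkout", "high_latency")))
    print(should_promote(Alert("checkout", "high_latency", verified_by_human=True)))
```

Alerts that match neither path stay alerts, which keeps the incident list limited to confirmed or clearly severe degradations.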
Below we show an incident detail view, showing the incident context, a timeline of incident history, and any active collaboration channels and stakeholders. Related tabs can be seen for child alerts and postmortem.
Alerts and incidents, while both important work for incident response teams, are different enough to warrant their own processes. Since not all alerts become incidents, we find the distinction important to reduce noise and improve reporting on the health of your services. Incidents can mean lost revenue, customers, and data, and if they're not handled quickly and effectively, they can turn into full-blown disasters.
Developers, DevOps, and Site Reliability Engineers are under pressure to respond quickly to incidents but often don't have the right tools or processes in place. When you're on-call, it's important to understand the difference between events, alerts, and incidents.
Lightstep Incident Response is the all-in-one platform that enables developers, DevOps, and site reliability engineers to respond quickly and effectively to incidents. Our platform brings together the right people, processes, and tools in one place so you can get your business back up and running fast.
In our next blog, you can read about the nuance between the workflows involved in closing out an alert versus an incident.
Sign up for a free trial of Lightstep Incident Response
October 5, 2022
About the author
Darius Koohmarey