With the release of Change Intelligence, we’ve also released the ability to create metrics-based alerts.
With this feature, users will be able to:
Define the metric or a formula (an arithmetic expression of multiple metrics) for the alert
Set the conditions in which an alert is considered violated, based on its evaluation criteria and window
Define the severity of an alert, based on critical or warning thresholds, or no-data violations
Specify notifications and re-notify frequency through PagerDuty, Slack, or custom webhooks
Choose to trigger an alert based on an aggregate (using simple alerts) or by specified group parameters (using multi-alerts)
Snooze an alert for a period of time to silence notifications
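To make these options concrete, here is a rough sketch (in Go, with illustrative field names rather than our actual schema) of what an alert definition captures:

// AlertDefinition is a hypothetical, simplified model of the options above.
type AlertDefinition struct {
    // The metric query or formula, e.g. "errors / requests * 100".
    Expression string

    // Evaluation criteria and window, e.g. the average over the last 5 minutes.
    EvaluationWindow time.Duration

    // Severity thresholds; an alert can also violate when no data arrives.
    CriticalThreshold *float64
    WarningThreshold  *float64
    AlertOnNoData     bool

    // Notification destinations (PagerDuty, Slack, or custom webhooks) and how
    // often to re-notify while the alert remains violated.
    Destinations     []string
    RenotifyInterval time.Duration

    // Empty for a simple alert on an aggregate; for multi-alerts, one alert is
    // evaluated per combination of these group parameters.
    GroupBy []string

    // An active snooze silences notifications until it expires.
    SnoozedUntil *time.Time
}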
To power the backend of this feature, we built a new service called Alert Evaluator. Alert Evaluator handles alert evaluation, deduplication, grouping and delivery, and snoozing. By these, we mean:
Evaluation: “at a basic level, is this alert violating a threshold?”
Deduplication: “how do we prevent accidental duplicate notifications?”
Grouping and delivery: “where, what, and when do we notify?”
Snoozing: “when should we prevent a notification, and what should we send when snoozing is over?”
We faced a number of interesting technical challenges, which are outlined below.
Building Alert Evaluator
While building our first metrics-based features in Lightstep, our guiding principle on the backend has been to start with the simplest design while prioritizing rapid and flexible feature development. To accomplish this with the new Alert Evaluator service, we split alert evaluation into two components: the Evaluator and the Sender.
The first component, the Evaluator, handles two main responsibilities: retrieving alert definitions and evaluating them continuously. The Evaluator queries our metrics database for each alert evaluation on a ticker, checks the results for alert violations, and transmits the latest status of the alert to other services.
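As a simplified illustration of what checking the results involves, an evaluation ultimately compares the queried values against the configured thresholds. The sketch below is not the production logic, and the status names are only illustrative:

// evaluatePoints is a simplified, hypothetical sketch of threshold evaluation.
func evaluatePoints(points []float64, warning, critical *float64) string {
    if len(points) == 0 {
        // No data arrived in the evaluation window.
        return "no-data"
    }
    status := "ok"
    for _, v := range points {
        if critical != nil && v >= *critical {
            return "critical"
        }
        if warning != nil && v >= *warning {
            status = "warning"
        }
    }
    return status
}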
The second component, the Sender, handles the logic to notify users about their alerts. Once an alert has been evaluated, the Sender uses a decision matrix to determine if, when, and where to deliver a notification. This decision matrix takes into account the previous and current state of the alert evaluation, any ongoing snoozes, and the re-notification interval configured by the user.
At any point in time, an alert is in a specific state, and a transition from one state to another may require the Sender to notify users with a specific message. Some of these transitions and triggered behaviours are described below:
For example, a violated metrics alert that resolves should send a Resolved notification immediately. Or, if an alert remains continuously violated, you can configure your alert to re-notify on a certain interval.
This Sender decision matrix also needs to take into account other features, such as snoozing. For example, if an alert violates while snoozed, it should immediately trigger once the specified snoozing interval is complete. This should happen regardless of the re-notification interval for continual violations.
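To give a flavour of what this decision matrix looks like, here is a simplified sketch; the state names and fields are illustrative, and the real matrix handles more transitions than shown here:

// senderState captures what the Sender needs in order to decide whether to
// notify. Field and status names are illustrative, not our actual schema.
type senderState struct {
    previousStatus string // e.g. "ok" or "violated"
    currentStatus  string
    lastNotifiedAt time.Time
    snoozedUntil   time.Time // zero value means "not snoozed"
    renotifyEvery  time.Duration
}

// shouldNotify is a simplified sketch of the decision matrix described above.
func shouldNotify(s senderState, now time.Time) (bool, string) {
    snoozed := now.Before(s.snoozedUntil)

    switch {
    // A violated alert that resolves sends a Resolved notification immediately.
    case s.previousStatus == "violated" && s.currentStatus == "ok":
        return !snoozed, "resolved"

    // A fresh violation notifies immediately, unless the alert is snoozed.
    case s.previousStatus == "ok" && s.currentStatus == "violated":
        return !snoozed, "violated"

    // A violation that happened during a snooze fires as soon as the snooze is
    // over, regardless of the re-notification interval.
    case s.currentStatus == "violated" && !snoozed && s.lastNotifiedAt.Before(s.snoozedUntil):
        return true, "violated"

    // An alert that stays violated re-notifies on the configured interval.
    case s.currentStatus == "violated" && !snoozed && s.renotifyEvery > 0 &&
        now.Sub(s.lastNotifiedAt) >= s.renotifyEvery:
        return true, "violated"
    }
    return false, ""
}

Keeping this decision in one small, pure function also makes the matrix straightforward to unit test across combinations of state, snooze, and re-notification interval.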
To simplify alerting state management and solve the problem of duplicate alert delivery, we use an isolated relational database. Thanks to row-level locking capabilities, we were able to further parallelize evaluations and shard the Alert Evaluator service with ease.
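As an illustration of how row-level locking helps, assuming a Postgres-style database and hypothetical table and column names, each evaluator instance can claim the alerts that are due with a SELECT ... FOR UPDATE SKIP LOCKED query, so that no two instances pick up the same alert:

// claimDueAlerts is a hypothetical sketch: the table and column names are
// illustrative, not our actual schema.
func claimDueAlerts(ctx context.Context, tx *sql.Tx, limit int) ([]string, error) {
    // SKIP LOCKED makes concurrent transactions skip rows that another
    // evaluator instance has already locked, so each alert is claimed once.
    rows, err := tx.QueryContext(ctx, `
        SELECT alert_id
        FROM alert_status
        WHERE next_evaluation_at <= now()
        ORDER BY next_evaluation_at
        LIMIT $1
        FOR UPDATE SKIP LOCKED`, limit)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var ids []string
    for rows.Next() {
        var id string
        if err := rows.Scan(&id); err != nil {
            return nil, err
        }
        ids = append(ids, id)
    }
    return ids, rows.Err()
}

Because row locks are released when the transaction ends, any alerts claimed by an instance that crashes simply become available to the other instances on the next tick.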
Reliable Alert Evaluation
Beyond powering basic metric alert evaluation and keeping the design extensible, our goals for Alert Evaluator are rapid evaluation and graceful handling of upstream and downstream service degradation.
From the moment a metrics-based Lightstep alert is created, it evaluates and checks for violations on a 1-minute ticker. Assuming minimal metric ingestion delays, we evaluate all alerts every 30 seconds, which has been advantageous in catching and resolving alert violations faster than competing metrics products. From dogfooding our own metrics alerts, we’ve seen that Lightstep alerts trigger sooner during an incident and resolve more quickly when incident symptoms subside.
Since we rely on Alert Evaluator to monitor our own critical infrastructure, it was built to gracefully handle slowdowns elsewhere in our system, and reliability has been baked into its features since the initial MVP design and implementation. Evaluations run on a jitter to send a consistent query load to our internal time series database, and are configured to skip sending queries during downstream service degradation until those services recover. The simplified code snippet below illustrates how we run the evaluation loop:
func (e *evaluation) startEvaluationLoop(evaluationInterval time.Duration) {
    e.wg.Add(1)
    ctx := context.Background()
    go func() {
        defer e.wg.Done()
        // Smear at a random offset within evaluationInterval, to ensure smooth load.
        nextEval := e.clock.Now()
        nextEval = nextEval.Add(time.Duration(rand.Float64() * float64(evaluationInterval)))
        for {
            select {
            case <-e.done:
                return
            case <-e.clock.After(nextEval.Sub(e.clock.Now())):
                evaluationTime := e.clock.Now()
                // This is where the bulk of evaluation happens. We query MetricDB,
                // process the results, trigger Sender to notify if needed, etc.
                err := e.run(ctx, evaluationTime)
                if err != nil {
                    // Handle the error appropriately, etc., and change the
                    // alert's status to Unknown.
                    e.changeStatus(ctx, types.StatusUnknown, nil, nil, evaluationTime)
                }
                nextEval = nextEval.Add(evaluationInterval)
            }
            // Force nextEval into the future, in case we missed some ticks.
            for nextEval.Before(e.clock.Now()) {
                nextEval = nextEval.Add(evaluationInterval)
            }
        }
    }()
}
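The snippet shows the jitter; the degradation handling is omitted for brevity. Conceptually, it wraps e.run with a health check on the downstream database, roughly like the following sketch (healthChecker is a hypothetical interface, not part of the snippet above):

// healthChecker is a hypothetical interface reporting whether the downstream
// time series database is healthy enough to query.
type healthChecker interface {
    Healthy(ctx context.Context) bool
}

// runIfHealthy sketches the "skip while degraded" behaviour: when the metrics
// database is degraded, we skip this tick's query load entirely and mark the
// alert's status as Unknown, rather than piling more queries onto a
// struggling service.
func (e *evaluation) runIfHealthy(ctx context.Context, hc healthChecker, evaluationTime time.Time) error {
    if !hc.Healthy(ctx) {
        e.changeStatus(ctx, types.StatusUnknown, nil, nil, evaluationTime)
        return nil
    }
    return e.run(ctx, evaluationTime)
}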
Alert Evaluator pulls alert definitions and snoozes from other services every 30 seconds; however, it is built to continue operating even amidst upstream service degradation. It maintains its own copy of existing alerts and continues to serve alert evaluations, even if alert configuration updates are temporarily unavailable.
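A simplified sketch of that behaviour, with hypothetical fields (a mutex and a cached slice of definitions on the evaluation struct) and a fetch function standing in for the upstream call:

// refreshDefinitions sketches the definition sync loop: every 30 seconds it
// pulls alert definitions from upstream, but on failure it simply keeps the
// last copy it fetched successfully, so evaluation never stops.
func (e *evaluation) refreshDefinitions(ctx context.Context, fetch func(context.Context) ([]AlertDefinition, error)) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-e.done:
            return
        case <-ticker.C:
            defs, err := fetch(ctx)
            if err != nil {
                // Upstream is degraded: keep evaluating the cached definitions.
                continue
            }
            e.mu.Lock()
            e.definitions = defs
            e.mu.Unlock()
        }
    }
}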
We hope you enjoy using our Lightstep alerting features. View our Learning Path, Use Change Intelligence from a Metric Alert, for step-by-step instructions on how to find the root cause when alerted to a deviation in your metrics. If you are interested in how Alert Evaluator was powered by our internal time series database, check out the blog post on How we built Lightstep Metrics: creating a database from scratch. And if you are an engineer who thinks these kinds of problems are interesting, please check out our Careers page - we would love to have more people like you!