
How we built Lightstep Metrics: production-ready alerting


by Stephanie Baum and Michelle Lee
02-08-2021


With the release of Change Intelligence, we’ve also released the ability to create metrics-based alerts.

Lightstep Metrics Alert

With this feature, users will be able to:

  • Define the metric or a formula (an arithmetic expression of multiple metrics) for the alert
  • Set the conditions in which an alert is considered violated, based on its evaluation criteria and window
  • Define the severity of an alert, based on critical or warning thresholds, or no-data violations
  • Specify notifications and re-notify frequency through PagerDuty, Slack, or custom webhooks
  • Choose to trigger an alert based on an aggregate (using simple alerts) or by specified group parameters (using multi-alerts)
  • Snooze an alert for a period of time to silence notifications
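The options above can be pictured as a single alert definition. The struct below is our own illustration of that shape; the field names are hypothetical, not Lightstep's actual configuration schema:

```go
package main

import (
	"fmt"
	"time"
)

// AlertDefinition is a hypothetical shape for a metrics alert; the real
// Lightstep configuration may differ.
type AlertDefinition struct {
	Query             string        // metric or arithmetic formula, e.g. "errors / requests"
	EvaluationWindow  time.Duration // window over which the condition is evaluated
	CriticalThreshold *float64      // crossing this threshold is a critical violation
	WarningThreshold  *float64      // crossing this threshold is a warning violation
	AlertOnNoData     bool          // treat missing data as a violation
	GroupBy           []string      // empty: simple alert on the aggregate; non-empty: multi-alert per group
	Destinations      []string      // PagerDuty, Slack, or custom webhook targets
	RenotifyInterval  time.Duration // how often to re-notify while still violated
}

// IsMultiAlert reports whether the alert fans out per group
// rather than firing once on the aggregate.
func (d AlertDefinition) IsMultiAlert() bool {
	return len(d.GroupBy) > 0
}

func main() {
	crit := 0.05
	d := AlertDefinition{
		Query:             "errors / requests",
		EvaluationWindow:  5 * time.Minute,
		CriticalThreshold: &crit,
		GroupBy:           []string{"service"},
		Destinations:      []string{"pagerduty:oncall"},
		RenotifyInterval:  30 * time.Minute,
	}
	fmt.Println(d.IsMultiAlert()) // a multi-alert: one evaluation per service
}
```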

To power the backend of this feature, we built a new service called Alert Evaluator. Alert Evaluator handles the evaluation, deduplication, grouping, delivery, and snoozing of alerts. By these, we mean:

  • Evaluation: “at a basic level, is this alert violating a threshold?”
  • Deduplication: “how does it prevent accidental duplicate notifications?”
  • Grouping and delivery: “where, what, and when do we notify?”
  • Snoozing: “when should we prevent a notification, and what should we send when snoozing is over?”
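The first of these questions, "is this alert violating a threshold?", can be sketched as a pure function from a query result to a status. The names below are illustrative, not the service's real types:

```go
package main

import "fmt"

type Status string

const (
	StatusOK       Status = "ok"
	StatusWarning  Status = "warning"
	StatusCritical Status = "critical"
	StatusNoData   Status = "no_data"
)

// evaluate answers "is this alert violating a threshold?" for one data point.
// hasData is false when the query returned nothing for the evaluation window.
func evaluate(value float64, hasData bool, warning, critical float64) Status {
	switch {
	case !hasData:
		return StatusNoData
	case value >= critical:
		return StatusCritical
	case value >= warning:
		return StatusWarning
	default:
		return StatusOK
	}
}

func main() {
	fmt.Println(evaluate(0.12, true, 0.05, 0.10)) // critical
	fmt.Println(evaluate(0.07, true, 0.05, 0.10)) // warning
	fmt.Println(evaluate(0.01, true, 0.05, 0.10)) // ok
}
```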

We faced a number of interesting technical challenges, which are outlined below.

Building Alert Evaluator

While building our first metrics-based features in Lightstep, our guiding principle on the backend has been to start with the simplest design, while prioritizing rapid and flexible feature development. In order to accomplish this with the new Alert Evaluator service, we split alert evaluation into two components: the Evaluator and the Sender.

The first component, Evaluator, handles two main responsibilities: retrieving alert definitions and evaluating them continuously. Evaluator queries our metrics database for each alert evaluation on a ticker, processes the results for alert violations, and transmits the latest status of the alert to other services.

The second component, Sender, handles the logic to notify users about their alert. Once the alert has been evaluated, it uses a decision matrix to determine if, when, and where to deliver this notification. This decision matrix needs to take into account the previous and current state of the alert evaluation, ongoing snoozes, and the re-notification interval configured by the user.

At any point in time, an alert is in a specific state, and a transition from one state to another may require the Sender to notify users with a specific message. Some of these transitions and the behaviors they trigger are described below:

Lightstep Metrics - state diagram of basic transition

For example, a violated metrics alert that resolves should send a Resolved notification immediately. Or, if an alert remains continuously violated, you can configure it to re-notify at a set interval.

This Sender decision matrix also needs to take into account other features, such as snoozing. For example, if an alert violates while snoozed, it should immediately trigger once the specified snoozing interval is complete. This should happen regardless of the re-notification interval for continual violations.

Lightstep Metrics - state diagram of snoozed transition

To simplify alerting state management and solve the problem of duplicate alert delivery, we use an isolated relational database. Thanks to row-level locking capabilities, we were able to further parallelize evaluations and shard the Alert Evaluator service with ease.

Reliable Alert Evaluation

Beyond powering basic metric alert evaluation functionality and extensibility, our goals for Alert Evaluator are to enable rapid evaluation and to gracefully handle upstream and downstream service degradations.

From the moment a metrics-based Lightstep alert is created, it is able to evaluate and check for violations on a 1-minute time ticker. Assuming minimal metric ingestion delays, we evaluate all alerts every 30 seconds, which has been advantageous in catching and resolving alert violations faster than competing metrics products. We’ve seen from our dogfooding of our own metrics alerts that Lightstep-based alerts trigger sooner during an incident, and resolve quicker when incident symptoms subside.

Since we rely on Alert Evaluator to monitor our own critical infrastructure, it was built to gracefully handle slowdowns elsewhere in our system, and reliability was baked into its features from its initial design and implementation as an MVP. Evaluations are run with jitter to keep the query load on our internal time series database smooth, and are configured to intelligently skip queries during downstream service degradation until those services recover. We have presented a simplified code snippet below to illustrate how we run this in the evaluation loop:

func (e *evaluation) startEvaluationLoop(evaluationInterval time.Duration) {
	e.wg.Add(1)

	ctx := context.Background()
	go func() {
		defer e.wg.Done()

		// Smear at a random offset within evaluationInterval, to ensure smooth load.
		nextEval := e.clock.Now()
		nextEval = nextEval.Add(time.Duration(rand.Float64() * float64(evaluationInterval)))
		for {
			select {
			case <-e.done:
				return
			case <-e.clock.After(nextEval.Sub(e.clock.Now())):
				evaluationTime := e.clock.Now()

				// This is where the bulk of evaluation happens. We query MetricDB,
				// process the results, trigger Sender to notify if needed, etc.
				err := e.run(ctx, evaluationTime)
				if err != nil {
					// Handle the error appropriately, then change the
					// alert's status to Unknown.
					e.changeStatus(ctx, types.StatusUnknown, nil, nil, evaluationTime)
				}
				nextEval = nextEval.Add(evaluationInterval)
			}

			// Force nextEval into the future, in case we missed some ticks.
			for nextEval.Before(e.clock.Now()) {
				nextEval = nextEval.Add(evaluationInterval)
			}
		}
	}()
}

Alert Evaluator pulls alert definitions and snoozes from other services every 30 seconds; however, it is implemented to continue operating even amidst upstream service degradation. It maintains its own copy of existing alerts and continues to evaluate them, even if alert configuration updates are temporarily unavailable.

We hope you enjoy using our Lightstep alerting features. View our Learning Path, Use Change Intelligence from a Metric Alert, for step-by-step instructions on how to find the root cause when alerted to a deviation in your metrics. If you are interested in how Alert Evaluator is powered by our internal time series database, check out the blog post on How we built Lightstep Metrics: creating a database from scratch. And if you are an engineer who thinks these kinds of problems are interesting, please check out our Careers page: we would love to have more people like you!
