
Adaptive Paging and the (dreaded) Christmas Tree Effect

We recently sat down with Luis Mineiro, Head of Site Reliability Engineering at Zalando, during a session of our 99PercentVisible: DevOps Reliability tech talks. His talk, “Are we on the same page? Let’s fix that,” discusses the practice of alerting and how it can go from a routine task to team burnout if not done correctly.

We’ll recap the conversation and do a deep dive on what is known as “The Christmas Tree Effect” and how Adaptive Paging is the correct solution for it.

The Christmas Tree Effect

Hypothetical: Let’s say there’s an outage. Teams across the company have set up alerts to make sure the microservices they are responsible for are covered. Setting up alerts is proactive, right? Unfortunately, this isn’t always the case. Because each team has set up alerts for its own microservices, a single outage now triggers hundreds, if not thousands, of alerts.

Enter: The Christmas Tree Effect.

[Figure: Adaptive Paging - Christmas Tree Effect]

Pagers are now going off everywhere. As Luis says, “It's almost the same [as a Christmas tree], except I'm pretty sure that happiness levels are very different, and I'm very sure no one is getting any presents.”

What's happening? An excessive number of paging alerts pulls many teams into the incident when typically only one team is able to solve the issue. You have now wasted time, created alert fatigue, and burned out your teams. Sound familiar?

Symptomatic Alerts

Current Time: Alerting on every microservice creates headaches. (Who knew?) The industry standard now is alerting on symptoms: Symptomatic Alerts. This places the alert at a location with a good signal-to-noise ratio, based on a controllable and measurable Service Level Objective (SLO).
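As a rough illustration of a symptom-based alert, the sketch below checks a single SLO at one user-facing operation instead of watching every microservice. The `Slo` class, the `place-order` name, and the 99.9% target are all hypothetical values chosen for the example, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """Hypothetical SLO: the minimum fraction of requests that must succeed."""
    name: str
    target_success_ratio: float  # e.g. 0.999 means 99.9% of requests succeed


def slo_violated(slo: Slo, total_requests: int, failed_requests: int) -> bool:
    """Return True when the measured success ratio falls below the SLO target."""
    if total_requests == 0:
        return False  # no traffic in the window, so nothing to measure
    success_ratio = (total_requests - failed_requests) / total_requests
    return success_ratio < slo.target_success_ratio


# One alert rule on the symptom (failed orders), not one rule per microservice.
checkout_slo = Slo(name="place-order", target_success_ratio=0.999)
should_page = slo_violated(checkout_slo, total_requests=10_000, failed_requests=25)
```

With 25 failures in 10,000 requests the success ratio is 99.75%, which is below the assumed target, so this single rule would fire.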

Let’s pause for a second and break down the word: control. We have previously discussed how stress is, by definition, responsibility without control. Teams need to be paged accurately to maintain their scope of control and responsibility, all while proactively trying to reduce the inevitable stress.

[Figure: Deep Systems Control vs Responsibility]

Is alerting on symptoms the most productive way of reducing stress? Good news: When an SLO is violated, only one team is alerted. Woo! Bad news: That same team will be paged for each and every possible failure in the distributed system, including failures in services it does not own. Although this is slightly better, it still leads to the same alert fatigue and team burnout, albeit on a smaller scale. Alas, stress is still lurking.

[Figure: Adaptive Paging - Symptomatic Alerts]

I know what you're thinking: is there ever going to be a viable solution for alerting? Spoiler alert: there is.

Adaptive Paging

“Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem.”

Say goodbye to The Christmas Tree Effect! From a single alerting rule, the most probable root cause is identified and a single page is sent, so only the responsible team receives it.

[Figure: Adaptive Paging - Single Page]

Here’s a breakdown of the Adaptive Paging steps. After your team sets an SLO, creates the alert, and the alert is triggered, the alert handler:

  • Collects exemplars: A set of traces that are representative of the situation.

  • Checks all child spans, starting at the signal of the alert: Beginning at the span where the SLO is defined, it decides which path to take toward the affected operation.

  • Follows the path where error=true: Repeats the check on child spans recursively until an error in a service is found.

  • Sends a single page to the one team responsible for the service.
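The steps above can be sketched roughly as follows. This is a minimal illustration and not Zalando's implementation: the `Span` class, its `owner_team` field, and the majority vote across exemplar traces are assumptions made for the example, and a real handler would read spans from a tracing backend.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Simplified, hypothetical span; real tracers expose much richer data."""
    service: str
    owner_team: str       # assumption: each service maps to one owning team
    error: bool = False   # mirrors the OpenTracing `error` tag
    children: List["Span"] = field(default_factory=list)


def find_probable_cause(span: Span) -> Span:
    """Follow error=true child spans recursively and return the deepest one."""
    for child in span.children:
        if child.error:
            return find_probable_cause(child)
    return span  # no failing child, so this span is closest to the problem


def team_to_page(exemplars: List[Span]) -> Optional[str]:
    """Return the team owning the most common probable cause, or None."""
    votes = Counter(
        find_probable_cause(root).owner_team for root in exemplars if root.error
    )
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical exemplar trace: the SLO is defined on the checkout span.
gateway = Span("bank-gateway", "team-gateway", error=True)
fraud = Span("fraud-check", "team-risk")
payment = Span("payment", "team-payment", error=True, children=[fraud, gateway])
cart = Span("cart", "team-cart")
checkout = Span("checkout", "team-checkout", error=True, children=[cart, payment])

paged_team = team_to_page([checkout])
```

In this made-up trace the walk goes checkout, payment, bank-gateway, so only `team-gateway` would be paged, even though the alert rule lives on the checkout operation.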

With this process created by Zalando, your teams will thank you for the best present you could ever give them: a break from being paged every time an alert is triggered!

For a deeper dive into Adaptive Paging, how it can help your team, as well as the challenges that you still may face with it, head to our YouTube page for the full talk.

Interested in joining our team? See our open positions.

December 9, 2020

About the author

Lindsay Neeson
