Adaptive Paging and the (dreaded) Christmas Tree Effect

We recently sat down with Luis Mineiro, Head of Site Reliability Engineering at Zalando, during a session of our 99PercentVisible: DevOps Reliability tech talks. His talk, “Are we on the same page? Let’s fix that,” discusses the practice of alerting and how it can go from a routine task to a source of team burnout if not done correctly.

We’ll recap the conversation and do a deep dive on what is known as “The Christmas Tree Effect” and how Adaptive Paging is the correct solution for it.

The Christmas Tree Effect

Hypothetical: Let’s say there’s an outage. Teams across the company have set up alerts to make sure the microservices they are responsible for are covered. Setting up alerts is proactive, right? Unfortunately, this isn’t always the case. Because each team has set up alerts for its own microservices, a single outage can now trigger hundreds, if not thousands, of alerts.

Enter: The Christmas Tree Effect.

Adaptive Paging - Christmas Tree Effect

Pagers are now going off everywhere. As Luis says, “It's almost the same [as a Christmas tree], except I'm pretty sure that happiness levels are very different, and I'm very sure no one is getting any presents.”

What's happening? There’s an excessive number of paging alerts leading to a high amount of traction when typically only one team is able to solve the issue. You have now wasted time, created alert fatigue, and burnt out your teams. Sound familiar?

Symptomatic Alerts

Current time: Alerting on every microservice creates headaches. (Who knew?) The industry standard now is alerting on symptoms, known as Symptomatic Alerts. This creates an alert at a location with a good signal-to-noise ratio, based on a controllable and measurable Service Level Objective (SLO).

Let’s pause for a second and break down the word: control. We discussed here how stress is, by definition, responsibility without control. Teams need to be paged accurately to maintain their scope of control and responsibility, all while proactively trying to reduce the inevitable stress.

Deep Systems Control vs Responsibility

Is alerting on symptoms the most productive way of reducing stress? Good news: When an SLO is violated, only one team is alerted. Woo! Bad news: That same team will be paged for each and every possible failure in the distributed system. Although this is slightly better, this still leads to the same symptoms of alert fatigue and team burnout, granted on a smaller scale. Alas, stress is still lurking.
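As a rough illustration of what alerting on a symptom can look like, here is a minimal Python sketch. It assumes a single customer-facing operation with a success-rate SLO; the `Window` type, the `slo_violated` function, and the numbers are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one customer-facing operation over an evaluation window."""
    total: int
    failed: int


def slo_violated(window: Window, slo_target: float) -> bool:
    """Return True when the measured success rate falls below the SLO target."""
    if window.total == 0:
        return False  # no traffic in the window, so nothing to judge
    success_rate = (window.total - window.failed) / window.total
    return success_rate < slo_target


# Hypothetical numbers: a 99.9% success-rate SLO on one operation.
healthy = Window(total=10_000, failed=5)     # 99.95% success
degraded = Window(total=10_000, failed=50)   # 99.5% success

print(slo_violated(healthy, 0.999))
print(slo_violated(degraded, 0.999))
```

Note that the check says nothing about which downstream service caused the failures, which is why the owning team is paged no matter where the fault actually sits.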

Adaptive Paging - Symptomatic Alerts

I know what you're thinking: is there ever going to be a viable solution for alerting? Spoiler alert: there is.

Adaptive Paging

“Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem.”

Say goodbye to The Christmas Tree Effect! From a single alerting rule, a page will be sent, the most probable root cause will be identified, and only the respective team member will receive the page.

Adaptive Paging - Single Page

Here’s a breakdown of the Adaptive Paging steps. After your team sets an SLO, creates the alert, and the alert is triggered, the alert handler:

  • Collects exemplars: A set of traces that are representative of the situation.

  • Checks all child spans starting at the signal of the alert: Starts at the span where the SLO is defined and decides which path to take to the affected operation.

  • Follows the path where error=true: Repeats the check recursively on child spans until the service with the error is found.

  • Sends a single page to the one team responsible for the service.
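The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Zalando's implementation: the `Span` type, the `owners` mapping from service to team, and the majority vote across exemplars are all hypothetical, and the sketch follows only the first failing child at each level.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """A simplified trace span; `error` mirrors the error=true tag."""
    service: str
    error: bool = False
    children: list[Span] = field(default_factory=list)


def deepest_failing_span(span: Span) -> Span:
    """Follow error=true child spans until no failing child remains."""
    for child in span.children:
        if child.error:
            return deepest_failing_span(child)
    return span


def team_to_page(exemplars: list[Span], owners: dict[str, str]) -> str:
    """Page the team owning the service most often found at the end of the error path."""
    votes = Counter(deepest_failing_span(root).service for root in exemplars)
    service, _ = votes.most_common(1)[0]
    return owners[service]


def failing_trace() -> Span:
    # checkout fails because payment fails because card-gateway fails
    gateway = Span("card-gateway", error=True)
    payment = Span("payment", error=True, children=[Span("fraud-check"), gateway])
    return Span("checkout", error=True, children=[Span("cart"), payment])


# Hypothetical exemplars: two traces end at card-gateway, one ends at payment.
exemplars = [
    failing_trace(),
    failing_trace(),
    Span("checkout", error=True, children=[Span("payment", error=True)]),
]
owners = {
    "checkout": "team-checkout",
    "payment": "team-payment",
    "card-gateway": "team-gateway",
}

print(team_to_page(exemplars, owners))
```

The alert is defined once, on the checkout span where the SLO lives, yet the page goes to the team that owns the failing service further down the trace.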

With this process created by Zalando, your teams will thank you for the best present you could ever give them: a break from being paged every time an alert is triggered!

For a deeper dive into Adaptive Paging, how it can help your team, and the challenges you may still face with it, head to our YouTube page for the full talk.

Interested in joining our team? See our open positions here.

December 9, 2020
4 min read
Observability

About the author

Lindsay Neeson
