Adaptive Paging and the (dreaded) Christmas Tree Effect
by Lindsay Neeson
We recently sat down with Luis Mineiro, Head of Site Reliability Engineering at Zalando, during a session of our 99PercentVisible: DevOps Reliability tech talks. The session, “Are we on the same page? Let’s fix that,” discusses the practice of alerting and how it can go from a routine task to a cause of team burnout if not done correctly.
We’ll recap the conversation and do a deep dive on what is known as “The Christmas Tree Effect” and how Adaptive Paging is the correct solution for it.
Hypothetical: Let’s say there’s an outage. Teams across the company have set up alerts to make sure the microservices they are responsible for are covered. Setting up alerts is proactive, right? Unfortunately, this isn’t always the case. Because each team has set up alerts for its own microservices, a single outage now triggers hundreds, if not thousands, of alerts.
Enter: The Christmas Tree Effect.
Pagers are now going off everywhere. As Luis says, “It's almost the same [as a Christmas tree], except I'm pretty sure that happiness levels are very different, and I'm very sure no one is getting any presents.”
What's happening? There’s an excessive number of paging alerts, leading to a high amount of distraction when typically only one team is able to solve the issue. You have now wasted time and created alert fatigue, and the teams are burnt out. Sound familiar?
Current Time: Alerting on every microservice creates headaches. (Who knew?) The industry standard now is alerting on symptoms, known as Symptomatic Alerts. A symptomatic alert is defined at a location with a good signal-to-noise ratio and is based on a controllable and measurable Service Level Objective (SLO).
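To make the idea of a symptom-based alert concrete, here is a minimal sketch of an SLO check in Python. The `SLO` class, the `slo_violated` function, and the numbers are hypothetical and chosen only for illustration; they are not taken from the talk or from any particular alerting product.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical SLO definition for a single operation."""
    operation: str
    target_success_ratio: float  # e.g. 0.999 means 99.9% of requests succeed


def slo_violated(slo: SLO, total_requests: int, failed_requests: int) -> bool:
    """Return True when the observed success ratio falls below the SLO target."""
    if total_requests == 0:
        # No traffic in the window, so there is no symptom to alert on.
        return False
    success_ratio = (total_requests - failed_requests) / total_requests
    return success_ratio < slo.target_success_ratio


checkout_slo = SLO(operation="place_order", target_success_ratio=0.999)
print(slo_violated(checkout_slo, total_requests=10_000, failed_requests=50))
```

The point of the sketch is that one rule watches one customer-facing symptom, rather than every microservice carrying its own paging alert.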
Let’s pause for a second and break down the word: control. We discussed here how stress is, by definition, responsibility without control. Teams need to be able to be paged accurately to maintain their scope of control and responsibility — all while proactively trying to reduce the inevitable stress.
Is alerting on symptoms the most productive way of reducing stress? Good news: When an SLO is violated, only one team is alerted. Woo! Bad news: That same team will be paged for each and every possible failure in the distributed system, including failures in services it does not own. Although this is slightly better, it still leads to the same symptoms of alert fatigue and team burnout, albeit on a smaller scale. Alas, stress is still lurking.
I know what you're thinking: is there ever going to be a viable solution for alerting? Spoiler alert: there is.
“Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem.”
Say goodbye to The Christmas Tree Effect! From a single alerting rule, the most probable root cause is identified and a single page is sent, so only the respective team member receives it.
Here’s a breakdown of the Adaptive Paging steps. After your team sets an SLO, creates the alert, and the alert is triggered, the alert handler:
- Collects exemplars: a set of traces that are representative of the situation. (Lightstep collects a set of exemplars in the alerting payload for you!)
- Checks all child spans starting at the signal of the alert: beginning at the span where the SLO is defined, it decides which path to take to the affected operation.
- Follows the path where error=true: it repeats the check of all child spans recursively until an error in a service is found.
- Sends a single page to the one team responsible for the service.
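The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Zalando's implementation: the `Span` class, the `owners` mapping, and the majority vote across exemplars are hypothetical choices made for this example. The only convention taken from OpenTracing is the boolean `error` tag on a span.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span: one operation inside one service."""
    service: str
    operation: str
    tags: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def is_error(span: Span) -> bool:
    # OpenTracing's semantic conventions mark failed spans with error=true.
    return span.tags.get("error") is True


def find_probable_cause(span: Span) -> Span:
    """Follow error=true child spans recursively; stop when no child has an error."""
    for child in span.children:
        if is_error(child):
            return find_probable_cause(child)
    return span


def team_to_page(exemplars: List[Span], owners: Dict[str, str]) -> Optional[str]:
    """Pick the service blamed by most exemplars (an assumption) and return its owner."""
    votes = Counter(find_probable_cause(root).service for root in exemplars)
    if not votes:
        return None
    service, _ = votes.most_common(1)[0]
    return owners.get(service)


# Hypothetical trace: checkout -> payment -> payment-db, with errors along that path.
db = Span("payment-db", "query", {"error": True})
payment = Span("payment", "charge", {"error": True}, [db])
cart = Span("cart", "get_items")
root = Span("checkout", "place_order", {"error": True}, [cart, payment])

owners = {"checkout": "team-checkout", "payment-db": "team-payments-storage"}
print(team_to_page([root], owners))
```

In this made-up trace the alert fires on the checkout operation, but the traversal ends at the database span, so the page goes to the team that owns that service rather than to the team that owns the SLO.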
With this process created by Zalando, your teams will thank you for the best present you could ever give them — a break from being paged every time an alert is triggered!
For a deeper dive into Adaptive Paging, how it can help your team, as well as the challenges that you still may face with it, head to our YouTube page for the full talk.