We recently sat down with Luis Mineiro, Head of Site Reliability Engineering at Zalando, during a session of our 99PercentVisible: DevOps Reliability tech talks. His talk, “Are we on the same page? Let’s fix that,” discusses the practice of alerting and how it can go from a routine task to a cause of team burnout if not done correctly.
We’ll recap the conversation and do a deep dive on what is known as “The Christmas Tree Effect” and how Adaptive Paging is the correct solution for it.
The Christmas Tree Effect
Hypothetical: Let’s say there’s an outage. Teams across the company have set up alerts to make sure the microservices they are responsible for are covered. Setting up alerts is proactive, right? Unfortunately, this isn’t always the case. Because each team has set up alerts for its own microservices, the outage has just triggered hundreds, if not thousands, of alerts.
Enter: The Christmas Tree Effect.
Pagers are now going off everywhere. As Luis says, “It's almost the same [as a Christmas tree], except I'm pretty sure that happiness levels are very different, and I'm very sure no one is getting any presents.”
What's happening? An excessive number of paging alerts is pulling many teams into the incident when typically only one team is able to solve the issue. You have now wasted time and created alert fatigue, and the teams are burnt out. Sound familiar?
Current Time: Alerting on every microservice creates headaches. (Who knew?) The industry standard now is alerting on symptoms, known as Symptomatic Alerts. This places the alert at a location with a good signal-to-noise ratio, based on a controllable and measurable Service Level Objective (SLO).
Let’s pause for a second and break down the word: control. We discussed here how stress is, by definition, responsibility without control. Teams need to be able to be paged accurately to maintain their scope of control and responsibility, all while proactively trying to reduce the inevitable stress.
Is alerting on symptoms the most productive way of reducing stress? Good news: When an SLO is violated, only one team is alerted. Woo! Bad news: That same team will be paged for each and every possible failure in the distributed system. Although this is slightly better, this still leads to the same symptoms of alert fatigue and team burnout, granted on a smaller scale. Alas, stress is still lurking.
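As a rough illustration of a symptom-based alert, the sketch below evaluates a single success-rate SLO for one customer-facing operation. The function name, the request counts, and the 99.9% target are all hypothetical, chosen only for illustration and not taken from the talk.

```python
def slo_violated(total_requests: int, failed_requests: int, slo_target: float) -> bool:
    """Return True when the observed success rate falls below the SLO target.

    Hypothetical helper: one rule on the symptom (failed requests at the
    edge operation) instead of one rule per microservice.
    """
    if total_requests == 0:
        # No traffic in the window, so there is nothing to alert on.
        return False
    success_rate = 1 - failed_requests / total_requests
    return success_rate < slo_target


if __name__ == "__main__":
    # Assumed numbers: 5 failures out of 1,000 requests against a 99.9% SLO.
    if slo_violated(total_requests=1000, failed_requests=5, slo_target=0.999):
        print("SLO violated: page the team that owns this operation")
```

Note that this rule pages the same owning team no matter which downstream service actually failed, which is exactly the limitation described above.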
I know what you're thinking; is there ever going to be a viable solution for alerting? Spoiler alert: there is.
“Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem.”
Say goodbye to The Christmas Tree Effect! From a single alerting rule, the most probable root cause is identified and a single page is sent, only to the team responsible for it.
Here’s a breakdown of the Adaptive Paging steps. After your team sets an SLO, creates the alert, and the alert is triggered, the alert handler:
Collects exemplars: A set of traces that are representative of the situation.
Lightstep collects a set of exemplars in the alerting payload for you!
Checks all child spans, starting at the span where the SLO is defined (the signal of the alert), to decide which path to follow toward the affected operation.
Follows error=true: Recursively checks the child spans until the service where the error occurs is found.
Sends a single page to the one team responsible for the service.
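The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Zalando's or Lightstep's implementation: the `Span` class, its `team` ownership field, the `find_error_origin` and `team_to_page` helpers, the majority vote across exemplars, and the fallback team are all hypothetical. Only the idea of following the `error=true` tag down the child spans comes from the talk.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Simplified, hypothetical stand-in for a span in an exemplar trace."""
    service: str
    team: str                      # assumed ownership tag, not an OpenTracing standard tag
    error: bool = False            # mirrors the OpenTracing `error` tag
    children: List["Span"] = field(default_factory=list)


def find_error_origin(span: Span) -> Optional[Span]:
    """Follow error=true child spans recursively and return the deepest one."""
    if not span.error:
        return None
    for child in span.children:
        origin = find_error_origin(child)
        if origin is not None:
            return origin          # a child is closer to the problem
    return span                    # no erroring child: this span is the origin


def team_to_page(exemplars: List[Span], fallback_team: str) -> str:
    """Pick one team to page from a set of exemplar traces.

    Assumption: the team owning the most error origins wins. If no
    exemplar contains an error, the owner of the SLO (fallback) is paged.
    """
    votes = Counter()
    for root in exemplars:
        origin = find_error_origin(root)
        if origin is not None:
            votes[origin.team] += 1
    if not votes:
        return fallback_team
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical trace: checkout -> payment (failing) -> database (healthy).
    database = Span("database", "team-storage")
    payment = Span("payment", "team-payment", error=True, children=[database])
    checkout = Span("checkout", "team-checkout", error=True, children=[payment])
    print(team_to_page([checkout], fallback_team="team-checkout"))
```

In this sketch the SLO is defined on the checkout span, but the page goes to the team that owns the payment service, because that is the deepest span carrying the error tag.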
With this process created by Zalando, your teams will thank you for the best present you could ever give them: a break from being paged every time an alert is triggered!
For a deeper dive into Adaptive Paging, how it can help your team, as well as the challenges that you still may face with it, head to our YouTube page for the full talk.
Interested in joining our team? See our open positions here.
December 9, 2020
About the author
Lindsay Neeson