DevOps Best Practices
On-Call Duty: At 2 am, Intuition Isn’t Enough
by Dennis Chu
Terrified, anxious, fearful, isolating, overwhelming, and a little bit thrilling. Those are just a few of the words on-call engineers use to describe what it's like to be on call for the first time. There are so many things that could go wrong and areas of the system that they don't fully understand yet. But at some point, it's time.
On-call duty is a dirty little secret in the SaaS world. There's so much buzz about the promise of a career as an engineer: solving difficult problems, building potentially life-changing applications, writing great code. But during college or grad school, nobody ever talks about being on call.
The skills an engineer needs on call are completely different from the ones they use in their day job. During the day, an engineer may be praised for finding the most thoughtful or elegant solution. On call, however, it's all about speed and resolving the issue. It isn't about finding the best solution because that might take too long. You have to "stop the bleeding" and contain the situation.
Engineers who have been around for a while have seen dramatic changes with the rise of SaaS. Gone are the days of shrink-wrapped software or firmware that's only updated once or twice a year. Software releases used to be infrequent. If something broke, the engineers would deploy a patch. Remember Patch Tuesday? Back in those days, typically only doctors were paged in the middle of the night; certainly not engineers.
Life has changed for engineers now, and there's no going back. Some engineers may be part of an SRE team. You would think those people signed up to be paged in the middle of the night and must have the answers. Unfortunately, they may have deep knowledge of only one small part of the service, such as a particular microservice or set of microservices. Even with that deep knowledge, they may be unfamiliar with the services that call into their code, or the services their code calls. In a bigger company, they may not know the right person to contact once they've ruled out every hypothesis about what might be wrong with their area of the service.
Being on call requires changing your life, even if you never get paged. Forget going skiing for the weekend or even having a second glass of wine with dinner. You need to have your laptop at the ready and have full possession of your faculties. For some, preparing to be on call can also involve reviewing recent code changes, looking at handoff notes from the person previously on call, and praying for good luck.
For many on-call engineers, preparation includes having a hotspot device, packing extra batteries for tethering, triple checking that they've adjusted their phone settings, never being out of cell range, and carrying their laptop – even for that walk around the park. Many on-call rotations require engineers to be able to start looking for answers within 20 – or even 10 – minutes of being paged.
Timing is everything when it comes to being on call. The page that comes in at 10 am is completely different from the one that comes in at 2 am. And people are completely different too. Let's face it, people just aren't that alert at 2 am if they've just been awakened by a page – that is, if they've been able to sleep at all. The anxiety of possibly getting paged can ruin REM sleep. At 2 am, reading comprehension plummets and deductive reasoning skills aren't at their sharpest.
At 10 am, you have the luxury of trying to figure things out on your own for a period of time. Even if you want to, you know you should not escalate immediately. However, if you run out of hunches about what could be wrong, you don't feel that guilty about posting a message in Slack or pinging one of your co-workers who knows a portion of the system better than you. But at 2 am, it's a different story.
At 2 am, the page comes in, and you stumble out of bed and grab your laptop as fast as you can. You may see a one-line description of the issue. If this is your first time on call, you may understand the words in the alert without really knowing what they mean. Has the entire system crashed?
If you're lucky, you might have a solid playbook you can use. It may be detailed, but you aren't reading clearly at this point, and you might miss something critical. You're in a race against time, looking for clues. Your blood is pumping and you may feel all alone, like you're about to give a speech to a huge audience and you don't feel prepared.
You rely on your intuition and start testing your hypotheses. Did something change? What time did the service start having issues, and what else happened at that time? You might start looking at graphs, but they may not make any sense to you, and you may not know what's normal and what isn't.
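To make the "what else happened at that time?" question concrete, here is a minimal sketch of how you might shortlist recent changes around the moment a service started misbehaving. Everything in it is an assumption for illustration: the `ChangeEvent` record, the `changes_before` helper, the one-hour lookback, and the sample data are hypothetical and don't come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of something that changed (a deploy, a config edit, ...)."""
    when: datetime
    service: str
    description: str


def changes_before(events, incident_start, lookback=timedelta(hours=1)):
    """Return the events that happened within `lookback` before the incident.

    The result is sorted newest first, so the change closest to the start of
    the incident is the first suspect you look at.
    """
    window_start = incident_start - lookback
    candidates = [e for e in events if window_start <= e.when <= incident_start]
    return sorted(candidates, key=lambda e: e.when, reverse=True)


if __name__ == "__main__":
    # Made-up sample data: the incident starts at 2 am.
    incident_start = datetime(2024, 1, 1, 2, 0)
    events = [
        ChangeEvent(datetime(2023, 12, 31, 22, 0), "search", "deploy"),
        ChangeEvent(datetime(2024, 1, 1, 1, 20), "gateway", "config change"),
        ChangeEvent(datetime(2024, 1, 1, 1, 50), "checkout", "deploy"),
        ChangeEvent(datetime(2024, 1, 1, 2, 5), "checkout", "rollback"),
    ]
    for event in changes_before(events, incident_start):
        print(event.when.isoformat(), event.service, event.description)
```

A shortlist like this only narrows the search. A change that landed just before the incident is a suspect, not a proven cause, and the right lookback window depends on how quickly problems tend to surface in your system.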
The primary focus is to stop or contain "the bleeding" as fast as humanly possible. For some on-call engineers, this may mean taking a heavy-handed approach. They may restart the service, allocate way more resources to it than it needs, and update the configurations. They're desperate for a quick fix.
For others, it can be a thrill, especially if they're adrenaline junkies. They have carte blanche to fix things, and there are no code reviews at 2 am. Nobody knows the answer and most problems are unique in some way. They can be the hero and solve the problem, but they might also make it worse.
Being on call is a high-stakes, stressful situation. At Lightstep, we're focused on making life easier for on-call engineers. We know what it's like and understand the impact they can have on their business. We want to provide the clues and insights engineers need to reduce the search space, use real data to test a hypothesis, and "stop the bleeding" as quickly as possible.
Stay tuned for upcoming blog posts about new capabilities from us that will make relying on intuition a thing of the past – and hopefully get you back to sleep faster.