On-Call Duty: At 2 am, Intuition Isn’t Enough

Terrifying, anxiety-inducing, isolating, overwhelming, and a little bit thrilling. Those are just a few of the words engineers use to describe what it’s like to be on call for the first time. There are so many things that could go wrong and areas of the system that they don’t fully understand yet. But at some point, it’s time.

On-call duty: the dirty little secret

On-call duty is a dirty little secret in the SaaS world. There’s so much buzz about the promise of a career as an engineer: solving difficult problems, building potentially life-changing applications, writing great code. But during college or grad school, nobody ever talks about being on call.

The skills required of an on-call engineer are completely different from those they use in their day job. An engineer may be praised for finding the most thoughtful or elegant solution. While on call, however, it’s all about speed and resolving the issue. It isn’t about finding the best solution, because that might take too long. You have to “stop the bleeding” and contain the situation.

SaaS changed the world

Engineers who have been around for a while have seen dramatic changes with the rise of SaaS. Gone are the days of shrink-wrapped software or firmware that’s only updated once or twice a year. Software releases used to be infrequent. If something broke, the engineers would deploy a patch. Remember Patch Tuesday? Back in those days, typically only doctors were paged in the middle of the night; certainly not engineers.

There’s no going back

Life has changed for engineers now, and there’s no going back. Some engineers may be part of an SRE team. You would think those people signed up to be paged in the middle of the night and must have the answers. Unfortunately, they may have deep knowledge of only one small part of the service, or of a particular microservice or set of microservices. Even with that deep knowledge, they may be unfamiliar with the services that call into their code, or the services their code calls. In a bigger company, they may not know the right person to contact once they’ve ruled out every hypothesis about what might be wrong with their area of the service.

Preparing to be on call

Being on call requires changing your life, even if you never get paged. Forget going skiing for the weekend or even having a second glass of wine with dinner. You need to have your laptop at the ready and have full possession of your faculties. For some, preparing to be on call can also involve reviewing recent code changes, looking at handoff notes from the person previously on call, and praying for good luck.

For many on-call engineers, preparation includes having a hotspot device, packing extra batteries for tethering, triple-checking that they’ve adjusted their phone settings, never being out of cell range, and carrying their laptop – even for that walk around the park. Many on-call rotations require engineers to be able to start looking for answers within 20 – or even 10 – minutes of being paged.

10 am vs. 2 am: timing is everything

Timing is everything when it comes to being on call. The page that comes in at 10 am is completely different from the one that comes in at 2 am. And people are completely different too. Let’s face it, people just aren’t that alert at 2 am if they’ve just been awakened by a page – that is, if they’ve been able to sleep at all. The anxiety of possibly getting paged can ruin REM sleep. At 2 am, reading comprehension plummets and deductive reasoning skills aren’t at their sharpest.

At 10 am, you have the luxury of trying to figure things out on your own for a while. Even if you want to escalate immediately, you know you shouldn’t. However, if you run out of hunches about what could be wrong, you don’t feel that guilty about posting a message in Slack or pinging a co-worker who knows a portion of the system better than you do. But at 2 am, it’s a different story.

Frantically searching for clues

At 2 am, the page comes in, and you stumble out of bed and grab your laptop as fast as you can. You may see a one-line description of the issue. If this is your first time on call, you may cognitively understand what the alert says, but you really don’t know what it means. Has the entire system crashed?

If you’re lucky, you might have a solid playbook you can use. It may be detailed, but you aren’t reading clearly at this point, and you might miss something critical. You’re in a race against time, looking for clues. Your blood is pumping, and you may feel all alone, as if you’re about to give a speech to a huge audience without feeling prepared.

You rely on your intuition and start testing hypotheses. Did something change? When did the service start having issues, and what else happened at that time? You might start looking at graphs, but they may not make any sense to you, and you may not know what’s normal and what isn’t.

The primary focus is to stop or contain “the bleeding” as fast as humanly possible. For some on-call engineers, this may mean taking a heavy-handed approach. They may restart the service, allocate way more resources to it than it needs, and update the configurations. They’re desperate for a quick fix.

For others, it can be a thrill, especially if they’re adrenaline junkies. They have carte blanche to fix things, and there are no code reviews at 2 am. Nobody knows the answer and most problems are unique in some way. They can be the hero and solve the problem, but they might also make it worse.

Making life easier for on-call engineers

Being on call is a high-stakes, stressful situation. At LightStep, we’re focused on making life easier for on-call engineers. We know what it’s like and understand the impact they can have on their business. We want to provide the clues and insights engineers need to reduce the search space, use real data to test a hypothesis, and “stop the bleeding” as quickly as possible.

Stay tuned for upcoming blog posts about new capabilities from us that will make relying on intuition a thing of the past – and hopefully get you back to sleep faster.

LightStep and OpsGenie Partner to Improve Application Performance and Incident Management

Microservices-based architectures enable software teams to deliver innovations and value to their customers faster. Microservices are often owned by individual engineering teams that are solely responsible for everything from development to deployment. This autonomy reduces cross-team dependencies, but it also often means each development team is solely accountable for the ongoing performance of their own services in production. Using LightStep [x]PM and integrated solutions such as OpsGenie, a leading incident management platform, teams are proactively alerted when potential SLA violations or latency issues occur, and can see the associated end-to-end traces to pinpoint root causes quickly.

LightStep [x]PM is unique because it analyzes completely unsampled trace data and is able to segment this information by extremely high-cardinality key:value tags, such as customer IDs or build numbers. This means [x]PM captures every performance anomaly or failure, no matter how brief or rare the occurrences are. [x]PM is the ideal solution for companies with microservices-based applications, because users can isolate real-time and historical performance data along any dimension and uncover root causes even for complex transactions spanning service boundaries – letting teams focus on the issues they’re responsible for.
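To make the idea of high-cardinality segmentation concrete, here is a minimal Python sketch of grouping span latencies by an arbitrary tag such as a customer ID. This is an illustration of the general technique, not the [x]PM implementation; the span shape (`duration_ms`, `tags`) is an assumption for the example.

```python
from collections import defaultdict

def segment_latency(spans, tag_key):
    """Group span latencies by the value of a high-cardinality tag.

    `spans` is assumed to be a list of dicts like
    {"duration_ms": 12.5, "tags": {"customer_id": "c-42"}}.
    Returns the worst latency observed for each tag value.
    """
    groups = defaultdict(list)
    for span in spans:
        value = span["tags"].get(tag_key)
        if value is not None:
            groups[value].append(span["duration_ms"])
    return {value: max(durations) for value, durations in groups.items()}

# Example: isolate latency per customer.
spans = [
    {"duration_ms": 12.0, "tags": {"customer_id": "c-1"}},
    {"duration_ms": 480.0, "tags": {"customer_id": "c-2"}},
    {"duration_ms": 15.0, "tags": {"customer_id": "c-1"}},
]
worst = segment_latency(spans, "customer_id")
```

In a real tracing backend this aggregation runs over unsampled data at scale; the point is that any tag key – customer ID, build number, region – can become a dimension for isolating a regression.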

LightStep [x]PM with OpsGenie alerts the right people based on on-call schedules when SLA violations or latency issues occur.
Useful information is meaningful only when users receive it when and where they need it. [x]PM has been integrated with complementary DevOps solutions so that teams can access their performance data in their preferred, existing workflows. Our customers asked us to integrate with the OpsGenie incident management platform for operating always-on services. When a Service Level Agreement (SLA) threshold is violated or the violation is resolved, [x]PM sends JSON notifications to OpsGenie, which automatically creates custom alerts, notifies the right people based on on-call schedules – via email, text messages (SMS), phone calls, and iOS and Android push notifications – and escalates alerts until they are acknowledged or closed.
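As a rough sketch of the notification flow, the function below maps a hypothetical SLA-violation event to an OpsGenie-style alert payload. All field names here (`service`, `violated`, `observed_ms`, `threshold_ms`, `message`, `priority`) are illustrative assumptions, not the actual [x]PM or OpsGenie schemas.

```python
def sla_event_to_alert(event):
    """Translate a hypothetical SLA notification into an alert payload.

    A violation becomes a high-priority alert; a resolution becomes
    a low-priority informational one.
    """
    status = "violated" if event["violated"] else "resolved"
    return {
        "message": (
            f"SLA {status}: {event['service']} p99 "
            f"{event['observed_ms']}ms (threshold {event['threshold_ms']}ms)"
        ),
        "priority": "P1" if event["violated"] else "P5",
        "tags": ["sla", event["service"]],
    }

# Example: a latency SLA violation on a checkout service.
alert = sla_event_to_alert({
    "service": "checkout",
    "violated": True,
    "observed_ms": 930,
    "threshold_ms": 500,
})
```

The real integration posts JSON like this to OpsGenie, which then handles routing, on-call schedules, and escalation.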

We’re excited about the value of this new integration for our customers. We’ll continue to enhance [x]PM to work well with popular tools and DevOps best practices for adopting, developing, and maintaining microservices-based applications.

Try it out and share your feedback at support@lightstep.com, and let us know what other integrations you’d like to see.

Twilio Engineer Shares How They Achieve Five 9s of Availability

In our recent tech talk on SD Times – Managing the Performance of Applications in the Microservices Era – Tyler Wells, Director of Engineering at Twilio, shared his insights on how to effectively manage the performance of microservices-based applications and how Twilio achieves five 9s of availability and success.

Tyler said that integrating new tools and solutions into a developer’s workflow can be a challenge for any organization: there needs to be a big carrot. For Twilio, the carrot was a 92% reduction in mean time to resolution (MTTR) for production incidents and a 70% improvement in mean latency for critical services. Now, they can also detect failures before they impact customers. This article shows how they accomplished these results and how other organizations can do the same.

How Twilio integrated [x]PM into its engineering process and workflow

Tyler described why his team was motivated to try [x]PM and how it fit into their workflow. “Twilio was born and raised in the cloud and has always been built on distributed microservices. My team was an early adopter of LightStep. We were excited about the opportunity to instrument and add tracing to the complex distributed systems we have in the Programmable Video group. You can imagine that setting up a video call involves a lot of steps, and there are a lot of systems. The orchestration messages have to pass through: authorization, authentication, creating the Room [session], orchestrating the Room, adding Participants to the Room. These are all distributed systems, so we added tracing, including Tags and rich information specific to our business, and we started watching. We watched the p99 latency, and we started honing in on the outliers. As we highlighted these outliers, we pulled the information we needed to help identify one of these Rooms using [the Room’s] Sid or GUIDs. We used those IDs to look through [LightStep] and figure out, from the highlighted spans showing the latency, exactly what was going on. That was our first experience with LightStep and how we started to derive value.”
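The workflow Tyler describes (watch the p99 latency, then pull the IDs of the outlier Rooms) can be sketched in a few lines of Python. This is an illustrative stand-in, not Twilio’s or LightStep’s code, and the trace shape (`room_sid`, `duration_ms`) is an assumption for the example.

```python
import math

def p99(latencies):
    """Nearest-rank 99th percentile of a list of latencies."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered)) - 1
    return ordered[rank]

def outlier_rooms(traces):
    """Return the Room SIDs of traces slower than the p99 latency."""
    threshold = p99([t["duration_ms"] for t in traces])
    return [t["room_sid"] for t in traces if t["duration_ms"] > threshold]

# Example: 99 fast Room setups and one pathological one.
traces = [{"room_sid": f"RM-{i}", "duration_ms": 50} for i in range(99)]
traces.append({"room_sid": "RM-slow", "duration_ms": 5000})
slow = outlier_rooms(traces)
```

The IDs that fall out of this kind of query are what let the team jump straight from “p99 is bad” to the specific traces – and spans – responsible.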

LightStep [x]PM - Managing Application Performance in the Microservices Era

Monitor latency, alert on SLA violations, and focus on the outliers to quickly determine root cause

How chaos actually helps

Tyler talked about the benefits of always assuming that things will break. “We like to break our systems before we put them into the hands of our customers, so we do a lot of Chaos Engineering. We use a tool like Gremlin to start breaking things. LightStep makes it easy for us to be able to hone in on what happens when things go wrong. We know when you’re operating in the cloud, everything is going to break at some point in time. Using LightStep in conjunction with our ‘Game Days,’ we got a ton of visualization, so we could create the SLA alerts, which we have integrated into PagerDuty and Slack. If incidents are triggered, our team immediately shows up in a Slack channel and all of the rich LightStep information is there for us to help identify issues.”

Achieving five 9s of availability and success

Tyler explains how they achieve operational excellence. “We have a program at Twilio called Operational Maturity Model (OMM). It’s a program all teams must follow when pushing product into production. The program has a number of different dimensions: LightStep sits in the Operations dimension. We have a specific policy in the Operations dimension that’s literally called LightStep. There are a number of items in every dimension that teams need to check off to reach a specific grade, with the highest grade being Iron Man. In order for any team to go into production and claim general availability, they have to implement LightStep, use LightStep as part of their Game Days, and they have to achieve Iron Man status. That’s how we use it at Twilio.”

Tyler summarized Twilio’s focus on operational excellence to build customer confidence: “We typically target five 9s [99.999%] of availability and five 9s of success. Generally speaking, five 9s is discipline, not luck.”

Overcoming resistance to change

Tyler described how his team was able to show results and convince other teams at Twilio to use [x]PM. “Any time you try to introduce a new tool to engineers, there’s always going to be some level of resistance. Everybody has more work on their plates and in their backlog than they can handle, and then someone shows up and says: ‘hey, here’s this really cool tool that you should try.’ It’s always met with a healthy dose of skepticism. We had some teams that were early adopters that really derived incredible value from using LightStep. We were able to articulate those results and show other teams (that may have been skeptical). We showed how it helped us solve production-level issues, meet our goals on the operational excellence front, and deliver that higher level of operational maturity to our customers.”

Watch the tech talk, Managing the Performance of Applications in the Microservices Era, to get all of the details about how Twilio is using [x]PM. Don’t miss the demo to see [x]PM in action.