On-Call Schedules: Striking the Perfect Balance

DevOps and Site Reliability Engineers: your prayers have been answered! 🙏🏼

With Lightstep Incident Response, you can finally orchestrate all your on-call schedules, precise notifications, smart alert grouping, and automated incident response in one place. Finally, everything is under control.

While engineers have continued to develop ways to make our systems more available and robust, downtime is still an unfortunate reality for just about every product or service. For example, your service might experience server timeouts due to a sudden, unexpected rise in traffic.

To mitigate these issues and minimize overall downtime, it's important for teams to always have a designated on-call engineer who can quickly respond to alerts. This is where on-call schedules come in handy.

The goal of an on-call schedule is to ensure 24/7 coverage of your service while spreading the operational load across your team as evenly as possible. This is easy to say, but hard to do in practice.

Creating an on-call schedule is often more nuanced than simply putting all your engineers on a weekly rotation. While that may work for some teams, it may not work for your specific product or organization. In addition, you’ve got to be extra cautious to avoid causing employee burnout with your on-call scheduling.

The high-availability requirements of today's products and systems generally require 24/7 support, so on-call schedules are here to stay. But how do you build an on-call schedule in the face of all these challenges?

In this article, you'll learn about general on-call schedule concepts and strategies for dealing with common challenges. You'll also see an example of how Lightstep makes it easy for you to quickly create an effective on-call rotation for your team.

Understanding On-Call Scheduling

Before getting into building your own on-call schedule, there are a few key concepts that you need to know and consider.

Rotations

On-call schedules are also often referred to as rotations because shifts typically rotate through a number of contacts in a cyclical format.
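
The cyclical behavior can be sketched in a few lines of Python. This is only an illustration, assuming fixed-length shifts; the `on_call_for` helper, the team names, and the start date are hypothetical and are not part of Lightstep.

```python
from datetime import date

def on_call_for(day: date, engineers: list[str], start: date, shift_days: int = 7) -> str:
    """Return the engineer whose shift covers `day` in a simple cyclical rotation."""
    if day < start:
        raise ValueError("day is before the rotation starts")
    # Count how many whole shifts have passed, then wrap around the contact list.
    shifts_elapsed = (day - start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]

# Hypothetical three-person team on a weekly rotation starting Monday, July 4, 2022.
team = ["Alexander", "Jane", "Sam"]
rotation_start = date(2022, 7, 4)

print(on_call_for(date(2022, 7, 11), team, rotation_start))  # Jane
print(on_call_for(date(2022, 7, 25), team, rotation_start))  # Alexander (the cycle repeats)
```

The modulo operation is what makes the schedule a rotation: once the last contact finishes a shift, the first contact is up again.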

Primary and Backup On-Calls

For every on-call shift, there's always a primary on-call. The primary on-call is the first point of contact for incidents and alerts; they often carry a pager to get notified of such alerts.

While optional, many teams find it useful to also have a backup on-call (or secondary on-call) for each shift. The backup supports the primary with operations tasks throughout the shift. In addition, the backup is paged in on an incident if the primary misses a notification.

Service-Level Agreements

In your on-call rotations, it's crucial to define the SLA, short for service-level agreement. Most commonly, this refers to the amount of time that the primary on-call has to acknowledge a new alert (e.g., thirty minutes). If the primary misses the SLA, the backup on-call may be paged.

Escalations

If an on-call engineer needs more help with an incident, they’ll usually escalate the issue up the management chain.

This brings in on-call managers who may have more visibility into an issue, and they can page in other domain-specific engineers if necessary. It's also common for an on-call alert system to have automated escalation paths when detecting that SLAs aren't being met.
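
An automated escalation path of this kind might be modeled roughly as follows. This is a sketch under stated assumptions, not Lightstep's implementation: the thirty-minute step mirrors the SLA example above, while the sixty-minute manager step, the `EscalationStep` class, and the `paged_so_far` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes an alert may stay unacknowledged before this step fires
    notify: str         # role that gets paged at this step

# Hypothetical policy: the primary is paged immediately, the backup once the
# thirty-minute SLA is missed, and the on-call manager after an hour.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(30, "backup on-call"),
    EscalationStep(60, "on-call manager"),
]

def paged_so_far(minutes_unacknowledged: int, policy: list[EscalationStep] = POLICY) -> list[str]:
    """Return every role that should have been paged by now, in escalation order."""
    return [step.notify for step in policy if minutes_unacknowledged >= step.after_minutes]

print(paged_so_far(10))  # ['primary on-call']
print(paged_so_far(45))  # ['primary on-call', 'backup on-call']
```

Keeping the policy as plain data makes it easy to define different paths for different severities.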

Overrides

Sometimes, things come up in your team members' personal lives that affect their availability. It's unrealistic to expect every team member to be available whenever their on-call shift comes up. So, any good on-call scheduling system should support overrides to future shifts, so that engineers can cover for each other if needed.
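
One simple way to think about overrides is as a separate layer consulted before the regular schedule. The sketch below assumes a day-by-day schedule; the names, dates, and the `effective_on_call` helper are placeholders for illustration only.

```python
from datetime import date

# Hypothetical regular schedule: each day maps to its primary on-call.
scheduled = {
    date(2022, 7, 11): "Alexander Yu",
    date(2022, 7, 12): "Alexander Yu",
}

# Overrides live in their own mapping, so the recurring schedule stays untouched.
overrides = {date(2022, 7, 11): "Jane Doe"}

def effective_on_call(day: date, scheduled: dict, overrides: dict) -> str:
    """Return the override for `day` if one exists, otherwise the scheduled engineer."""
    return overrides.get(day, scheduled[day])

print(effective_on_call(date(2022, 7, 11), scheduled, overrides))  # Jane Doe
print(effective_on_call(date(2022, 7, 12), scheduled, overrides))  # Alexander Yu
```

Because the override never modifies the underlying rotation, removing it restores the original shift automatically.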

Creating an On-Call Schedule Using Lightstep Incident Response

Lightstep Incident Response supports all the essential on-call rotation features listed above. Using Lightstep, teams can easily create and manage their own on-call schedules. This section will demonstrate how to quickly set one up.

In Lightstep, you must first set up a team. This example starts out with a three-member team, though your rotation should definitely have more people.

A screenshot of Lightstep Incident Response showing a basic team overview.

Under the On-call schedule tab, the team manager (in this case, Alexander Yu) can easily set up recurring on-call shifts. Suppose that this team owns a service that has historically seen high traffic on Mondays and Tuesdays, medium traffic on Wednesdays through Fridays, and low traffic over the weekends.

The team might then decide to create three separate shifts throughout the week to cover each of these different traffic patterns. Using Lightstep, the team can then easily visualize their on-call schedule.
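
Independent of any particular tool, the three-shift split can be written down as data and checked for gaps. The following is a rough sketch; the shift names and the `shift_for` helper are made up for illustration and do not reflect how Lightstep stores schedules.

```python
from datetime import date

# Hypothetical shift definitions, using Python's weekday numbering (Monday is 0).
SHIFT_WEEKDAYS = {
    "high-traffic": {0, 1},       # Monday and Tuesday
    "medium-traffic": {2, 3, 4},  # Wednesday through Friday
    "low-traffic": {5, 6},        # Saturday and Sunday
}

def shift_for(day: date) -> str:
    """Return the name of the shift that covers the given day."""
    for name, weekdays in SHIFT_WEEKDAYS.items():
        if day.weekday() in weekdays:
            return name
    raise ValueError(f"no shift covers {day}")

# Sanity check: every weekday is covered exactly once, so there are no gaps or overlaps.
covered = sorted(d for weekdays in SHIFT_WEEKDAYS.values() for d in weekdays)
assert covered == list(range(7))

print(shift_for(date(2022, 7, 11)))  # high-traffic (a Monday)
```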

A screenshot of Lightstep Incident Response showing a visualization of an on-call schedule featuring three distinct shifts

At the top, you can see the Total coverage, which gives an overview of who’s covering for the service at any given time. From this view, users can also add overrides for any particular shift or day.

For example, suppose Alexander Yu is unable to do his shift on Monday, July 11th. By clicking on his name in the Total coverage box, he can select Add coverage and choose someone else to cover for him. In the image below, notice that Jane Doe has overridden Alexander Yu's original shift on Monday.

A screenshot of Lightstep Incident Response showing an on-call shift override on Monday.

Lightstep also supports adding custom SLAs and escalation policies. For example, for a high-severity incident, you might define an escalation policy as follows.

A screenshot of an escalation policy for a P1 incident in Lightstep Incident Response.

As you can see, creating an on-call schedule with Lightstep is easy. While this is a great first step, there are many ways you can improve this rotation.

Lightstep Incident Response also offers pre-built on-call schedule templates that let you quickly set up on-call schedules for your team, including:

  • Follow the sun

  • Week/Weekend

  • Off-hours

Of course, you still have the ability to set up custom on-call schedules as well.

A screenshot of the Lightstep Incident Response app showing where to create an on-call schedule

Common Challenges with On-Call Schedules

Creating your on-call rotation is just the beginning. To identify how to improve on the basic schedule, you can look at some of the most common challenges organizations face and consider how to adjust your rotations accordingly.

Not Adapting Your On-Call Schedule to Your Team's Needs

There is no one-size-fits-all approach to on-call scheduling. Each team will need a slightly different schedule to best handle its operational needs. Product maturity, user volume, team size, and overall morale are just some of the factors that can influence on-call schedules.

In the on-call schedule we created earlier, we had three separate shifts during the week for high, medium, and low traffic periods. However, after a few weeks, you might decide that it makes more sense to turn this into a basic weekly rotation instead.

Conversely, if your team currently uses a weekly rotation but engineers are getting swamped with issues, you might want to break that shift up into two, or add in a backup on-call.

Not Having Proper On-Call Hand-Offs

When creating an on-call schedule, make sure you consider on-call hand-offs. Good on-call schedules typically rotate during office hours, when the engineers involved in the rotation are online at the same time. This allows engineers to communicate on key issues and patterns that they observed during their shifts.

Ideally, your team already holds operations meetings to discuss key customer issues and any unusual spikes in service metrics. An ops meeting would be the perfect time to mark the end of one on-call shift, allowing the current on-call to present their findings to the entire team. Conveniently, this meeting can also mark the start of the next on-call shift, and the cycle can repeat as necessary.

In our Lightstep schedule, instead of having shifts rotate at midnight, they should rotate sometime during regular business hours. For example, if the team has weekly ops meetings on Wednesdays at 3 p.m., it might make sense for the shift to end Wednesdays at noon. This also gives the current on-call ample time to prepare an on-call report.

Relying Too Heavily on a Small Group of Engineers

Small rotations, like the one we created in Lightstep, can end up being extremely burdensome on a team. To avoid relying too heavily on a small group of engineers for on-call tasks, consider expanding or combining rotations to spread the workload more evenly.

For example, if there were two sister teams in an organization with rotations of three to five engineers each, it might make sense to combine these into a six- to ten-person rotation.
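
A quick back-of-the-envelope calculation shows why combining rotations helps. The sketch below assumes week-long shifts with a single primary on-call and a 52-week year; the helper name is hypothetical.

```python
def weeks_on_call_per_year(team_size: int, weeks_per_year: int = 52) -> float:
    """Average number of weeks each engineer spends as primary on-call per year."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return weeks_per_year / team_size

for size in (3, 6, 10):
    print(f"{size} engineers: {weeks_on_call_per_year(size):.1f} weeks each")
# 3 engineers: 17.3 weeks each
# 6 engineers: 8.7 weeks each
# 10 engineers: 5.2 weeks each
```

Under these assumptions, merging two three-person rotations roughly halves each engineer's time on call, from about one week in three to one week in six.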

Alert fatigue is an equally important problem; it occurs when your on-calls get so many alerts that they become desensitized to new ones. When alert fatigue gets bad, engineers may inadvertently miss SLAs or ignore alerts altogether. Engineers who are on call too frequently run a higher risk of alert fatigue.

Having a Poor Alert System

Another common cause of alert fatigue is a poor alert system. While not directly related to the scheduling aspect of on-call rotations, you should ensure that your on-calls only get paged when absolutely necessary. An excessive number of false alarms can quickly lead to alert fatigue, which can cause engineers to miss real alerts when they arise.

You can remedy this problem by revisiting your alarm thresholds during each operations meeting. If you have a metric that consistently breaches a threshold, it's worth investigating to figure out why that's occurring. If it's deemed to be nothing too serious (e.g., it's simply due to a sustained increase in user traffic), you might choose to simply raise the threshold.

You might also decide to tweak the severity level (e.g., P1 issues are the most urgent, while P2 issues are less urgent) for particular alerts.

For example, suppose your service tracks latency for a particular API operation, which has an alarm threshold of 200 ms. If the latency breaches this threshold three times in a five-minute window, you might configure your system to send a lower-severity P2 issue to the on-call engineer, which doesn't trigger a pager alert. However, if you see ten latency breach instances within a five-minute period, this may trigger a high-severity P1 issue, which will page the on-call.

In both instances, the system notifies the on-call about latency issues that need to be examined, but they occur at different severity levels; this helps combat alert fatigue.
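
The latency example above could be expressed roughly like this. It is a minimal sketch rather than how any specific alerting product works: the constants come from the example in the text, while the `classify` function and the `(timestamp, latency_ms)` sample format are assumptions made for illustration.

```python
from datetime import datetime, timedelta

LATENCY_THRESHOLD_MS = 200      # alarm threshold from the example
WINDOW = timedelta(minutes=5)   # look-back window
P2_BREACHES = 3                 # breaches that raise a non-paging P2
P1_BREACHES = 10                # breaches that raise a paging P1

def classify(samples, now):
    """Return "P1", "P2", or None for a list of (timestamp, latency_ms) samples."""
    breaches = sum(
        1
        for timestamp, latency_ms in samples
        if now - WINDOW <= timestamp <= now and latency_ms > LATENCY_THRESHOLD_MS
    )
    if breaches >= P1_BREACHES:
        return "P1"  # high severity: page the on-call
    if breaches >= P2_BREACHES:
        return "P2"  # lower severity: notify without paging
    return None      # too few breaches to alert on

now = datetime(2022, 9, 1, 12, 0)
# Four slow requests in the last few minutes: enough for a P2, not for a P1.
samples = [(now - timedelta(minutes=i), 250) for i in range(4)]
print(classify(samples, now))  # P2
```

Checking the P1 condition first ensures that a burst of ten or more breaches is never downgraded to a P2.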

Not Providing Proper Training for Operations Tasks

It sounds like a no-brainer, but on-call engineers need to feel prepared to handle their shifts. A common mistake some organizations make is adding newer engineers to the rotation too early in their tenure with the team. Without sufficient training, on-calls may not know how to properly handle alerts or escalate issues.

Teams often employ on-call shadowing to ease new engineers into the on-call rotation. Shadow on-calls are different from secondary on-calls. When an engineer is assigned to be a shadow on-call, they receive all the same alerts as the primary on-call. As the primary continues with their regular shift, they’re also actively mentoring the shadow and guiding them through tackling each issue.

After an engineer has gone through one or two shadow rotations, they should also do a reverse shadow rotation, where they take the lead on issues with guidance from the primary.

Employee Burnout

Take a look at our article about on-call policies and management to learn more about how on-call management reduces the risk of incident notifications being missed and helps ensure they are addressed in a timely manner.

Building an Effective On-Call Schedule

On-call schedules are a vital part of any organization that provides software services. By implementing an on-call schedule that strikes the proper balance between availability and employee happiness, you can ensure that issues are fixed quickly, yet don't lead to employee burnout.

Best Practices

Benefits of an Effective On-Call Rotation

  • On-call engineers feel prepared to handle their shifts

  • New engineers are eased into the on-call rotation

  • On-call employees are less likely to experience burnout and can maintain a positive work-life balance

  • On-calls receive alerts at different severity levels

  • Customers are happier knowing that urgent issues will be addressed in a timely manner

Final Thoughts

Prevent incidents before they happen and recover quickly when they do. Lightstep Incident Response provides a simple user interface where teams can quickly create their on-call schedule.

As your team grows, your on-call schedules will need tweaking as well. Organizations that don't adapt their scheduling over time risk burning out their engineers. After studying the common problems presented in this article, you should be well equipped to create and maintain an effective on-call schedule for your team.

Sign up for a free trial of Lightstep Incident Response

September 1, 2022

About the author

Alexander Yu
