
Error Budget and Uptime

An error budget is the number of errors or the amount of degradation that your service can experience over a given period before you breach your service level agreement (SLA) or service level objective (SLO) and compromise user confidence. To understand error budgets, you first need to understand SLAs, the commitments you make to your customers, and SLOs, the internal targets you set for yourself.

Establishing an error budget is an important part of site reliability engineering (SRE), the practice of using data- and software-driven methodologies to improve the operation of production systems. SRE combines software engineering and systems engineering principles to handle the availability and scalability needs of modern Internet services. SRE teams are responsible for building, deploying, maintaining, monitoring, and troubleshooting large-scale distributed systems.

Adopting an error budget helps you manage risk and create a more resilient service, maximizing your uptime and user happiness. This means there’s a strong business case for error budgets, in addition to the engineering advantages they provide.

In this article, you’ll learn what error budgets are, how to set them, and what to do when they’re exhausted.

What Are Error Budgets?

An error budget is the difference between your internal service level objective for a service and the SLA you actually commit to your customers, which gives you “room for error.” Error budgets are usually built into SLAs, so exceeding one means you’ve breached a contract with your customer. This means you’re probably liable for providing financial compensation and could suffer reputational loss.

Services that come with an SLA tend to include an error budget, even if this is not explicitly acknowledged. As an example, an SLA that stipulates your site must be online for 99.95 percent of a calendar year gives you a total error budget of four hours, twenty-two minutes, and forty-eight seconds. That’s how much time you have to “spend”—by allowing incidents to continue—before you have material consequences on your business.

How Error Budgets Are Spent

In practice, error budgets are usually spent by taking risks. For example, software changes such as database upgrades, legacy system migrations, and major new feature launches are inherently dangerous. Staging environments and deployment rehearsals can’t catch every edge case. An error budget gives you leeway in case of production problems, providing a window where it’s acceptable for downtime to occur. You can spend the error budget carrying out these operations without compromising your SLA.

Take a simple database schema change: you know it’ll take a few minutes to apply, and there’s a risk it could fail and need a rollback. At the time of deployment, you’ve got two hours of unused error budget available in your measuring period. You estimate this will be ample time to perform the rollback if it’s required. You continue applying the patch.

Then, disaster strikes! The update doesn’t work and you have to use the fallback option. It ends up taking ten minutes to run through the whole procedure, during which time your system’s offline.

Now, you’re left with one hour and fifty minutes of error budget; as you’re still within your SLA, there’s no need to compensate the customer for this outage.

Miscategorized errors can also deplete error budgets. For example, a customer might experience slow page load times because of a third-party library that’s taking too long to respond. If this is reported as an error with your service, it will use up valuable error budget that could be spent on more critical issues.
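The schema-change scenario above can be tracked with a very small ledger. The following is a minimal sketch, not a prescribed implementation: the `ErrorBudget` class and its method names are hypothetical and exist only for illustration.

```python
from datetime import timedelta


class ErrorBudget:
    """Hypothetical ledger that tracks downtime spent against a budget."""

    def __init__(self, total: timedelta) -> None:
        self.total = total
        self.spent = timedelta(0)

    def record_downtime(self, duration: timedelta) -> None:
        """Charge an outage against the budget."""
        if duration < timedelta(0):
            raise ValueError("duration must not be negative")
        self.spent += duration

    @property
    def remaining(self) -> timedelta:
        """Budget left in the period (negative once the budget is exceeded)."""
        return self.total - self.spent

    @property
    def breached(self) -> bool:
        """True once more downtime has been recorded than the budget allows."""
        return self.spent > self.total


# Two hours of unused budget, then a ten-minute failed deployment and rollback.
budget = ErrorBudget(timedelta(hours=2))
budget.record_downtime(timedelta(minutes=10))
print(budget.remaining)  # 1:50:00
print(budget.breached)   # False
```

Keeping miscategorized incidents out of `record_downtime` is what stops them from draining the budget, so the classification step matters as much as the arithmetic.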

The Purpose of Error Budgets

Error budgets are a neat way to balance continuous innovation with the need for systems to function reliably. Engineering teams are often extremely forward-focused—you want to be building the next big thing, adding value, and bringing in more revenue.

Users see things slightly differently though. They’re unlikely to welcome upgrades that cause frequent instability. A system’s only useful to employees and customers if it functions predictably and is available when they need it.

Error Budgets Inform Development Priorities

Incorporating error budgets into your site reliability plan is a pragmatic way to combine innovation with maximum uptime. Sometimes you’re able to rapidly progress by building new capabilities and improvements. On other occasions, you need to pause and address failures. Error budgets provide data so you can make an informed decision about which side most needs your attention in the present moment.

As such, error budgets should have an active role within your organization. They bridge the gap between development teams and business objectives. The spare budget is an effective measure of how well your service is currently meeting user demands. This information should be utilized by everyone involved in the project, from developers to project leads.

Keep in mind that error budget depletion isn’t always the developers’ fault: a sustained high rate of budget usage could indicate that the company is moving too fast and that mistakes are being made under pressure. Teams that burn through all their error budget are commonly prohibited from releasing new features.

Error Budgets and Developer Autonomy

Error budgets are a way to increase developer autonomy too. Many developers want more freedom to choose how they work and the order in which tasks are completed. Error budgets are a way to provide this while protecting service reliability.

Engineers should be free to spend the available error budget however they want—whether by launching features, performing upgrades, or trialing new experimental systems. Any unallocated budget can be used in pursuit of innovation, without asking operations for permission first. This maximizes throughput and encourages teams to reduce the number of live incidents that occur, as it frees up more time to continue moving forward.

Calculating Your Error Budget

Error budgets are usually calculated from your SLAs. This enables them to be used as a direct indicator of when an outage breaches a contract with a customer.

To work out an SLA and error budget that you can feasibly commit to, one step you can take is to assess the number of incidents you’ve recently experienced, as well as the time it took to restore service after each one.
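One way to make that assessment concrete is to turn recent incident durations into an achieved availability figure and a mean time to restore. This is a rough sketch under stated assumptions: the function names are hypothetical, the incident durations are made-up sample values, and every incident is treated as a full outage.

```python
from datetime import timedelta


def achieved_availability(incidents: list[timedelta], period: timedelta) -> float:
    """Return the uptime percentage implied by the incident durations."""
    downtime = sum(incidents, timedelta(0))
    return 100 * (1 - downtime / period)


def mean_time_to_restore(incidents: list[timedelta]) -> timedelta:
    """Return the average time taken to restore service per incident."""
    if not incidents:
        return timedelta(0)
    return sum(incidents, timedelta(0)) / len(incidents)


# Illustrative sample data: three outages over the last 90 days.
recent_incidents = [
    timedelta(minutes=25),
    timedelta(minutes=40),
    timedelta(hours=1, minutes=5),
]
window = timedelta(days=90)

print(f"{achieved_availability(recent_incidents, window):.3f}%")
print(mean_time_to_restore(recent_incidents))
```

If the achieved figure sits only slightly above the SLA you are considering, the budget leaves little room for planned risk.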

Error budgets can also be derived from SLOs—your internal uptime objectives—or any other error-related metric that you track.

Calculating by Uptime

Uptime is the most common form of error budget. This measures the time your service is accessible and functioning normally. Many SaaS vendors use uptime as the principal commitment in their SLAs, committing to 99 percent, 99.5 percent, or perhaps even 99.9 percent availability over a given period.

To calculate an error budget from an uptime percentage, multiply the total amount of time in the considered period by the fraction of downtime the target allows, which is 100 percent minus your uptime percentage. A 99.9 percent target, for example, leaves 0.1 percent of the period as error budget. Here are some common examples.

| SLA % | Annual Error Budget (Downtime) | Monthly Error Budget (Downtime) |
|-------|--------------------------------|---------------------------------|
| 99.99% | 52 minutes, 35 seconds | 4 minutes, 23 seconds |
| 99.95% | 4 hours, 23 minutes | 21 minutes, 54 seconds |
| 99.90% | 8 hours, 46 minutes | 43 minutes, 49 seconds |
| 99.75% | 21 hours, 55 minutes | 1 hour, 49 minutes |
| 99.50% | 43 hours, 50 minutes | 3 hours, 39 minutes |
| 95.00% | 18 days, 6 hours | 1 day, 13 hours |

You should always compute your downtime budget in hours and minutes before you commit to an SLA. Even small changes in uptime percentages can have a huge impact on the amount of downtime you’re allowed over a year. For example, moving from 99.95 percent to 99.99 percent costs you three and a half hours of your error budget.
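The conversion from an uptime percentage to a downtime budget can be sketched in a few lines. The helper name `downtime_budget` is hypothetical, and the sketch assumes an average year of 365.25 days with a month taken as one twelfth of that, which appears to be the basis for the table figures.

```python
from datetime import timedelta


def downtime_budget(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed by an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    allowed_fraction = (100 - uptime_percent) / 100
    return timedelta(seconds=period.total_seconds() * allowed_fraction)


# Assumption: an average year of 365.25 days; a month is one twelfth of it.
YEAR = timedelta(days=365.25)
MONTH = YEAR / 12

for target in (99.99, 99.95, 99.9, 99.5):
    print(target, downtime_budget(target, YEAR), downtime_budget(target, MONTH))
```

Passing `timedelta(days=365)` instead of `YEAR` gives the calendar-year figure of roughly four hours, twenty-two minutes, and forty-eight seconds for a 99.95 percent target.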

Calculating by Failed Requests

Another popular error budget for web-based systems is the number of failed requests. You could define a threshold that creates an incident if more than 1 percent of requests return an HTTP status code in the 5xx range.

This kind of budget is best measured as a percentage because you can’t predict the actual number of requests that’ll occur over a particular period. It’s dependent on how consistently your service is used.
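A request-based check along these lines might look like the following sketch. The function names are hypothetical, the 1 percent threshold is the example figure from the text, and the status codes are assumed to come from whatever request log or metrics source you already have.

```python
from collections.abc import Iterable


def failed_request_ratio(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that ended with a 5xx status code."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    failures = sum(1 for code in codes if 500 <= code <= 599)
    return failures / len(codes)


def exceeds_request_budget(status_codes: Iterable[int], threshold: float = 0.01) -> bool:
    """True when more than `threshold` of requests failed with a 5xx code."""
    return failed_request_ratio(status_codes) > threshold


# Illustrative sample: 100 requests, two server errors, one client error.
sample = [200] * 97 + [500, 503, 404]
print(failed_request_ratio(sample))    # 0.02
print(exceeds_request_budget(sample))  # True
```

Client errors such as 404 are deliberately not counted here, since the text defines the budget in terms of 5xx responses only.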

When Error Budgets Are Breached

Going over your error budget is a serious event. You’ll need to compensate customers if the breached error budget was derived from one of your SLAs. Additionally, your service will have stopped meeting user expectations, putting your reputation on the line.

When this happens, it’s important to take immediate action to address the problem. You’ll have no error budget left to spend on risks, so all your efforts need to focus on fixing the errors and preventing future regressions.

Here are some steps you can take when you have little error budget remaining or have already gone over it:

Freeze New Launches

Development teams need to prioritize patching incident causes ahead of any other work. Therefore, new features and other non-essential improvements should be halted as you approach the limits of the budget.

Operations engineers can guard live environments by freezing deployments in production. Preventing new code from rolling out protects the application against the further risk that would push you over the error budget threshold. This works as an incentive to ensure developers are mindful of the quality and stability of their code in production.

Analyze What Went Wrong

Analyze what consumed the budget after you resolve the incident. Retrospectives can help you discover previously unacknowledged risks, which you can mitigate with further patches. This will make it less likely that the same problem burns through the error budget in the future.

Anticipate the Next Event

This pivot from innovation to service restoration ideally occurs before you actually go over an error budget. You might agree internally that 80 percent error budget consumption is the mark that initiates the shift.
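An internal consumption threshold like this can be expressed as a simple policy check. The sketch below is illustrative only: the `Mode` names and the `release_mode` function are hypothetical, and the 80 percent pivot is the example figure mentioned above rather than a recommended value.

```python
from datetime import timedelta
from enum import Enum


class Mode(Enum):
    INNOVATE = "innovate"    # budget is healthy: keep shipping features
    RESTORE = "restore"      # pivot point reached: prioritize reliability work
    EXHAUSTED = "exhausted"  # no budget left: freeze risky changes


def release_mode(spent: timedelta, total: timedelta, pivot: float = 0.8) -> Mode:
    """Map error budget consumption to a working mode."""
    if total <= timedelta(0):
        raise ValueError("total budget must be positive")
    consumed = spent / total
    if consumed >= 1.0:
        return Mode.EXHAUSTED
    if consumed >= pivot:
        return Mode.RESTORE
    return Mode.INNOVATE


monthly_budget = timedelta(minutes=43, seconds=49)  # 99.9% monthly budget
print(release_mode(timedelta(minutes=10), monthly_budget))  # Mode.INNOVATE
print(release_mode(timedelta(minutes=40), monthly_budget))  # Mode.RESTORE
```

Running a check like this on a schedule gives teams warning before the budget is fully consumed, instead of after.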

Waiting until the budget’s completely consumed means you won’t be able to take any risks after the incident is resolved, at least not until your SLA rolls over into a new period. This could cause unacceptable delays when you need to ship new features but can’t guarantee a smooth rollout.

Remember the Role of Error Budgets

While feature freezes can be frustrating, remember that choosing to deploy when there’s no available error budget will have tangible repercussions on your organization. If something goes wrong, you’ll have no leeway to resolve the issue as you’ve already breached your commitment to your customer.

More widely, regularly ignoring error budget overages will test your customers’ patience and reduce your platform’s perceived reliability. Error budgets are not simply passive counters; they’re meant to actively guard against business risks.

Final Thoughts

An error budget is a software reliability tool in which errors are counted cumulatively until a certain threshold is reached. It’s a way to ensure conformance with agreed-upon SLOs and SLAs while permitting a certain amount of risk to be taken.

The available budget is “spent” by prioritizing new features ahead of error-causing bugs. This acknowledges that errors are inevitable and teams need room to move forward. However, reaching the threshold should motivate the organization to address the issues until the error count recedes and a new budget quota is available. This ensures that reliability problems are fixed promptly, instead of being endlessly deferred.

Setting error budgets requires a platform that can track incidents as they occur while decreasing the downtime and associated usage of your error budget. Lightstep is a cloud-native reliability solution that provides automated monitoring, observability, and incident response functions.

You can set up alerts for new problems and react to them in real time, helping you detect when error budgets are being consumed. Sign up to Lightstep for free to reduce alert fatigue and orchestrate your incident response.

Sign up to Lightstep Incident Response today to monitor performance in real time, identify emerging problems, and keep customers informed of your recovery efforts 👍🏼

August 26, 2022
10 min read


About the author

James Walker
