Cost of Downtime and How to Calculate It


by James Walker

Downtime can have a crippling impact on your business, even if you manage to restore services quickly. Many organizations underestimate the true cost of downtime because its wide-ranging effects are difficult to grasp. Outages can cause lost revenue, reputational damage, reduced employee productivity, and regulatory penalties, creating spiraling costs for every minute you’re offline.

In this article, you’ll learn to calculate the true cost of downtime to help understand the long-term impacts of an outage. You’ll also look at some techniques for mitigating the effects of downtime and accelerating service recovery. These can help you cut costs and restore user confidence.

What Does Business Downtime Mean?

Downtime occurs whenever a business is unable to operate its core functions. In the context of technology companies, downtime is usually the result of a software bug, configuration error, or hardware failure that prevents customers or employees from accessing your services.

Downtime affects businesses differently depending on their industry and operating model:

An e-commerce company loses sales if customers can’t complete an online checkout.
Manufacturing firms stop producing if their inventory management system is unavailable.
Logistics firms are unable to complete deliveries if packages can’t be scanned in and out of warehouses.

Any one of these scenarios can accrue huge costs for the organization after the briefest of outages. The loss of critical systems often leads to a backlog of pending work that could take days or weeks to resolve.

Common Causes of Downtime

Downtime can stem from many types of issues. The following are some of the most common situations that organizations are likely to face.

Software Bugs and Glitches

Bugs leading to software crashes are a top cause of downtime. Cloudflare encountered this in July 2019 when a deployment introduced a long-running regular expression that caused massive CPU exhaustion. Users couldn’t access websites deployed behind a Cloudflare proxy as there was no spare capacity.

“Research has shown that as much as 80% of system unavailability is caused by incorrectly applied change. This includes changes made at unauthorized times or without approved change tickets, and can also include approved changes that are not properly executed” (Network World).

Hardware Failure

Physical hardware failure can still cause downtime, usually for businesses that self-host their applications. This is less of an issue for companies using public cloud providers with highly available compute architectures.

Networking Errors

Network disruption between services can render individual components inaccessible or take the system offline entirely. Even services hosted in major public clouds aren’t immune: the outage experienced by AWS in December 2021 is an example of an incident where internal networking issues had a knock-on impact on customer workloads.

Misconfiguration

Misconfiguration can occur in several forms, from simple incorrect config values that create unexpected behavior, to sub-optimal auto-scaling that ends up making congestion worse.

Power Outages and External Factors

External disasters such as power outages, floods, and fires are constant threats. Although datacenters should be adequately protected from these events, some residual risk always remains. For example, some customers faced unrecoverable data loss when OVHcloud’s Strasbourg datacenter burned down in 2021.

Migration Errors

Migrations and upgrades are a common cause of problems. Unforeseen incompatibilities and deployment errors can cause failures in production, even if the system functioned correctly in staging environments. This was the case during TSB’s disastrous transition to a new platform in 2018, which saw customers locked out of their bank accounts for up to two weeks.

Human Error

Human error remains a frequent cause of downtime, usually seen in conjunction with one of the other factors on this list. Simple mistakes can cause networking outages, power failures, and software misconfigurations. Facebook’s October 2021 outage began when an engineer unintentionally disabled networking between the company’s datacenters and the internet.

Downtime Cost

Downtime always has a cost, irrespective of the outage’s root cause. The duration of the downtime and the cost incurred per minute you’re offline are the two variables that most affect the financial impact of an outage.

The following formula is the simplest for calculating the cost of a period of downtime:

Cost of Outage = (Minutes of Downtime x Cost per Minute)

The cost per minute will be unique to your organization. The most basic way of computing this value is using the revenue that your online services would generate in a typical minute. For example, if you normally make $10,000 in sales per day, you’re making about $6.90 per minute ($10,000 / (24 hours x 60 minutes), rounded down). Consequently, an outage of only 30 minutes has an associated cost of over $200:

Cost of Outage = (30 x $6.90) = $207
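
To make the arithmetic concrete, here’s a minimal Python sketch of the simple formula. The function names are illustrative and not part of any library, and the calculation assumes sales are spread evenly across the day. Because it doesn’t round the per-minute figure down to $6.90, the result for a 30-minute outage comes out slightly higher, at a little over $208.

```python
MINUTES_PER_DAY = 24 * 60


def cost_per_minute(daily_revenue: float) -> float:
    """Average revenue per minute, assuming sales are spread evenly over the day."""
    return daily_revenue / MINUTES_PER_DAY


def simple_outage_cost(minutes_of_downtime: float, per_minute_cost: float) -> float:
    """Cost of Outage = Minutes of Downtime x Cost per Minute."""
    return minutes_of_downtime * per_minute_cost


rate = cost_per_minute(10_000)  # the $10,000-per-day example from the text
print(f"Cost per minute: ${rate:.2f}")
print(f"30-minute outage: ${simple_outage_cost(30, rate):.2f}")
```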

In reality, this formula is too basic for all but the smallest organizations. You also need to account for recovery costs, any reputational damage, and the supplies that are still being consumed during the downtime period. After all, you’ve still got to pay your staffing costs, business rates, and utility bills, even if your organization’s unable to be productive.

A more accurate formula could look like this:

Cost of Outage = (Minutes of Downtime x (Average Sales per Minute + Average Costs per Minute + Contingency for Lost Business Due to Reputational Damage)) + Recovery Cost
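
The expanded formula translates directly into code. The following sketch is one way to express it in Python. The `OutageInputs` class, its field names, and every figure other than the $6.90 sales rate are hypothetical placeholders that you’d replace with your own numbers.

```python
from dataclasses import dataclass


@dataclass
class OutageInputs:
    minutes_of_downtime: float
    sales_per_minute: float       # average sales lost per minute
    costs_per_minute: float       # staffing, rates, and utilities still being paid
    reputation_per_minute: float  # contingency for lost business
    recovery_cost: float          # one-off cost of restoring service


def outage_cost(inputs: OutageInputs) -> float:
    """Minutes x (sales + costs + contingency per minute) + recovery cost."""
    per_minute = (
        inputs.sales_per_minute
        + inputs.costs_per_minute
        + inputs.reputation_per_minute
    )
    return inputs.minutes_of_downtime * per_minute + inputs.recovery_cost


example = OutageInputs(
    minutes_of_downtime=30,
    sales_per_minute=6.90,
    costs_per_minute=4.00,       # hypothetical
    reputation_per_minute=2.50,  # hypothetical
    recovery_cost=500.00,        # hypothetical
)
print(f"Estimated cost: ${outage_cost(example):.2f}")
```

With these placeholder inputs, the estimate is $902.00: 30 minutes at $13.40 per minute, plus $500 for recovery.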

Calculating the lost business contingency value can be the hardest part of the formula. You can usually produce a good ballpark figure by looking at the number of clients you acquire in a typical time period, averaging the extra value they bring to your business, and estimating how many would-be leads have been lost due to the outage.
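
As a rough illustration of that ballpark estimate, the sketch below converts a typical client acquisition rate into a per-minute contingency. The function name, its parameters, and the example figures are all assumptions made for illustration, and the calculation assumes leads arrive evenly throughout the day.

```python
def lost_business_contingency(
    new_clients_per_day: float,
    average_client_value: float,
    fraction_of_leads_lost: float,
    minutes_per_day: int = 24 * 60,
) -> float:
    """Per-minute contingency for would-be clients lost while you're offline."""
    if not 0.0 <= fraction_of_leads_lost <= 1.0:
        raise ValueError("fraction_of_leads_lost must be between 0 and 1")
    clients_per_minute = new_clients_per_day / minutes_per_day
    return clients_per_minute * average_client_value * fraction_of_leads_lost


# Hypothetical inputs: 12 new clients a day, each worth $600,
# with half of the would-be leads abandoning during an outage.
contingency = lost_business_contingency(12, 600.0, 0.5)
print(f"Contingency per minute: ${contingency:.2f}")
```

These example inputs produce a contingency of $2.50 per minute, which you could then feed into the fuller formula.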

There are methods that can make this formula even more accurate, if you’re prepared to deal with extra complexity. As an example, you could multiply the duration of downtime by a logarithmic coefficient to recognize that longer outages are usually more likely to impact customer acquisitions.
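
The article doesn’t prescribe a specific coefficient, so the following is only one possible shape for it. This sketch assumes a multiplier of 1 + ln(1 + minutes / 60), which starts at 1 for a momentary blip and grows slowly as the outage drags on. Both the formula and the 60-minute reference value are assumptions that you’d want to tune against your own data.

```python
import math


def duration_coefficient(minutes_of_downtime: float, reference_minutes: float = 60.0) -> float:
    """Multiplier that starts at 1.0 and grows logarithmically with outage length.

    Assumed form: 1 + ln(1 + minutes / reference_minutes). This is an
    illustrative choice, not a formula taken from the article.
    """
    if minutes_of_downtime < 0:
        raise ValueError("minutes_of_downtime must not be negative")
    return 1.0 + math.log1p(minutes_of_downtime / reference_minutes)


def weighted_minutes(minutes_of_downtime: float) -> float:
    """Duration of downtime multiplied by the logarithmic coefficient."""
    return minutes_of_downtime * duration_coefficient(minutes_of_downtime)


for minutes in (5, 30, 240):
    print(
        f"{minutes:>3} min -> coefficient {duration_coefficient(minutes):.2f}, "
        f"weighted {weighted_minutes(minutes):.1f} min"
    )
```

You’d then use the weighted duration in place of the raw minutes of downtime when applying the cost formula.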

What’s the Average Cost of Downtime?

The cost of downtime varies wildly between industries and individual organizations. It depends on the number of people who’ll be impacted, how instrumental the affected service is to your portfolio, and how quickly you recover.

Gartner estimates that major network outages normally incur costs to the tune of $5,600/minute for organizations operating at an enterprise scale. Delta Air Lines calculated the cost of its five-hour IT meltdown in 2016 at $150 million, illustrating how expensive a relatively short event can prove.

At the other end of the spectrum, a report by IDC for Carbonite found outages cost smaller businesses between $137 and $427 a minute.
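
Plugging these published figures into the simple formula gives a feel for the scale involved. The sketch below assumes a hypothetical 30-minute outage and treats each quoted figure as a flat per-minute rate, which is a simplification. It also works backward from Delta’s reported total to an implied per-minute cost.

```python
# Per-minute figures quoted above, in US dollars.
ENTERPRISE_PER_MINUTE = 5_600           # Gartner estimate for enterprise-scale outages
SMALL_BUSINESS_PER_MINUTE = (137, 427)  # range from the IDC report for Carbonite


def outage_estimate(minutes: float, per_minute: float) -> float:
    """Simple formula: minutes of downtime x cost per minute."""
    return minutes * per_minute


outage_minutes = 30  # hypothetical outage length

low, high = (outage_estimate(outage_minutes, rate) for rate in SMALL_BUSINESS_PER_MINUTE)
enterprise = outage_estimate(outage_minutes, ENTERPRISE_PER_MINUTE)
print(f"Small business: ${low:,.0f} to ${high:,.0f}")
print(f"Enterprise: ${enterprise:,.0f}")

# Delta's figures: $150 million over a five-hour outage.
delta_per_minute = 150_000_000 / (5 * 60)
print(f"Delta (implied): ${delta_per_minute:,.0f} per minute")
```

At these rates, half an hour offline costs a smaller business between $4,110 and $12,810, and an enterprise around $168,000. Delta’s total works out to roughly $500,000 per minute.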

The effect on these firms can be particularly acute. The lost revenue and potential regulatory fines create cash flow pressures that can cast doubt on long-term viability. Bringing in money is often the primary objective of these firms, and running out of operating funds is one of the most common reasons for failure.

Minimize Downtime Costs

With unplanned downtime incurring such devastating costs, what can you do to cut your expenses should disaster strike? Although downtime can’t be avoided entirely, acknowledging the risk and planning for it can help lessen its effects.

Put a Disaster Recovery Plan in Place

Establishing a clear disaster recovery plan should be your first step. This plan needs to identify critical components of your system, establish recovery time objectives, and document the processes to follow in the event of an outage.

You should store this document centrally inside your organization’s knowledge hub or operations manual so anyone can retrieve it quickly when the pager pings, whatever situation you’re facing.
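
If it helps to keep the plan machine-readable alongside the written document, the three elements above could be recorded in a simple structure. This is a minimal sketch: the component names, recovery time objectives, and runbook paths are hypothetical, and `breached_rto` is an illustrative helper rather than part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalComponent:
    name: str
    rto_minutes: int  # recovery time objective
    runbook: str      # location of the documented recovery process


# Hypothetical plan entries for illustration only.
PLAN = [
    CriticalComponent("checkout-api", rto_minutes=15, runbook="runbooks/checkout-api.md"),
    CriticalComponent("orders-db", rto_minutes=30, runbook="runbooks/orders-db.md"),
]


def breached_rto(plan, minutes_down):
    """Return the components whose current outage has exceeded their RTO."""
    return [c for c in plan if minutes_down.get(c.name, 0) > c.rto_minutes]


current_outage = {"checkout-api": 22, "orders-db": 10}  # minutes offline so far
for component in breached_rto(PLAN, current_outage):
    print(f"{component.name} is past its RTO; follow {component.runbook}")
```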

Open Communication Channels

It’s also important to facilitate clear communication between members of your restoration team. Primary and backup communication channels should be available, so you’re not dependent on a single platform. Inaccurate tracking and relaying of key discoveries will hinder your recovery effort and lead to delays that raise your costs.

Address Single Points of Failure

Another way to mitigate damage is to redesign your system to eliminate single points of failure. You can increase redundancy by distributing components across different cloud providers and ensuring data is replicated to multiple failover nodes.

If all your application servers connect to a single database instance, you face catastrophic downtime if that machine fails. Deploying database replicas behind a load balancer helps prevent this by automatically routing traffic to one of the healthy instances.

Writing post mortems after incidents and outages also helps your organization identify action items that prevent issues from recurring, reducing future outages.
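
To illustrate the routing behavior that a load balancer provides in the replica setup described above, here’s a toy sketch of choosing a healthy instance. The replica names and the simulated health status are hypothetical. A real deployment would rely on the load balancer’s own health checks, and would also need to handle details this sketch ignores, such as replication lag and promoting a replica to accept writes.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_replica(
    replicas: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first replica that passes its health check, or None if all fail."""
    for replica in replicas:
        if is_healthy(replica):
            return replica
    return None


# Hypothetical replica names and a simulated health status for illustration.
REPLICAS = ["db-primary", "db-replica-1", "db-replica-2"]
status = {"db-primary": False, "db-replica-1": True, "db-replica-2": True}

target = pick_healthy_replica(REPLICAS, lambda name: status[name])
print(f"Routing traffic to: {target}")
```

With a single database instance, the equivalent list has one entry, so a failed health check leaves nowhere to route traffic.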

Be Honest with Your Customers

An outage tends to prompt an all-out recovery effort inside the affected organization. This can overshadow the need to communicate regularly with customers. In many recent outages, such as Atlassian’s incident in April 2022, users were kept in the dark about the true severity of the problems.

Admitting an outage can be painful, but failing to inform users promptly creates a greater risk of damage to your company’s reputation. Providing regular status updates with precise information about the recovery effort can help maintain confidence in your solution, curbing the long-term cost of downtime.

Conclusion

Downtime will always be costly. Service outages jeopardize your platform’s perceived reliability and your company’s sales opportunities, and they are devastating to users. The ability to accurately forecast the true cost of downtime is important to establish meaningful service-level agreements (SLAs) and identify optimal outage mitigation strategies.

In this article, you’ve seen how to calculate the cost of downtime using a simple formula. You’ve also learned some ways to reduce downtime costs, such as eliminating single points of failure and implementing a disaster recovery plan.

To rapidly respond to downtime, you need timely alerts for when a new outage begins. Lightstep is a cloud-native reliability platform that provides a monitoring, observability, and incident response solution for systems at any scale.

Sign up to Lightstep today to monitor performance in real time, identify emerging problems, and keep customers informed of your recovery efforts.
