Downtime can have a crippling impact on your business, even if you manage to restore services quickly. Many organizations underestimate the true cost of downtime because it can be difficult to comprehend the wide-ranging effects. Outages can cause lost revenue, reputational damage, reduced employee productivity, and regulatory penalties, creating spiraling costs for every minute you’re offline.
In this article, you’ll learn to calculate the true cost of downtime to help understand the long-term impacts of an outage. You’ll also look at some techniques for mitigating the effects of downtime and accelerating service recovery. These can help you cut costs and restore user confidence.
What Does Business Downtime Mean?
Downtime occurs whenever a business is unable to operate its core functions. In the context of technology companies, downtime is usually the result of a software bug, configuration error, or hardware failure that prevents customers or employees from accessing your services.
Downtime affects businesses differently depending on their industry and operating model:
An e-commerce company loses sales if customers can’t complete an online checkout.
Manufacturing firms stop producing if their inventory management system is unavailable.
Logistics firms are unable to complete deliveries if packages can’t be scanned in and out of warehouses.
Any one of these scenarios can accrue huge costs for the organization after the briefest of outages. The loss of critical systems often leads to a backlog of pending work that could take days or weeks to resolve.
Common Causes of Downtime
Downtime can stem from many types of issues. The following are some of the most common situations that organizations are likely to face.
Software Bugs and Glitches
Bugs leading to software crashes are a top cause of downtime. Cloudflare encountered this in July 2019 when a deployment introduced a long-running regular expression that caused massive CPU exhaustion. Users couldn’t access websites deployed behind a Cloudflare proxy as there was no spare capacity.
“Research has shown that as much as 80% of system unavailability is caused by incorrectly applied change. This includes changes made at unauthorized times or without approved change tickets, and can also include approved changes that are not properly executed” (Network World).
Hardware Failure
Physical hardware failure can still cause downtime, most often for businesses that self-host their applications. This is less of an issue for companies using public cloud providers with highly available compute architectures.
Networking Errors
Network disruption between services can render individual components inaccessible or take the system offline entirely. Even services hosted in major public clouds aren’t immune: the outage experienced by AWS in December 2021 is an example of an incident where internal networking issues had a knock-on impact on customer workloads.
Misconfiguration
Misconfiguration can occur in several forms, from simple incorrect config values that create unexpected behavior, to sub-optimal auto-scaling that ends up making congestion worse.
Power Outages and External Factors
External disasters such as power outages, floods, and fires are constant threats. Although data centers should be adequately protected against this kind of event, some vulnerability always remains. For example, some customers faced unrecoverable data loss when OVHcloud’s Strasbourg data center burned down in 2021.
Migration Errors
Migrations and upgrades are common causes of problems. Unforeseen incompatibilities and deployment errors can cause failures in production, even if the system functioned correctly in staging environments. This was the case during TSB’s disastrous transition to a new platform in 2018, which saw customers locked out of their bank accounts for up to two weeks.
Human Error
Human error remains a frequent cause of downtime, usually seen in conjunction with one of the other factors on this list. Simple mistakes can cause networking outages, power failures, and software misconfigurations. Facebook’s October 2021 outage began when an engineer unintentionally disabled networking between the company’s data centers and the internet.
Calculating Downtime Cost
Downtime always has a cost, irrespective of the outage’s root cause. The duration of the downtime and the cost incurred per minute you’re offline are the two variables that most affect the financial impact of an outage.
The following formula is the simplest for calculating the cost of a period of downtime:
Cost of Outage = (Minutes of Downtime x Cost per Minute)
The cost per minute will be unique to your organization. The most basic way of computing this value is to use the revenue that your online services would generate in a typical minute. For example, if you normally make $10,000 in sales per day, you’re making about $6.90 per minute ($10,000 / (24 hours x 60 minutes)). Consequently, an outage of only 30 minutes has an associated cost of over $200:
Cost of Outage = (30 x $6.90) = $207
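The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive calculator: the function names are hypothetical, and it assumes sales are spread evenly across the day. Because it uses the unrounded per-minute rate, it lands slightly above the $207 you get from the rounded $6.90 figure.

```python
MINUTES_PER_DAY = 24 * 60


def cost_per_minute(daily_revenue: float) -> float:
    """Average revenue per minute, assuming sales are spread evenly over the day."""
    return daily_revenue / MINUTES_PER_DAY


def simple_outage_cost(minutes_of_downtime: float, per_minute_cost: float) -> float:
    """Cost of Outage = Minutes of Downtime x Cost per Minute."""
    return minutes_of_downtime * per_minute_cost


# Figures from the example above: $10,000 in daily sales, 30 minutes offline.
rate = cost_per_minute(10_000)
cost = simple_outage_cost(30, rate)

print(f"Cost per minute: ${rate:.2f}")
print(f"Cost of a 30-minute outage: ${cost:.2f}")
```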
In reality, this formula is too basic for all but the smallest organizations. You also need to account for recovery costs, any reputational damage, and the supplies that are still being consumed during the downtime period. After all, you’ve still got to pay your staffing costs, business rates, and utility bills, even if your organization’s unable to be productive.
A more accurate formula could look like this:
Cost of Outage = (Minutes of Downtime x (Average Sales per Minute + Average Costs per Minute + Contingency for Lost Business Due to Reputational Damage)) + Recovery Cost
Calculating the lost business contingency value can be the hardest part of the formula. You can usually produce a good ballpark figure by looking at the number of clients you acquire in a typical time period, averaging the extra value they bring to your business, and estimating how many would-be leads have been lost due to the outage.
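As a rough sketch, the extended formula might be expressed like this in Python. Every figure below is made up for illustration, the names are hypothetical, and the contingency helper is only one possible reading of the ballpark method described above (clients acquired per day, average value per client, and the share of leads you estimate are lost while you’re offline).

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class OutageInputs:
    minutes_of_downtime: float
    sales_per_minute: float        # average sales lost per minute
    costs_per_minute: float        # staffing, rates, utilities still being paid
    contingency_per_minute: float  # allowance for lost business (reputation)
    recovery_cost: float           # one-off cost of restoring service


def contingency_per_minute(clients_per_day: float,
                           value_per_client: float,
                           share_of_leads_lost: float) -> float:
    """Assumed ballpark: value of would-be clients lost per minute of downtime."""
    return clients_per_day * value_per_client * share_of_leads_lost / MINUTES_PER_DAY


def outage_cost(inputs: OutageInputs) -> float:
    """Minutes x (sales + costs + contingency per minute) + recovery cost."""
    per_minute = (inputs.sales_per_minute
                  + inputs.costs_per_minute
                  + inputs.contingency_per_minute)
    return inputs.minutes_of_downtime * per_minute + inputs.recovery_cost


# Hypothetical inputs: 12 new clients a day worth $480 each, half lost during the outage.
contingency = contingency_per_minute(12, 480, 0.5)
inputs = OutageInputs(
    minutes_of_downtime=30,
    sales_per_minute=6.94,
    costs_per_minute=4.00,
    contingency_per_minute=contingency,
    recovery_cost=500,
)
print(f"Estimated cost of outage: ${outage_cost(inputs):.2f}")
```

Keeping the inputs in one structure makes it easier to revisit individual estimates, such as the contingency value, without touching the formula itself.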
There are methods that can make this formula even more accurate, if you’re prepared to deal with extra complexity. As an example, you could multiply the duration of downtime by a logarithmic coefficient to recognize that longer outages are usually more likely to impact customer acquisitions.
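The article doesn’t prescribe a particular coefficient, so the following Python sketch shows just one possible shape. It assumes a coefficient of 1 + ln(1 + minutes / scale), applied to the reputational part of the cost only; the function names and the 60-minute scale are illustrative choices, not an established method.

```python
import math


def log_coefficient(minutes_of_downtime: float, scale_minutes: float = 60.0) -> float:
    """Assumed weighting: 1.0 for a zero-length outage, growing slowly with duration."""
    if minutes_of_downtime < 0:
        raise ValueError("minutes_of_downtime must not be negative")
    return 1.0 + math.log1p(minutes_of_downtime / scale_minutes)


def reputational_cost(minutes_of_downtime: float,
                      contingency_per_minute: float,
                      scale_minutes: float = 60.0) -> float:
    """Duration, weighted by the coefficient, times the per-minute contingency."""
    weight = log_coefficient(minutes_of_downtime, scale_minutes)
    return minutes_of_downtime * weight * contingency_per_minute


# Compare a short, medium, and long outage with a hypothetical $2/minute contingency.
for minutes in (5, 30, 240):
    weight = log_coefficient(minutes)
    print(f"{minutes:>3} min: coefficient {weight:.3f}, "
          f"reputational cost ${reputational_cost(minutes, 2.0):.2f}")
```

With this shape, each extra minute of a long outage counts for more than a minute of a short one, but the weighting flattens out rather than growing without limit.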
What’s the Average Cost of Downtime?
The cost of downtime varies wildly between industries and individual organizations. It depends on the number of people who’ll be impacted, how instrumental the affected service is to your portfolio, and how quickly you recover.
Gartner estimates that major network outages normally incur costs to the tune of $5,600/minute for organizations operating at an enterprise scale. Delta Air Lines calculated the cost of its five-hour IT meltdown in 2016 at $150 million, illustrating how expensive a relatively short event can prove.
At the other end of the spectrum, a report by IDC for Carbonite found outages cost smaller businesses between $137 and $427 a minute.
The effect on these firms can be particularly acute. The lost revenue and potential regulatory fines create cash flow pressures that can cast doubt on long-term viability. Bringing in money is often the primary objective of these firms, and running out of operating funds is one of the most common reasons for failure.
Minimize Downtime Costs
With unplanned downtime incurring such devastating costs, what can you do to cut your expenses should disaster strike? Although downtime can’t be avoided entirely, acknowledging the risk and planning for it can help lessen its effects.
Put a Disaster Recovery Plan in Place
Establishing a clear disaster recovery plan should be your first step. This plan needs to identify critical components of your system, establish recovery time objectives, and document the processes to follow in the event of an outage.
You should store this document centrally inside your organization’s knowledge hub or operations manual, where anyone can retrieve it quickly when the pager pings, whatever situation they’re facing.
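To make the idea of recovery time objectives concrete, here’s a minimal Python sketch of how a plan’s critical components and their targets could be recorded and checked after an incident. The component names, objectives, and runbook locations are all hypothetical; a real plan would hold far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTarget:
    component: str    # critical component identified in the plan
    rto_minutes: int  # recovery time objective for that component
    runbook: str      # where the documented recovery process lives


def breached_targets(targets: list[RecoveryTarget],
                     outage_minutes: dict[str, int]) -> list[RecoveryTarget]:
    """Return the components whose outage lasted longer than their objective."""
    return [t for t in targets
            if outage_minutes.get(t.component, 0) > t.rto_minutes]


# Hypothetical plan entries and incident durations.
plan = [
    RecoveryTarget("checkout-api", rto_minutes=15, runbook="ops-manual/checkout"),
    RecoveryTarget("inventory-db", rto_minutes=60, runbook="ops-manual/inventory"),
]
incident = {"checkout-api": 42, "inventory-db": 20}

for target in breached_targets(plan, incident):
    print(f"{target.component} exceeded its {target.rto_minutes}-minute objective")
```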
Open Communication Channels
It’s also important to facilitate clear communication between members of your restoration team. Primary and backup communication channels should be available, so you’re not dependent on a single platform. Inaccurate tracking and relaying of key discoveries will hinder your recovery effort and lead to delays that raise your costs.
Address Single Points of Failure
Another way to mitigate damage is to redesign your system to eliminate single points of failure. You can increase redundancy by distributing components across different cloud providers and ensuring data is replicated to multiple failover nodes.
If all your application servers connect to a single database instance, you face catastrophic server downtime if that machine fails. Deploying database replicas behind a load balancer would prevent downtime by automatically routing traffic to one of the healthy instances.
Writing postmortems after incidents and outages helps your organization identify action items to prevent issues from reoccurring, reducing future outages.
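The database replica setup described in this section can be illustrated with a small Python sketch. It’s a toy, in-memory model only: the `Replica` and `LoadBalancer` classes are hypothetical, health is a simple flag rather than a real health check, and the round-robin policy is just one of several routing strategies a real load balancer might use.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True


class LoadBalancer:
    """Round-robin router that skips replicas marked as unhealthy."""

    def __init__(self, replicas: list[Replica]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = list(replicas)
        self._next = 0

    def route(self) -> Replica:
        # Try each replica at most once, starting from the round-robin cursor.
        for _ in range(len(self._replicas)):
            candidate = self._replicas[self._next]
            self._next = (self._next + 1) % len(self._replicas)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy replicas available")


replicas = [Replica("db-1"), Replica("db-2"), Replica("db-3")]
balancer = LoadBalancer(replicas)

replicas[0].healthy = False  # simulate the first machine failing
served_by = [balancer.route().name for _ in range(4)]
print(served_by)  # ['db-2', 'db-3', 'db-2', 'db-3']
```

Traffic keeps flowing as long as at least one replica is healthy. The single-instance design has no such fallback, which is what makes it a single point of failure.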
Be Honest with Your Customers
An outage tends to prompt an all-out recovery effort inside the affected organization. This can overshadow the need to communicate regularly with customers. In many recent outages, such as Atlassian’s incident in April 2022, users were kept in the dark about the true severity of the problems.
Admitting an outage can be painful, but failing to inform users promptly creates a greater risk of reputational damage. Providing regular status updates with precise information about the recovery effort can help maintain confidence in your solution, curbing the long-term cost of downtime.
Final Thoughts
Downtime will always be costly. Service outages jeopardize your platform’s perceived reliability and your company’s sales opportunities, and they are devastating to users. The ability to accurately forecast the true cost of downtime is important to establish meaningful service-level agreements (SLAs) and identify optimal outage mitigation strategies.
In this article, you’ve seen how to calculate the cost of downtime using a simple formula. You’ve also learned some ways to reduce downtime costs, such as eliminating single points of failure and implementing a disaster recovery plan.
To rapidly respond to downtime, you need timely alerts for when a new outage begins. Lightstep is a cloud-native reliability platform that provides a monitoring, observability, and incident response solution for systems at any scale.