Splunk report finds downtime still costs big money

Unplanned downtime — whether it involves minor service interruptions or major system outages — extends far beyond technical glitches. Downtime affects the foundation of business operations and profitability, resulting in long-term consequences.

Splunk Inc. analyzed unplanned downtime’s financial and nonfinancial implications by surveying 2,000 executives from some of the world’s largest companies — the Global 2000. The data, collected in partnership with Oxford Economics, was published in a new report titled The Hidden Costs of Downtime.

The report contents were unveiled at the company’s recent .conf user event, revealing that direct downtime costs include lost revenue, regulatory fines, penalties and overtime wages. There are also hidden costs related to customer trust. Let’s dive deeper into the findings.

Financial implications of downtime

Unplanned downtime costs Global 2000 companies an astounding $400 billion each year, which equates to about 9% of their profits. The combined direct and hidden downtime costs affect many aspects of a business. Lost revenue, the most significant cost, averages $49 million annually and takes a shocking 75 days to recover. Regulatory fines average $22 million annually, while missed service level agreement penalties account for $16 million annually.
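To put the headline number in per-company terms, here is a rough back-of-the-envelope sketch. It assumes the $400 billion is spread evenly across exactly 2,000 companies, which is a simplification, and the function names are illustrative rather than anything from the report.

```python
# Figures quoted in the article.
TOTAL_ANNUAL_COST_USD = 400e9  # $400 billion across the Global 2000
PROFIT_SHARE = 0.09            # "about 9% of their profits"

# Assumption: the cost is spread evenly over exactly 2,000 companies.
COMPANY_COUNT = 2000


def average_cost_per_company(total_cost, companies):
    """Mean annual downtime cost per company."""
    return total_cost / companies


def implied_total_profit(total_cost, profit_share):
    """Combined annual profit implied by 'cost = share of profit'."""
    return total_cost / profit_share


per_company = average_cost_per_company(TOTAL_ANNUAL_COST_USD, COMPANY_COUNT)
implied_profit = implied_total_profit(TOTAL_ANNUAL_COST_USD, PROFIT_SHARE)

print(f"Average cost per company: ${per_company / 1e6:,.0f} million")
print(f"Implied combined profit: ${implied_profit / 1e12:.2f} trillion")
```

Under those assumptions, the average works out to roughly $200 million per company per year. Because "about 9%" is a rounded figure, the implied profit total is only a rough estimate.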

The 75-day recovery period was interesting, particularly considering Tom Brady’s comments at the recent Cisco Live event. In his Q&A with Cisco Systems Inc. Chief Executive Chuck Robbins, Brady discussed the importance of always putting in a full day’s work. If his team didn’t complete what it needed to on Monday and every other team did, it was a day behind and could struggle to catch up.

Now consider the impact on a business in a highly competitive industry, such as financial services, of being derailed for 75 days while its peers roll on. The company might be able to recover from a single outage, but once two, three or more happen, the effect snowballs, and the business is constantly playing catch-up. That’s never a good place to be.

Downtime also affects shareholder value and stock prices, with share price dropping an average of 9% on a reported outage and requiring approximately 79 days to bounce back. Cyberattacks, particularly ransomware, further strain budgets, with 67% of chief financial officers advising their CEOs to pay ransoms. Those ransom payouts cost companies $19 million every year. Additionally, downtime hampers innovation, with 74% of technology executives reporting delayed product launches and 64% citing stagnant developer productivity.

Downtime affects company finances in other ways. According to chief marketing officers, companies spend an average of $14 million annually on brand trust campaigns to repair their reputations and an additional $13 million on public, investor and government relations programs. CMOs also recognize the significance of minimizing downtime, with 72% stating it’s essential to their role.

Impact on customer and employee trust

When downtime occurs, teams must shift focus from high-value tasks such as launching new products to urgent issues such as applying software patches and conducting postmortems. Downtime also poses personal risks for employees. Some 39% of the respondents reported worrying about being held personally liable, and 38% anticipated a negative impact on performance reviews or job security.

Forty-one percent of the respondents acknowledged that customers often notice downtime before the company does. This damages customer experience, loyalty and public perception, especially when incidents lead to social media backlash. For example, 40% of CMOs said downtime hurts customer lifetime value and relationships with resellers and partners.

Furthermore, downtime has a severe impact on customer relations. Some 29% of companies have lost customers due to downtime, and 44% admitted to downtime damaging their reputation. According to CMOs, it takes about 60 days to restore brand health after an incident.

Causes of downtime

More than half (56%) of downtime incidents are due to security issues such as phishing attacks, while 44% stem from application or infrastructure problems such as software failures. Human error is the primary cause in both scenarios. Rare incidents such as “zero-day” vulnerabilities are harder to detect given their complexity and the lack of standard processes for handling them.

System downtime, despite generally high availability, adds up. A typical Global 2000 company faces an average of 466 hours of cybersecurity-related downtime and 456 hours of application or infrastructure downtime annually.
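To see how those hours add up, the sketch below converts them into days and a share of the calendar year. It assumes the two categories are additive and do not overlap. The hours are presumably aggregated across many systems, so the resulting percentage should not be read as the whole company being offline for that share of the year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

# Average annual downtime per Global 2000 company, as quoted in the article.
SECURITY_DOWNTIME_HOURS = 466
APP_INFRA_DOWNTIME_HOURS = 456


def downtime_summary(*hours):
    """Total the given downtime hours and express them in other units."""
    total = sum(hours)
    return {
        "total_hours": total,
        "equivalent_days": total / 24,
        "share_of_year": total / HOURS_PER_YEAR,
    }


summary = downtime_summary(SECURITY_DOWNTIME_HOURS, APP_INFRA_DOWNTIME_HOURS)

print(f"Total downtime: {summary['total_hours']} hours")
print(f"Equivalent to {summary['equivalent_days']:.1f} full days")
print(f"Share of the year: {summary['share_of_year']:.1%}")
```

Summed this way, the two figures come to 922 hours, or a little over 38 full days of cumulative downtime a year.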

Human error, such as software or infrastructure misconfiguration, often leads to performance issues or security breaches. It takes 17 to 18 hours to detect these errors and an additional 67 to 76 hours to recover. Nearly half of respondents named software failures as frequent outage causes, and 34% cited hardware failures.
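The detection and recovery figures above can be combined into an end-to-end range. This is a simple sketch that assumes recovery begins only after detection, so the two phases add up. The article does not state how the report pairs the low and high ends, so matching low with low and high with high is also an assumption.

```python
# Ranges quoted in the article for incidents caused by human error, in hours.
DETECT_HOURS = (17, 18)
RECOVER_HOURS = (67, 76)


def total_resolution_range(detect, recover):
    """Add the low ends and the high ends of two (low, high) hour ranges."""
    low = detect[0] + recover[0]
    high = detect[1] + recover[1]
    return low, high


low, high = total_resolution_range(DETECT_HOURS, RECOVER_HOURS)

print(f"Detection plus recovery: {low} to {high} hours")
print(f"That is about {low / 24:.1f} to {high / 24:.1f} days")
```

On those assumptions, a single misconfiguration ties up roughly three and a half to four days from the first error to full recovery.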

Software failure remediation typically takes 16 hours, including root cause analysis and postmortems. Though postmortems are an industry best practice for finding and fixing root causes, they can be challenging and time-consuming without proper tools. Only 42% of tech execs said their company conducts postmortems, often skipping them during low-impact downtimes.

Resilience leaders: a model for success

So what distinguishes the companies that are best at mitigating and recovering from downtime? Splunk calls them “resilience leaders.” These companies recover 28% faster from application and infrastructure downtime and 23% faster from cybersecurity incidents. They experience 245 fewer hours of application downtime and 224 fewer hours of security downtime annually.

Resilience leaders suffer least from hidden costs, with most reporting no or only moderate effects. They frequently use generative artificial intelligence tools. They spend significantly more on infrastructure capacity, cyber insurance, backups, cybersecurity tools and observability tools. They often find their current tool spending inadequate, demonstrating a deep awareness of downtime’s broader business impact.

Furthermore, resilience leaders prioritize data management and combining tools, which helps them create more advanced security and monitoring strategies. They understand the financial impact of downtime and make smart investments to avoid it. These companies serve as a model for others, demonstrating that investing strategically and being proactive can significantly reduce downtime and its associated costs.

Recommendations for enhancing resilience

Nearly half of security, information technology and engineering execs find their current levels of downtime unacceptable. The report’s findings underscore that varied budgets, regulations and regional infrastructure pose downtime-related challenges for companies. Therefore, Splunk recommends following these best practices to enhance resilience:

Develop a downtime plan. Downtime is unavoidable, so it’s important to be prepared with the right procedures and tools. This includes monitoring all apps, creating clear outage response plans and assigning specific engineers to handle incidents. Also, regularly conduct tabletop exercises and drills to test system resilience.
Perform postmortems. To prevent recurring issues, perform root cause analysis to identify and fix underlying problems. Invest in observability tools and integrate data from across the environment into a centralized location to isolate root causes.
Safeguard company intellectual property. Protect company intellectual property with a clear data governance policy. Use embedded generative AI features such as domain-specific AI chat assistants to address downtime. These assistants can boost productivity and help enhance employees’ skills.
Integrate teams and tools. Downtime can originate from any source. IT and security operations teams should share tools, data and context to collaborate better, resolve issues faster and identify root causes for quicker recovery.
Adopt a proactive approach. Equip security, IT and engineering teams with AI and machine learning tools for pattern recognition. Predictive analytics can help prevent minor issues from becoming major incidents.
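As a rough illustration of the pattern recognition mentioned in the last recommendation, the sketch below flags metric samples that deviate sharply from their recent history using a trailing z-score. This is one simple, generic technique, not a method described in the report or any Splunk feature. The function name, window size, threshold and latency values are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: skip to avoid dividing by zero.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]

print(find_anomalies(latency_ms))
```

With this sample data, only the 250 ms spike at index 7 is flagged. A production system would need to handle seasonality, noisy baselines and alert fatigue, which this sketch ignores.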