A major IT outage is often dismissed as an unlikely event—something that simply “isn’t going to happen” or “won’t occur on such a scale.” However, the recent massive outage caused by a faulty CrowdStrike update has proven otherwise, echoing the unpreparedness witnessed during the early days of the COVID-19 pandemic. This incident, which grounded flights, cancelled hospital appointments, and disrupted banking systems worldwide, serves as a stark reminder of the fragility of our digital infrastructure. Although it only affected about 1% of Windows computers, its impact was global, illustrating that not all significant IT disruptions need to be cybercrime-induced to cause widespread chaos.
But amidst the chaos, there’s a critical opportunity: to leverage such disasters to raise awareness about the importance of robust Business Continuity Management (BCM). Even if your organisation wasn’t directly impacted, there’s immense value in learning from these events. Observing and understanding the failures of others can often provide more practical insights than experiencing them firsthand. This blog delves into how turning failures and events into opportunities can help you enhance your BCM strategies and safeguard your business against future disruptions.
Understanding the impact
On July 19, 2024, a botched software update at cybersecurity firm CrowdStrike triggered widespread IT chaos across the globe. The disruption began in Australia and rapidly spread through Asia, Europe, and the Americas. The travel industry, in particular, was severely impacted, with airlines, airports, and other critical services experiencing major outages.
CrowdStrike, a leading cybersecurity firm with thousands of global customers, became the epicentre of this chaos. The issue was traced back to a faulty update to their Falcon sensor product, which is designed to prevent cyberattacks. This update contained problematic content that caused Windows systems to crash, resulting in a continuous boot loop where devices repeatedly restarted without completing their boot process. Despite a swift response to roll back the update, the damage was extensive, affecting approximately 8.5 million Windows devices worldwide and disrupting numerous organisations throughout the supply chain.
This incident has had profound repercussions for CrowdStrike. While the company’s financial liability is limited due to the software industry’s licensing structure and existing insurance policies, the impact on its reputation and customer trust has been significant. The outage led to a substantial drop in CrowdStrike’s sales and revenues, underscoring the company’s vulnerability to large-scale technical failures. The following Thursday, CrowdStrike Chief Executive George Kurtz reported that over 97% of Windows sensors were back online, but only after customers had faced considerable losses from business interruptions, downtime, and operational delays. Parametrix, an analytics and insurance provider, estimated the financial impact of the outage on Fortune 500 companies at $5.4 billion, with insurance policies likely covering only 10% to 20% of those losses. Moreover, the chaos gave threat actors an opportunity to exploit the situation, using the downtime to launch phishing and social engineering attacks. National cybersecurity agencies have issued alerts and advisories to warn organisations of potential follow-on threats.
CrowdStrike’s spokesperson stated, “CrowdStrike’s top priority continues to be on our customers and restoring every impacted system.” This incident serves as a crucial reminder that no entity is immune to technology failures, regardless of its reputation or stature. Moving forward, companies and organisations of all stripes must prioritise effective cybersecurity measures and disaster management strategies, recognising that even sophisticated technology can fail. This outage is a sobering example of the importance of robust precautionary measures and the need for resilient operational strategies in the face of potential technological failures.
Could Business Continuity Management (BCM) have prevented this?
CrowdStrike’s own BCM may not have prevented the issue, given that the root cause was related to change management rather than BCM processes. The BCM at affected organisations, however, could have mitigated the impact. Understanding your third-party providers and their roles within your processes is crucial. This awareness allows organisations to assess how changes or failures in these third-party services could affect their operations. Conducting a thorough Business Impact Analysis and developing emergency plans for critical resources can help ensure continuity even when key providers fail.
The CrowdStrike incident highlights the vulnerability of relying heavily on a single provider. Despite the outage being within CrowdStrike’s SLA of 99.9% availability, the disruption was significant. Organisations need to proactively prepare for IT failures, as well as non-IT failures. Key steps for robust BCM include:
- Know Your Third-Party Providers: Understand what each provider does for you and their role in your processes.
- Conduct Business Impact Analysis: Determine the impact and maximum tolerable disruption time, as well as your minimum business continuity objectives and how to achieve them.
- Know Your Critical Resources: Identify critical resources needed for essential activities and ensure you have plans to manage their failure.
- Develop Emergency Plans: Ensure you have plans in place to continue operations even when critical resources fail.
Implementing these measures requires balancing benefits with potential impacts. For instance, CrowdStrike customers might consider having an independent, similar solution in place to manage critical tasks during system outages.
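As a rough illustration of how the key steps above fit together, the following Python sketch keeps a small register of third-party dependencies and flags those where the downtime an availability SLA still permits exceeds what the business process can tolerate, with no emergency plan in place. The provider names, process names, and tolerance figures are purely hypothetical, and comparing a yearly downtime allowance against a single maximum tolerable downtime is a deliberately pessimistic simplification (it assumes the whole allowance could be used up in one outage). It is a sketch under those assumptions, not a complete BCM tool.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime (in hours) that an availability SLA still permits."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


@dataclass
class Dependency:
    process: str                      # business process relying on the provider
    provider: str                     # third-party provider (hypothetical names)
    sla_availability_pct: float       # availability promised in the SLA
    max_tolerable_downtime_h: float   # result of the business impact analysis
    has_fallback: bool                # is there an emergency plan or alternative?


def find_gaps(deps):
    """Return dependencies whose SLA-permitted downtime exceeds the
    tolerable downtime of the process and that have no fallback."""
    return [
        d for d in deps
        if not d.has_fallback
        and allowed_downtime_hours(d.sla_availability_pct) > d.max_tolerable_downtime_h
    ]


# Hypothetical register; every value here is an illustrative assumption.
deps = [
    Dependency("Passenger check-in", "Endpoint vendor A", 99.9, 4.0, False),
    Dependency("Payroll", "Cloud vendor B", 99.9, 72.0, False),
    Dependency("Card payments", "Endpoint vendor A", 99.9, 2.0, True),
]

print(f"99.9% availability permits about "
      f"{allowed_downtime_hours(99.9):.2f} hours of downtime per year")
for gap in find_gaps(deps):
    print(f"No emergency plan: {gap.process} depends on {gap.provider}")
```

With these example figures only the check-in process is flagged: payroll can tolerate far more downtime than the SLA permits, and card payments already have a fallback. The calculation also shows why an outage can sit within a 99.9% SLA and still be severely disruptive.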
The importance of learning from failures
The CrowdStrike incident underscores the importance of leveraging failures as learning opportunities. By examining the causes and impacts of such events, organisations can enhance their resilience. Learning from others’ mistakes is often more beneficial than experiencing them firsthand, as it can help prevent future issues.
In the aftermath of the outage, CrowdStrike and Microsoft have been working diligently to address the problem. A preliminary review revealed that the issue stemmed from a memory safety error in CrowdStrike’s CSagent driver. Microsoft’s analysis highlighted the risks of granting third-party software kernel-level access but also emphasised the security benefits, such as early detection of bootkits and rootkits. Microsoft acknowledged the trade-offs and is moving some core services from kernel mode to user mode to mitigate the risks. To prevent future incidents, Microsoft is enhancing its security update processes and promoting a zero-trust approach with high-integrity attestation. With over 97% of affected PCs now back online, Microsoft is committed to improving end-to-end resilience, as emphasised by John Cable, VP of Windows Program Management. This incident has underscored the importance of balancing security needs with operational risks and the ongoing need for robust, resilient systems.
Summary
In the wake of the unprecedented outage caused by CrowdStrike’s faulty update, the incident has served as a critical reminder of the fragility and interconnectedness of our digital infrastructure. This failure, which disrupted flights, hospital appointments, and banking systems globally, underscores the critical need for robust Business Continuity Management. While some organisations had contingency plans in place, the incident revealed that not all were adequately prepared, stressing the importance of thorough business impact analyses and emergency plans.
The CrowdStrike incident, rooted in a preventable error, affected approximately 8.5 million Windows devices and led to significant global disruption. As Microsoft and CrowdStrike work to resolve the issue, the event serves as a crucial lesson in the need for proactive BCM, understanding third-party dependencies, and preparing for potential failures to ensure operational resilience.
Looking ahead, we should recognise that other major technology and infrastructure providers, such as Fastly, NATS, ICANN, Verisign, and Amazon Web Services, hold similar potential to trigger global disruptions. These entities play crucial roles in our digital ecosystem, and their failures could lead to widespread chaos. Organisations must therefore prioritise BCM alongside cybersecurity, understand their critical dependencies, and prepare for potential failures to ensure operational resilience. The recent outage serves as a stark lesson and an opportunity to bolster our preparedness for future disruptions.
At fernao business resilience, we recognise the complexities of today’s digital landscape and the importance of being prepared for any IT or non-IT related disruptions. Our expert team specialises in developing comprehensive BCM strategies tailored to your organisation’s unique needs. From business impact analyses and risk assessments to creating robust emergency response plans and implementing alternative solutions, we are here to help you safeguard your business against future disruptions. Contact us today to learn how we can assist you in turning potential failures into opportunities for growth and strength.
Image by RNZ