The CrowdStrike "Service Outage": Lessons for Technology Strategists
The CrowdOut incident highlights how vulnerable systems are to disruption—not only by cyber attacks but also by friendly fire.
After seven days, the dust is finally settling on the CrowdStrike outage. While some companies are still struggling to recover from the incident, most have regained their footing, along with the use of their computer systems. Losses stemming from the incident are estimated at just over $1 billion for the banking industry globally (Parametrix). While a billion dollars in losses is nothing to sneeze at, it gives the sense of disaster averted: considering CrowdStrike’s extensive client base, the repercussions might have been far worse, as they were for the airline industry.
Things look far grimmer for insurers, not so much because of service outages affecting their client-facing business as because of payouts on cyber insurance policies. Estimates of eventual insurance losses from claims stemming from CrowdOut range from $400 million (CyberCube) to a whopping $10 billion (Fitch).
CrowdOut is a reminder that systems are vulnerable to disruption not only from cyber attacks but also from friendly fire. While regrettable, it is one of those singular events that presents a good opportunity to learn some valuable lessons.
CrowdStrike provides cloud-based cybersecurity software, including endpoint, data and workload protection. Among other things, the company offers a virus protection product for the ubiquitous Windows operating system. CrowdStrike frequently updates this software to protect against new and evolving viruses and malware. This is a good thing. However, early Friday morning, the firm distributed an update containing a software glitch that disrupted calls between CrowdStrike’s software and the Windows operating system, causing Windows to hang and crash. The glitch affected Windows computers running CrowdStrike that were online at the time (that is, in a state to receive automated updates), a reported 8.5 million computers and servers worldwide.
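The precise root cause is for CrowdStrike and Microsoft to detail, but the general failure mode, a privileged component consuming an automatically delivered content update without surviving malformed input, can be illustrated in a few lines. The sketch below is a minimal, hypothetical Python illustration, not CrowdStrike's actual mechanism; the update format, checks and thresholds are all assumptions. The point is the posture: validate an automated update before loading it, so a bad file is quarantined rather than allowed to take down the host.

```python
import hashlib
import json


class UpdateRejected(Exception):
    """Raised when an automatically delivered content update fails validation."""


def validate_content_update(raw: bytes, expected_sha256: str, max_size: int = 1_000_000) -> dict:
    """Validate a hypothetical content update before handing it to privileged code.

    The checks (size bound, integrity hash, required fields) are illustrative
    assumptions; real endpoint agents use their own formats and signing schemes.
    """
    if not raw or len(raw) > max_size:
        raise UpdateRejected("update is empty or exceeds the expected size bound")

    if hashlib.sha256(raw).hexdigest() != expected_sha256:
        raise UpdateRejected("integrity hash does not match the published value")

    try:
        update = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise UpdateRejected(f"update is not well-formed: {exc}") from exc

    for field in ("version", "rules"):  # required fields are an assumption
        if field not in update:
            raise UpdateRejected(f"missing required field: {field!r}")

    return update


def apply_update(raw: bytes, expected_sha256: str) -> bool:
    """Apply the update only if it validates; otherwise keep the previous rules."""
    try:
        update = validate_content_update(raw, expected_sha256)
    except UpdateRejected as err:
        # Fail closed on the update, not on the host: log the rejection,
        # keep the previous rule set and continue operating.
        print(f"update rejected, previous rules remain active: {err}")
        return False
    print(f"loaded rules version {update['version']}")
    return True
```

A component that treats every automated update as untrusted input degrades gracefully instead of crashing the machine it is meant to protect.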
The effects were widespread, resulting in inoperable ATMs, closed trading floors and other business disruptions across the financial industry. Fortunately, most issues were resolved quickly, within several hours. This contrasts dramatically with some other industries, particularly the airline sector, where some companies continued to grapple with restoring systems for days. The financial industry’s relatively quick recovery can be attributed in part to its heavy emphasis on system availability and disaster recovery planning. It can also be attributed, ironically, to the prevalence of fragmented legacy system architectures and spaghetti-ware connectivity, which helped localize outages within a bank.
This isn’t the first major tech blowout, but it is the first global event on this scale. Recent years have seen a long series of tech service interruptions, but they have mostly been limited to a single region (Armenia’s internet in 2011, Australia’s telecom in 2023) or a single company (Gmail in 2020, Facebook in 2021). The most serious service disruption prior to CrowdOut was the Fastly crash in 2021, which had global reach but was over within an hour. These are just a fraction of the tech outages experienced in recent years, and inevitably there will be more to come.
The CrowdOut incident illustrates how deeply security features have become integrated into tech systems. This is even more the case with modern DevOps and cloud-based solutions built on security-by-design principles. While CrowdStrike and Microsoft will be the ones making the coding and procedural changes to ensure this particular type of glitch doesn’t happen again, the incident presents a good opportunity for tech strategists at financial institutions to reflect on how to increase technology resiliency.
- FIs should ensure they have a fast and effective recovery plan for software failures, including cold system backups, system recovery procedures and business continuity plans. While it is impossible to guard against every potential tech failure, discipline around recovery methods is crucial. An important lesson from the CrowdStrike outage is that well-prepared companies recovered quickly, while others spent days struggling to resurrect systems.
- The outage was a classic case of how a single point of failure can have devastating consequences. FIs should design their software architectures and system dependencies so that a single point of failure will not bring down entire systems. This is more difficult when the failure stems from industry-standard technology, which can trigger very widespread effects.
- Windows might be the prime example of an industry-standard technology, but commercial cloud providers are making significant inroads into FI technology stacks as well. CrowdOut presents a good opportunity for FIs to revisit their cloud-dependent technology and potentially implement multi-cloud architectures to protect against single-source-of-failure incidents (a minimal failover sketch follows this list).
- It is not currently possible or reasonable to review every line of code in every update of every solution, but FIs should double down on due diligence of their vendors’ development, quality control and testing processes. Additionally, FIs should ensure the completeness of their own testing procedures for software updates they deploy on premise, for example through staged rollouts (see the second sketch after this list).
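To make the multi-cloud point above concrete, here is a minimal sketch of the pattern, assuming hypothetical provider clients rather than any real cloud SDK: a thin storage wrapper routes requests to a primary provider and fails over to a secondary one when the primary is unreachable.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Minimal interface a cloud storage client is assumed to expose."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class ProviderUnavailable(Exception):
    """Raised by a client when its provider cannot be reached."""


class MultiCloudStore:
    """Route requests to a primary provider, failing over to a secondary.

    The providers are hypothetical stand-ins; in practice each would wrap a
    real cloud SDK behind the same small interface.
    """

    def __init__(self, primary: BlobStore, secondary: BlobStore) -> None:
        self._providers = (primary, secondary)

    def put(self, key: str, data: bytes) -> None:
        last_error: Exception | None = None
        for provider in self._providers:
            try:
                provider.put(key, data)
                return
            except ProviderUnavailable as err:
                last_error = err  # try the next provider
        raise RuntimeError("all providers unavailable") from last_error

    def get(self, key: str) -> bytes:
        last_error: Exception | None = None
        for provider in self._providers:
            try:
                return provider.get(key)
            except ProviderUnavailable as err:
                last_error = err
        raise RuntimeError("all providers unavailable") from last_error
```

Keeping data replicated and consistent across the two providers is the hard part, which is why multi-cloud designs are usually reserved for the services whose outage would be most damaging.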
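On the last point, one widely used discipline for update testing is a ring-based (canary) rollout: apply an update to a small group of hosts first, confirm health, and widen the blast radius only if the canaries stay healthy. The sketch below shows the control flow; the host and health-check functions are hypothetical placeholders, and this is an illustration of the technique rather than a description of any vendor’s pipeline.

```python
import time
from typing import Callable, Iterable, Sequence


def staged_rollout(
    rings: Sequence[Iterable[str]],          # e.g. [canary_hosts, branch_hosts, all_hosts]
    apply_update: Callable[[str], None],     # deploys the update to one host (hypothetical)
    host_is_healthy: Callable[[str], bool],  # health probe for one host (hypothetical)
    soak_seconds: int = 600,
) -> bool:
    """Roll an update out ring by ring, halting at the first unhealthy ring."""
    for index, ring in enumerate(rings):
        hosts = list(ring)
        for host in hosts:
            apply_update(host)

        time.sleep(soak_seconds)  # let the ring soak before judging it

        unhealthy = [h for h in hosts if not host_is_healthy(h)]
        if unhealthy:
            print(f"halting rollout at ring {index}: unhealthy hosts {unhealthy}")
            return False  # later rings never receive the bad update

        print(f"ring {index} healthy ({len(hosts)} hosts); promoting to next ring")
    return True
```

The same idea applies whether the hosts are a bank’s own estate or a vendor’s customer base: a bad update caught in the first ring is an incident report, not a global outage.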
At the same time, the CrowdStrike failure serves as a wake-up call for vendors to ensure the reliability of their offerings. Vendors should engineer redundancy into their products so that a single point of failure cannot bring down their software. Vendors that rely on other providers for pieces of their technology should perform thorough due diligence on those partners’ development and quality control processes to reduce the risk of friendly-fire flameouts.