The CrowdStrike Saga: A calm and rational retrospective

Paul Ducklin

07/26/2024

Some of it was true

In the post-apocalyptic song London Calling, way back in December 1979, seminal British rock band The Clash famously reflected on the unreliability of reporting in the aftermath of disastrous events.

Feel free to sing along, if you know the catchy tune:

  London calling! 
     Yes, I was there, too.
  And you know what they said? 
     Well, some of it was true.

The recent CrowdStrike software crash on Friday 2024-07-19, which left lots of computers temporarily out of action and threw important online services such as credit card payments and airline boarding abruptly into disarray, led not merely to a London Calling scenario, but to a Planet Earth Bellowing At Top Volume situation.

Headlines quickly gave us the nickname The Global IT Outage, and as reasonable as this description sounded at the time, the quantity and volume of the activity that followed on social media and on news sites didn’t give the impression of an online outage at all.

The internet as a whole seemed beset more by outrage than by outage.

Over the weekend and the next few days, we were deluged with a flurry of opinions, finger-pointing, truth, fiction, guesswork, experts bellowing blame at CrowdStrike, counter-experts leaping on soapboxes to exonerate the company, and not a few “definitive” explanations that turned out to be dangerously and divisively wrong.

Some of it was true, but what about all the rest?

Here’s a gentle, objective, and not-too-technical retrospective, presented without taking sides, so we can take stock of what happened, and why, and how…

…and so we can think about how to reduce the drama that a similar SNAFU might cause in future.

Are you sitting comfortably?

What happened, why, and how?

On Friday 2024-07-19, endpoint threat-blocking vendor CrowdStrike pushed out a signature update (a new data file) to give its Windows security software additional information on how to spot various new cybercrime tricks that could be signs of an attack.

Many cybersecurity products consider this sort of data-only change to be neither dangerous nor controversial, and send out such updates regularly, frequently, and automatically.

Technically, it’s not a patch, because the underlying program code hasn’t been modified to fix known bugs, and it’s not an upgrade, because no new features have been added.

The security application is still running the same underlying software, just with different data telling it what to look for.

The overall risk of rushing out this sort of update is generally considered low by vendors, CISOs and IT managers alike, especially compared to the possible cost of falling behind the crooks and getting breached as a result.

Why wait to push out new data if the code is unchanged?

In this case, sadly, an as-yet-undiscovered bug in CrowdStrike’s existing threat monitoring software meant that just loading the new signature file triggered a fault in the code.

CrowdStrike’s software attempted to read data from a memory location that didn’t exist, meaning that it had got itself into an inconsistent and unpredictable state, leaving the operating system no choice but to shut it down.
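To see how a data-only update can trip a latent bug in code that hasn’t changed, here is a minimal Python sketch. The file layout (a two-byte index at the start of the data) and the names `HANDLERS`, `load_rule_unchecked` and `load_rule_checked` are hypothetical, invented for illustration; they are not CrowdStrike’s actual format or code. In Python, an out-of-range index raises an exception; in kernel code written in a language such as C, the equivalent mistake can end up reading memory that isn’t there.

```python
import struct

# Hypothetical table of checks that the (unchanged) code knows how to run.
HANDLERS = ["match_process_name", "match_file_path", "match_registry_key"]


def load_rule_unchecked(blob: bytes) -> str:
    """Trust the index stored in the data file (the latent bug)."""
    (index,) = struct.unpack_from("<H", blob, 0)  # unsigned 16-bit, little-endian
    return HANDLERS[index]  # fails if the data asks for a slot that doesn't exist


def load_rule_checked(blob: bytes):
    """Validate the data before using it, and reject it rather than crash."""
    if len(blob) < 2:
        return None
    (index,) = struct.unpack_from("<H", blob, 0)
    if index >= len(HANDLERS):
        return None
    return HANDLERS[index]


old_data = struct.pack("<H", 1)  # stays inside the table, so the bug stays hidden
new_data = struct.pack("<H", 7)  # new data, same code: the index is out of range

print(load_rule_checked(old_data))
print(load_rule_checked(new_data))
```

The point of the checked version is that bad data gets refused, not obeyed, so a faulty update degrades detection instead of taking the software down.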

What turned a crash into a full-blown outage?

When applications get shut down unexpectedly by the operating system, we generally say that they have crashed, because that’s what it feels like if we are using them at the time.

Crashes are never good, because they mean that an app has gone haywire, and they can have bad side-effects.

A word processor crash can leave you with lost changes or a corrupted document; a sudden web server crash will at best leave visitors unimpressed, and at worst cause purchases to fail or orders to be lost.

Fortunately, however, Windows programs generally run independently alongside each other, carefully regulated by the operating system.

The operating system ensures, amongst other things, that programs only look at files they’re authorised to see, only access memory that has been allocated for their own use, and don’t mess directly with hardware such as Wi-Fi adapters and webcams that need to be shared safely between all programs in the system.

For the most part, applications that crash don’t bring down the entire computer, as long as the operating system itself isn’t corrupted in the process.
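You can watch this isolation at work with a small Python sketch, which starts a child process that deliberately aborts and then carries on regardless. The helper name `run_app` and the stand-in “app” are assumptions made for illustration; the sketch shows ordinary user-level processes only, not kernel drivers.

```python
import subprocess
import sys

# A child program that deliberately aborts, standing in for a crashing app.
CRASHING_APP = "import os; os.abort()"


def run_app(source: str) -> int:
    """Run Python source in a separate process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stderr=subprocess.DEVNULL,  # hide the child's crash message
    )
    return result.returncode


status = run_app(CRASHING_APP)
print("child exit status:", status)  # non-zero means the app died abnormally
print("parent is still running normally")
```

The operating system cleans up after the failed child, and the parent, along with everything else on the computer, keeps going.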

After a crash, if you can still access the affected computer, you can, in theory at least, update your buggy web server to fix the problem, roll back to an earlier version of your word processor, or modify your system configuration to sidestep the bug and to restore at least some level of service.

Unfortunately, the CrowdStrike software that crashed is what’s known as a kernel driver, which isn’t a regular sort of program at all.

In Windows, kernel drivers are software extensions that become part of the operating system kernel itself, where the word kernel is the metaphorical jargon term for the code at the core of the operating system.

Very loosely speaking, when a Windows kernel driver crashes, that driver can’t safely be shut down and restarted as if it were a regular application, because it is effectively the operating system itself that has crashed, so it can no longer be trusted to manage the rest of the system safely.

Simply put, terminating and restarting a buggy kernel driver means shutting down and rebooting the entire computer, which generally causes a more significant interruption than a single failed app that leaves everything else around it running.

When a kernel-level crash happens, Windows conveys the bad news by popping up a minimalist, full-screen warning known colloquially as a Blue Screen of Death (BSoD), complete with a “sad face” text emoticon :(, like the image you see at the top of this article, so you know that something worse than usual has happened.

By default, the computer will restart by itself and reload Windows, leading to a brief but complete system outage, during which time the affected computer won’t be usable locally, or accessible over the network.

Nevertheless, in an ideal world, this reboot will resolve the problem, or at least restore access to the affected system long enough for you to troubleshoot it, patch it, reconfigure it, or otherwise get it fixed.

Why didn’t the BSoD reboot repair the trouble?

You can probably guess what happened next, even if you didn’t witness it yourself.

Windows cybersecurity software isn’t supposed to add itself into the kernel simply because it wants to, but because it needs to, in order to keep the closest possible eye out for the tools and techniques used both by malware and by human cybercriminals.

And if you are going to keep a watch “from the inside”, you might as well get started as soon in the bootup process as you can, in the hope that you will be on patrol before any crooks who might be lurking in the system already.

Microsoft officially recognises this by providing a mechanism known by the acronym ELAM, short for early launch anti-malware, so that security-related drivers can get themselves activated at a preferentially early stage in the bootup process.

But cybersecurity software is notoriously complex, typically gets upgraded and updated more frequently than most other sorts of program, and often includes functions that deliberately interfere with system activities that it thinks are suspicious.

This means that kernel driver bugs in cybersecurity software can be considered not only more likely than bugs in other drivers, but also more likely to interfere disruptively with the system as a whole.

Even worse, if kernel drivers do fail, they could go wrong so early in the bootup process that the system crashes almost as soon as you turn it on, long before the logon screen appears, and before the computer can be accessed across the network.

This can lead to a cascade of BSoD events, where a reboot triggers a crash, which triggers a BSoD, which triggers another reboot, which triggers another crash, over and over so that Windows never gets as far as starting up normally.

And that’s what happened in the CrowdStrike incident: affected computers ended up stuck in a crash-and-try-again cycle known self-descriptively as a bootloop.

Why didn’t CrowdStrike fix this quickly?

To be fair, the company did just that, or at least tried to.

According to CrowdStrike’s own remediation guidance, a faulty signature file dated 2024-07-19T04:09:00Z was superseded by a repaired signature file dated 2024-07-19T05:27:00Z, less than an hour and twenty minutes later.
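If you want to check the arithmetic, the gap between the two timestamps quoted above can be worked out in a couple of lines of Python (the variable names are ours):

```python
from datetime import datetime

# The two timestamps quoted above, written with an explicit UTC offset.
faulty = datetime.fromisoformat("2024-07-19T04:09:00+00:00")
repaired = datetime.fromisoformat("2024-07-19T05:27:00+00:00")

gap = repaired - faulty
print(gap)  # 1:18:00
```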

Once the new signature was available, CrowdStrike’s product on any affected computer could, in theory, heal itself by downloading the update, which the buggy driver would read safely next time.

In practice, however, the buggy driver tended to load and crash before the rest of the operating system had time to get going properly.

There often wasn’t time for Windows to activate the network, launch CrowdStrike’s updater, and successfully fetch the fix before the system crashed again.

Some users got lucky, and their systems did repair themselves; indeed CrowdStrike’s own recovery information includes this advice:

Reboot the [computer] to give it an opportunity to download the [patched] file. We strongly recommend putting the [computer] on a wired network (as opposed to Wi-Fi) prior to rebooting as the [computer] will acquire internet connectivity considerably faster via ethernet.

(Some commenters came up with the suggestion of rebooting over and over, no matter that it seemed hopeless to keep on doing so, saying that even after 15 failures or more, they unexpectedly but happily managed to recover their systems automatically.)
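The “keep rebooting” advice makes sense if you think of each boot as a race between the crash and the download. Here is a toy Python simulation of that race; every timing in it is invented purely for illustration, and the function name `boots_until_recovered` is hypothetical. It is not a model of any real computer or network.

```python
import random


def boots_until_recovered(crash_after_s, fetch_range_s, max_boots, rng):
    """Return the boot number on which the fix arrived before the crash, or None."""
    low, high = fetch_range_s
    for boot in range(1, max_boots + 1):
        fetch_done_s = rng.uniform(low, high)  # when the repaired file would arrive
        if fetch_done_s < crash_after_s:  # the fix wins the race: loop broken
            return boot
    return None  # still stuck in the bootloop


rng = random.Random(2024)  # fixed seed so repeated runs behave the same way

# Assumed numbers: the crash comes 8 seconds into the boot; a wired network
# is taken to be ready sooner, and more predictably, than Wi-Fi.
print("wired:", boots_until_recovered(8.0, (6.0, 12.0), 30, rng))
print("Wi-Fi:", boots_until_recovered(8.0, (7.5, 20.0), 30, rng))
```

If the download can sometimes finish before the crash, then enough retries will eventually succeed; if it never can, no amount of rebooting will help.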

Other users weren’t so lucky, and their systems weren’t able to repair themselves.

Why not just download the patched file by hand?

Fixing the problem by hand (or using an automated tool such as a PowerShell script) turned out to be surprisingly simple:

Boot into Windows Recovery.
Open a command prompt window.
Change into the CrowdStrike drivers directory.
Delete any files with names starting C-00000291 and ending .SYS.
Reboot normally into Windows.
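The file-matching step above can be sketched in Python as follows. This is an illustration of the logic only, not a recovery tool: Windows Recovery doesn’t come with Python, the function names are hypothetical, and the directory is passed in as a parameter so that nothing here assumes where the files live. The sketch defaults to a dry run that merely reports what it would delete.

```python
from pathlib import Path


def find_faulty_files(driver_dir: Path) -> list[Path]:
    """List files named C-00000291*.sys (matched case-insensitively)."""
    matches = []
    for entry in driver_dir.iterdir():
        name = entry.name.lower()
        if entry.is_file() and name.startswith("c-00000291") and name.endswith(".sys"):
            matches.append(entry)
    return sorted(matches)


def remove_faulty_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Delete the matching files, or just report them when dry_run is True."""
    targets = find_faulty_files(driver_dir)
    if not dry_run:
        for target in targets:
            target.unlink()
    return targets
```

Matching on both the prefix and the extension, and nothing else, is what keeps the other files in the directory safe.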

Easy as that!

Except for two annoying problems for corporate IT staff:

Rebooting into Windows Recovery typically requires someone sitting in front of the computer. Many non-technical remote laptop users needed calling up and coaching through the process. Many servers needed a physical visit by an incident response technician.
Computers protected by BitLocker typically can’t access Windows Recovery without a unique 48-digit cryptographic recovery key. This helps to ensure that stolen laptops can’t be booted directly to a C: prompt, or accessed as the D: drive simply by plugging them into another computer. Even technically savvy users couldn’t fix their own computers without a copy of that key.

For better or worse, most corporate IT departments don’t hand out recovery keys to their users, on the grounds that they might as well be stored centrally and securely, given that they will rarely be needed and are less likely to get lost, leaked, or stolen if they aren’t supplied to every user.

IT teams were therefore looking at calling up hundreds, or perhaps thousands, of users and talking them through the recovery process with great care (deleting the wrong files in the drivers directory could make a bad thing even worse), as well as painstakingly reading a 48-digit recovery key, one digit at a time, to each person they called.
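Reading out 48 digits over the phone invites mistakes, so a quick sanity check can help before anyone types the key in. The Python sketch below relies on a widely documented property of BitLocker recovery keys, which you should treat as an assumption and verify for yourself: the key is written as eight groups of six digits, and each group is a 16-bit number multiplied by 11. The function name `bad_groups` and the example key are made up.

```python
def bad_groups(recovery_key: str) -> list[int]:
    """Return the 1-based positions of groups that fail the basic checks."""
    groups = recovery_key.strip().split("-")
    if len(groups) != 8:
        raise ValueError("expected 8 groups separated by hyphens")
    failures = []
    for position, group in enumerate(groups, start=1):
        ok = (
            len(group) == 6
            and group.isascii()
            and group.isdigit()
            and int(group) % 11 == 0  # assumed rule: each group is a multiple of 11
            and int(group) // 11 < 65536  # assumed rule: the quotient fits in 16 bits
        )
        if not ok:
            failures.append(position)
    return failures


# A made-up key built to satisfy the assumed rules, then one misheard digit.
example_key = "-".join(f"{value * 11:06d}" for value in (1, 2, 3, 4, 5, 6, 7, 8))
misheard_key = example_key.replace("000022", "000023")

print(bad_groups(example_key))
print(bad_groups(misheard_key))
```

A check like this can only tell you which group to read out again; it can’t tell you whether the key belongs to the computer in front of you.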

Worse still, some IT departments found that their official database of recovery keys had never made it into an offline, offsite backup, as good backup habits suggest.

Instead, the recovery keys were stored in their Windows Active Directory database, which couldn’t be accessed until one or both of their Active Directory servers had been freed from their own bootloop outages, creating a Catch-22 situation.
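This sort of circular dependency is easy to miss until you write down what each recovery step relies on. Here is a minimal Python sketch that does so and looks for a loop; the dependency map is a simplified, hypothetical model of the situation described above (including the assumption that the servers themselves need recovery keys), and the function name `find_cycle` is ours.

```python
def find_cycle(needs: dict[str, list[str]]):
    """Return one circular dependency as a list of names, or None if there is none."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:  # we have come back to something still in progress
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dependency in needs.get(node, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in needs:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Hypothetical model: what must be available before each item can be recovered.
recovery_plan = {
    "laptop": ["recovery key"],
    "recovery key": ["Active Directory"],
    "Active Directory": ["domain controller"],
    "domain controller": ["recovery key"],  # assumed: the server is locked too
}

# The same plan with the keys also held in an offline backup that needs nothing.
safer_plan = dict(recovery_plan, **{"recovery key": ["offline backup"]})

print(find_cycle(recovery_plan))
print(find_cycle(safer_plan))
```

Breaking the loop takes only one dependency that doesn’t rely on the systems that are down.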

In other words, the fix for the problem was fairly simple, but getting to the point that the fix could be carried out on each computer was not.

In this way, widespread disruption was caused, leading to the nickname “The Great IT Outage”, with Microsoft’s own network measurements now suggesting that about 8,500,000 Windows computers were affected.

Although that’s a tiny percentage of all the Windows devices in the world, it’s still a large number of computers to be put out of action worldwide at the same time.

What to do for next time?

For IT managers

CrowdStrike and Microsoft now have bootable USB images and remote-boot disk images that can assist with incidents that need Windows Recovery. Even if you aren’t a CrowdStrike user, or avoided an outage this time, study these tools, and learn how to create your own, because this is a handy technique for a crisis.
Invest time in understanding BitLocker and the various configurations available, locally and remotely. Practice on a throwaway Windows image or test network (you can use virtual machines for this). Learn your way around the powerful manage-bde command-line program (BitLocker Drive Encryption Configuration Tool), and be sure you understand BitLocker terminology including PINs, passwords, protectors and recovery keys, also known as numeric passwords.
If you’ve just spent the weekend reading out recovery keys to hundreds of users who wouldn’t normally be trusted with them, you may want to regenerate and securely store new recovery keys for all their computers.
Make sure that you can reliably get at any cryptographic keys you need to access computers you might end up locked out of during an outage. Don’t store those keys where the outage is likely to take them out too. (In simple terms, make sure you don’t lock your car keys in the car.)
More generally, make sure that you have a reliable backup process for all your computers and servers, ideally including at least one backup that is stored offline and offsite. This will help you to recover not only from bootloops or failed updates, but also from more dramatic catastrophes such as broken hardware, fire, flood, theft, and cybercriminal attacks such as data wipers or ransomware. (Remember to practice and perfect the restoration process, too. A backup that you can’t restore in a timely fashion doesn’t count as a backup.)
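As a starting point for exploring manage-bde on a throwaway test system, here is a small Python sketch that builds, but deliberately does not run, a few command lines for reviewing and replacing a volume’s numeric recovery password. The function name is hypothetical, and the options shown are given as we understand them; check them against Microsoft’s documentation, and make sure any new key is safely stored, before running anything like this for real.

```python
def recovery_key_rotation_commands(volume: str = "C:") -> list[list[str]]:
    """Build (but do not run) manage-bde commands for one volume."""
    return [
        # Show the volume's current encryption status.
        ["manage-bde", "-status", volume],
        # Show the existing numeric recovery password protector(s).
        ["manage-bde", "-protectors", "-get", volume, "-Type", "RecoveryPassword"],
        # Remove the old numeric recovery password protector(s).
        ["manage-bde", "-protectors", "-delete", volume, "-Type", "RecoveryPassword"],
        # Add a freshly generated numeric recovery password.
        ["manage-bde", "-protectors", "-add", volume, "-RecoveryPassword"],
    ]


for command in recovery_key_rotation_commands():
    print(" ".join(command))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere, and gives you a list to review and try by hand on a test machine.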

For end users

In an outage that’s affecting a significant swathe of users in the company, don’t panic. As the old saying goes, “Lead, follow, or get out of the way.” Doing nothing until you are certain you have received correct information is better than “helping” inexpertly and possibly making things worse.
Take the time today to learn how to contact your own IT experts in an emergency, and how they will contact you, so you don’t end up accepting advice from dubious sources by mistake. During the CrowdStrike saga, cyberscammers were quick to send out rogue “instructions” in the hope of tricking well-meaning users into subverting their own security.
Don’t jump online and desperately start searching for “cures” that others claim worked for them. As well as the risk of rogue instructions from outright crooks, you may also stumble on inept or incorrect advice from commenters whose “expertise” is in making themselves heard on social media, not in cybersecurity.
If your IT team ask you to be ready for a call in a certain time window, try to prepare yourself in advance to create the best chance of success. If you can, find a quiet and private area to set up your computer; ensure you have network access available if that’s been requested; take pen and paper with you in case you need to write down instructions; and consider using headphones or earbuds to ensure clarity and privacy.

Finally, if you were caught up in this drama, or are still wrangling with its aftermath, don’t throw the baby out with the bathwater and give up on cybersecurity in frustration.

Remember what The Clash said right at the end of the song:

  London calling, 
     at the top of the dial.
  After all this, 
     won't you give me a smile?

Stay calm, keep informed, stay safe.

Why not ask how SolCyber can help you do cybersecurity in the most human-friendly way? Don’t get stuck behind an ever-expanding convoy of security tools that leave you at the whim of policies and procedures that are dictated by the tools, even though they don’t suit your IT team, your colleagues, or your customers!

More About Duck

Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!
