In the post-apocalyptic song London Calling, way back in December 1979, seminal British rock band The Clash famously reflected on the unreliability of reporting in the aftermath of disastrous events.
Feel free to sing along, if you know the catchy tune:
London calling! Yes, I was there, too. And you know what they said? Well, some of it was true.
The recent CrowdStrike software crash on Friday 2024-07-19, which temporarily put lots of computers out of action and threw important online services such as credit card payments and airline boarding abruptly into disarray, led not merely to a London Calling scenario, but to a Planet Earth Bellowing At Top Volume situation.
Headlines quickly gave us the nickname The Global IT Outage, and as reasonable as this description sounded at the time, the quantity and volume of the activity that followed on social media and on news sites didn’t give the impression of an online outage at all.
The internet as a whole seemed beset more by outrage than by outage.
Over the weekend and the next few days, we were deluged with a flurry of opinions, finger-pointing, truth, fiction, guesswork, experts bellowing blame at CrowdStrike, counter-experts leaping on soapboxes to exonerate the company, and not a few “definitive” explanations that turned out to be dangerously and divisively wrong.
Some of it was true, but what about all the rest?
Here’s a gentle, objective, and not-too-technical retrospective, presented without taking sides, so we can take stock of what happened, and why, and how…
…and so we can think about how to reduce the drama that a similar SNAFU might cause in future.
Are you sitting comfortably?
On Friday 2024-07-19, endpoint threat-blocking vendor CrowdStrike pushed out a signature update (a new data file) to give its Windows security software additional information on how to spot various new cybercrime tricks that could be signs of an attack.
Many cybersecurity products consider this sort of data-only change to be neither dangerous nor controversial, and send out such updates regularly, frequently, and automatically.
Technically, it’s not a patch, because the underlying program code hasn’t been modified to fix known bugs, and it’s not an upgrade, because no new features have been added.
The security application is still running the same underlying software, just with different data telling it what to look for.
The overall risk of rushing out this sort of update is generally considered low by vendors, CISOs and IT managers alike, especially compared to the possible cost of falling behind the crooks and getting breached as a result.
Why wait to push out new data if the code is unchanged?
In this case, sadly, an as-yet-undiscovered bug in CrowdStrike’s existing threat monitoring software meant that just loading the new signature file triggered a fault in the code.
CrowdStrike’s software attempted to read data from a memory location that didn’t exist, meaning that it had got itself into an inconsistent and unpredictable state, leaving the operating system no choice but to shut it down.
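To make the idea of an out-of-range read concrete, here is a toy sketch in Python. It is not CrowdStrike's code or file format: the list of input values, the index supplied by the "new data", and both function names are hypothetical, chosen only for illustration. The point is simply that if data tells unchanged code to read a value that was never supplied, an unchecked read fails, whereas a bounds check lets the code reject the bad data instead.

```python
# Toy illustration only: the data layout and function names are hypothetical,
# not CrowdStrike's real file format or code.

def read_field_unchecked(inputs, field_index):
    """Trust the data file: read whatever index it asks for."""
    return inputs[field_index]  # raises IndexError if the index is too large


def read_field_checked(inputs, field_index):
    """Validate the index that came from the data file before using it."""
    if not 0 <= field_index < len(inputs):
        return None  # reject the rule instead of reading out of bounds
    return inputs[field_index]


if __name__ == "__main__":
    inputs = ["a", "b", "c"]  # the code supplies three values
    bad_index = 3             # the new data refers to a fourth value

    print(read_field_checked(inputs, bad_index))  # the rule is rejected
    try:
        read_field_unchecked(inputs, bad_index)
    except IndexError as err:
        print("unchecked read failed:", err)
```

In Python the bad read raises an exception that the program can catch. In low-level code running inside the kernel, a comparable mistake can touch memory that isn't there at all, which is the sort of failure the operating system cannot simply shrug off.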
When applications get shut down unexpectedly by the operating system, we generally say that they have crashed, because that’s what it feels like if we are using them at the time.
Crashes are never good, because they mean that an app has gone haywire, and they can have bad side-effects.
A word processor crash can leave you with lost changes or a corrupted document; a sudden web server crash will at best leave visitors unimpressed, and at worst cause purchases to fail or orders to be lost.
Fortunately, however, Windows programs generally run independently alongside each other, carefully regulated by the operating system.
The operating system ensures, amongst other things, that programs only look at files they’re authorised to see, only access memory that has been allocated for their own use, and don’t mess directly with hardware such as Wi-Fi adapters and webcams that need to be shared safely between all programs in the system.
For the most part, applications that crash don’t bring down the entire computer, as long as the operating system itself isn’t corrupted in the process.
After a crash, if you can still access the affected computer, you can, in theory at least, update your buggy web server to fix the problem, roll back to an earlier version of your word processor, or modify your system configuration to sidestep the bug and to restore at least some level of service.
Unfortunately, the CrowdStrike software that crashed is what’s known as a kernel driver, which isn’t a regular sort of program at all.
In Windows, kernel drivers are software extensions that become part of the operating system kernel itself, where the word kernel is the metaphorical jargon term for the code at the core of the operating system.
Very loosely speaking, when a Windows kernel driver crashes, that driver can’t safely be shut down and restarted as if it were a regular application, because it is effectively the operating system itself that has crashed, so it can no longer be trusted to manage the rest of the system safely.
Simply put, terminating and restarting a buggy kernel driver means shutting down and rebooting the entire computer, which generally causes a more significant interruption than a single failed app that leaves everything else around it running.
When a kernel-level crash happens, Windows conveys the bad news by popping up a minimalist, full-screen warning known colloquially as a Blue Screen of Death (BSoD), complete with a “sad face” text emoticon :( like the image you see at the top of this article, so you know that something worse than usual has happened.
By default, the computer will restart by itself and reload Windows, leading to a brief but complete system outage, during which time the affected computer won’t be usable locally, or accessible over the network.
Nevertheless, in an ideal world, this reboot will resolve the problem, or at least restore access to the affected system long enough for you to troubleshoot it, patch it, reconfigure it, or otherwise get it fixed.
You can probably guess what happened next, even if you didn’t witness it yourself.
Windows cybersecurity software isn’t supposed to add itself into the kernel simply because it wants to, but because it needs to, in order to keep the closest possible eye out for the tools and techniques used both by malware and by human cybercriminals.
And if you are going to keep a watch “from the inside”, you might as well get started as soon in the bootup process as you can, in the hope that you will be on patrol before any crooks who might be lurking in the system already.
Microsoft officially recognises this by providing a mechanism known by the acronym ELAM, short for early launch anti-malware, so that security-related drivers can get themselves activated at a preferentially early stage in the bootup process.
But cybersecurity software is notoriously complex, typically gets upgraded and updated more frequently than most other sorts of program, and often includes functions that deliberately interfere with system activities that it thinks are suspicious.
This means that kernel driver bugs in cybersecurity software can be considered not only more likely than bugs in other drivers, but also more likely to interfere disruptively with the system as a whole.
Even worse, if kernel drivers do fail, they could go wrong so early in the bootup process that the system crashes almost as soon as you turn it on, long before the logon screen appears, and before the computer can be accessed across the network.
This can lead to a cascade of BSoD events, where a reboot triggers a crash, which triggers a BSoD, which triggers another reboot, which triggers another crash, over and over so that Windows never gets as far as starting up normally.
And that’s what happened in the CrowdStrike incident: affected computers ended up stuck in a crash-and-try-again cycle known self-descriptively as a bootloop.
To be fair, the company did just that, or at least tried to.
According to CrowdStrike’s own remediation guidance, a faulty signature file dated 2024-07-19T04:09:00Z was superseded by a repaired signature file dated 2024-07-19T05:27:00Z, less than an hour and twenty minutes later.
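As a quick check on that arithmetic, the gap between the two timestamps quoted above can be worked out with Python's standard datetime module. The helper names here are our own, used only for this sketch.

```python
from datetime import datetime, timezone

# Format of the timestamps quoted in the remediation guidance (UTC, "Z" suffix).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(stamp):
    """Parse a timestamp such as 2024-07-19T04:09:00Z as a UTC datetime."""
    return datetime.strptime(stamp, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def minutes_between(start, end):
    """Return the number of minutes from start to end."""
    return (parse_utc(end) - parse_utc(start)).total_seconds() / 60


if __name__ == "__main__":
    gap = minutes_between("2024-07-19T04:09:00Z", "2024-07-19T05:27:00Z")
    print(f"{gap:.0f} minutes")  # prints "78 minutes"
```

That's 78 minutes, or one hour and 18 minutes, between the faulty file and its replacement.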
Once the new signature was available, CrowdStrike’s product on any affected computer could, in theory, heal itself by downloading the update, which the buggy driver would read safely next time.
In practice, however, the buggy driver tended to load and crash before the rest of the operating system had time to get going properly.
There often wasn’t time for Windows to activate the network, launch CrowdStrike’s updater, and successfully fetch the fix before the system crashed again.
Some users got lucky, and their systems did repair themselves; indeed CrowdStrike’s own recovery information includes this advice:
Reboot the [computer] to give it an opportunity to download the [patched] file. We strongly recommend putting the [computer] on a wired network (as opposed to Wi-Fi) prior to rebooting as the [computer] will acquire internet connectivity considerably faster via ethernet.
(Some commenters came up with the suggestion of rebooting over and over, no matter that it seemed hopeless to keep on doing so, saying that even after 15 failures or more, they unexpectedly but happily managed to recover their systems automatically.)
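One way to see why stubborn rebooting could pay off is a simple probability model. This is an illustration under stated assumptions, not a measurement: it supposes that each reboot has the same small, independent chance of fetching the fix before the crash, and the 10% figure used below is made up purely for the example.

```python
def chance_of_recovery(p_per_reboot, reboots):
    """Chance that at least one of several independent reboots wins the race.

    p_per_reboot is the assumed chance that a single reboot downloads the
    fix before the driver crashes again (a made-up figure, not measured).
    """
    chance_all_fail = (1 - p_per_reboot) ** reboots
    return 1 - chance_all_fail


if __name__ == "__main__":
    for reboots in (1, 5, 15):
        chance = chance_of_recovery(0.10, reboots)
        print(f"{reboots:2d} reboots: {chance:.0%} chance of recovery")
```

Under those assumptions, a reboot that works only one time in ten gives roughly a 79% chance of success somewhere in 15 attempts, which fits the reports of systems that recovered after many apparently hopeless tries.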
Other users weren’t so lucky, and their systems weren’t able to repair themselves.
Fixing the problem by hand (or using an automated tool such as a PowerShell script) turned out to be surprisingly simple:
Delete any files in the drivers directory with names starting C-00000291 and ending .SYS.
Easy as that!
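The file-matching part of that fix can be sketched in a few lines of Python. This is a minimal sketch rather than an official recovery tool: the function names are our own, the directory to search is left as a parameter for the caller to supply, and the code defaults to a dry run that only reports what it would delete.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Names starting C-00000291 and ending .SYS, compared in lower case
# because Windows filenames are not case-sensitive.
PATTERN = "c-00000291*.sys"


def find_faulty_files(directory):
    """Return the files in directory whose names match the faulty pattern."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and fnmatchcase(path.name.lower(), PATTERN)
    )


def remove_faulty_files(directory, dry_run=True):
    """Delete matching files, or just list them if dry_run is True."""
    matches = find_faulty_files(directory)
    for path in matches:
        if dry_run:
            print("would delete:", path)
        else:
            path.unlink()
    return matches
```

Matching on both the prefix and the extension, and listing the matches before deleting anything, reduces the chance of removing the wrong file.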
Except for two annoying problems for corporate IT staff:
[…] C: prompt, or accessed as the D: drive simply by plugging them into another computer. Even technically savvy users couldn’t fix their own computers without a copy of that key.
For better or worse, most corporate IT departments don’t hand out recovery keys to their users, on the grounds that they might as well be stored centrally and securely, given that they will rarely be needed and are less likely to get lost, leaked, or stolen if they aren’t supplied to every user.
IT teams were therefore looking at calling up hundreds, or perhaps thousands, of users and talking them through the recovery process with great care (deleting the wrong files in the drivers directory could make a bad thing even worse), as well as painstakingly reading a 48-digit recovery key, one digit at a time, to each person they called.
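Reading 48 digits aloud invites transcription slips, so a quick sanity check on the receiving end can help. The sketch below assumes the commonly described layout of a BitLocker recovery key, namely eight groups of six digits in which each group is a multiple of 11. Treat that rule, and the function name, as assumptions of this example rather than an official specification. Passing the check does not prove that a key is the right one for a given computer.

```python
import re


def check_recovery_key(text):
    """Return a list of problems spotted in a typed-in recovery key.

    Assumes eight groups of six digits, each group divisible by 11.
    An empty list means no obvious transcription errors were found.
    """
    groups = re.findall(r"\d+", text)  # split on hyphens, spaces, and so on
    problems = []
    if len(groups) != 8:
        problems.append(f"expected 8 groups, got {len(groups)}")
    for number, group in enumerate(groups, start=1):
        if len(group) != 6:
            problems.append(f"group {number} should have 6 digits")
        elif int(group) % 11 != 0:
            problems.append(f"group {number} fails the divisible-by-11 check")
    return problems
```

Because each group is checked separately, the person on the phone only needs to re-read the group that failed, not all 48 digits.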
Worse still, some IT departments found that their official database of recovery keys had never made it into an offline, offsite backup, as good backup habits suggest.
Instead, the recovery keys were stored in their Windows Active Directory database, which couldn’t be accessed until one or both of their Active Directory servers had been freed from their own bootloop outages, creating a Catch-22 situation.
In other words, the fix for the problem was fairly simple, but getting to the point that the fix could be carried out on each computer was not.
The result was disruption on a large scale, leading to the nickname “The Great IT Outage”, with Microsoft’s own network measurements now suggesting that about 8,500,000 Windows computers were affected.
Although that’s a tiny percentage of all the Windows devices in the world, it’s still a large number of computers to be put out of action worldwide at the same time.
Get familiar with the manage-bde command-line program (BitLocker Drive Encryption Configuration Tool), and be sure you understand BitLocker terminology including PINs, passwords, protectors and recovery keys, also known as numeric passwords.
Finally, if you were caught up in this drama, or are still wrangling with its aftermath, don’t throw the baby out with the bathwater and give up on cybersecurity in frustration.
Remember what The Clash said right at the end of the song:
London calling, at the top of the dial. After all this, won't you give me a smile?
Stay calm, keep informed, stay safe.
Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!