How CrowdStrike Stopped Everything. "The failures cascaded as dependent systems crashed, halting operations across multiple sectors."
Link information
- Title: How CrowdStrike Stopped Everything
- Authors: David Geer, Saurabh Bagchi, Yael Erez, Koby Mike, Orit Hazzan
- Published: Aug 30, 2024
- Word count: 1,446 words
The article mentions that less than 1% of Windows installations were affected, but I think the more important metric is what percentage of people were affected. 1% of machines represents a far larger share of people, since so much critical infrastructure was impacted. For every one computer tied to a 911 call center, how many people could not receive timely medical care? Were there any deaths as a result? What about the hospitals impacted? Was patient care affected long term?
If 1% of machines were affected, but those machines were core components of society, then CrowdStrike needs to be held accountable for such an out-of-the-ordinary and negligent failure.
I actually disagree.
I'm heavy in the IT field. Shit breaks. The bigger issue is reliance on a single point of failure.
The fact that companies put all their eggs in one basket and run CrowdStrike on ALL their servers is a case in point. Since the late 90s, we split servers across different AV products so that if one failed, the other should catch it and the whole infrastructure wouldn't get nuked. I came from corporate IT, and when I left, they still had that mentality.
There's a reason they don't run critical infrastructure on spaceships with Windows. There's a reason manufacturing uses AS/400 mainframe systems. Windows has never been known for its stability. It's much better these days, but it's still taken out easily, hence CrowdStrike breaking it.
From MS's point of view, they were forced to allow kernel access, whereas Apple never has been. If they could have filtered that access, the BSODs would never have happened. From CrowdStrike's side, nothing glitched before they hit the button, and everything didn't just magically fall over on its own. A lot of stars had to align for this to happen.
Best outcome: CRWD gives companies testing rings and lead time before a patch is released to production.
I'm all for holding companies accountable for running critical infrastructure on identical systems, but at the same time, we've known how to deploy widely used software for decades. You don't do it all at once, you don't ignore user settings on deployment, and you don't do it on a Friday.
CrowdStrike's methods completely ignored all of this, and quite a few fewer stars needed to align because of it. They are absolutely at fault for that alone.
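To make the "testing rings, not all at once, not on a Friday" point concrete, here's a minimal sketch of what ring-based rollout gating can look like. The ring names, fleet fractions, bake times, and the weekday rule are illustrative assumptions, not CrowdStrike's actual process.

```python
"""Minimal sketch of ring-based rollout gating. Ring names, sizes, and
bake times are illustrative assumptions, not CrowdStrike's process."""
import datetime

ROLLOUT_RINGS = [
    # (ring name, fraction of fleet, minimum bake time before promotion)
    ("canary",  0.001, datetime.timedelta(hours=4)),
    ("early",   0.01,  datetime.timedelta(hours=24)),
    ("broad",   0.25,  datetime.timedelta(hours=48)),
    ("general", 1.0,   datetime.timedelta(0)),
]

def next_ring(current: int, ring_healthy: bool,
              time_in_ring: datetime.timedelta) -> int:
    """Promote only if the current ring is healthy and has baked long enough."""
    name, _, bake_time = ROLLOUT_RINGS[current]
    if not ring_healthy:
        raise RuntimeError(f"Rollout halted: failures detected in ring '{name}'")
    if time_in_ring < bake_time:
        return current                       # keep baking, don't promote yet
    return min(current + 1, len(ROLLOUT_RINGS) - 1)

def in_deploy_window(now: datetime.datetime) -> bool:
    """The 'not on a Friday' rule: only push Monday through Thursday."""
    return now.weekday() < 4                 # Monday=0 .. Thursday=3
```

The point is that promotion is gated on health and elapsed time, so a bad update stops at a tiny fraction of the fleet instead of reaching everything at once.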
Edit:
Just to go into more detail on this: there's literally no way to be responsible with an error like this.
Your system NEEDS AV software, and some systems NEED (for a given value of "need") to be Windows. You're not going to go to two separate providers for that, but the responsible thing to do is tier your failover server so it doesn't update when the main one does. That way, if something critical fails, your system fails over to the unpatched server and you're good to go.
The whole issue is their deployment either didn't allow this, or completely ignored it, which is fucking insane to me.
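Here's a sketch of that staggered-update idea: the failover node only takes a new content version after the primary has run it cleanly for a soak period. The `Node` shape, the 24-hour soak window, and the health flag are assumptions for illustration, not any vendor's real mechanism.

```python
"""Sketch of staggering updates between a primary and its failover, so a
bad update never takes out both. The soak period and Node fields are
illustrative assumptions."""
from dataclasses import dataclass
import datetime

SOAK_PERIOD = datetime.timedelta(hours=24)   # assumed; tune per environment

@dataclass
class Node:
    name: str
    version: str
    last_update: datetime.datetime
    healthy: bool = True

def failover_may_update(primary: Node, failover: Node, new_version: str,
                        now: datetime.datetime) -> bool:
    """Allow the failover to update only after the primary has run the new
    version cleanly for the full soak period."""
    if failover.version == new_version:
        return False                                 # already updated
    if primary.version != new_version:
        return False                                 # primary hasn't taken it yet
    if not primary.healthy:
        return False                                 # primary broke on the update
    return now - primary.last_update >= SOAK_PERIOD  # primary has soaked long enough
```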
Considering every machine running their software fell over almost instantly, I strongly suspect they don't have automated integration tests. The CI/CD pipeline should spin up a VM containing the software, and if something as obvious as failing to boot happens, the "smoke test" should fail and the deployment should be halted.
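For what it's worth, a rough sketch of that kind of boot smoke test, using QEMU and the serial console (POSIX-only as written). The image path, timeout, and the `login:` boot marker are assumptions; this is not CrowdStrike's actual pipeline.

```python
"""Rough sketch of a boot smoke test: start a headless VM with the new
content baked into the image and fail the pipeline if the guest never
reaches a ready state. The QEMU flags, image path, timeout, and boot
marker are assumptions, not anyone's real pipeline. POSIX-only (select)."""
import select
import subprocess
import time

BOOT_TIMEOUT_S = 300
BOOT_MARKER = b"login:"   # assumed string the guest prints on the serial console when up

def vm_boots(image_path: str) -> bool:
    """Boot the image headless and watch the serial console for the marker."""
    proc = subprocess.Popen(
        ["qemu-system-x86_64", "-m", "2048", "-nographic",
         "-drive", f"file={image_path},format=qcow2"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    deadline = time.monotonic() + BOOT_TIMEOUT_S
    seen = b""
    try:
        while time.monotonic() < deadline:
            ready, _, _ = select.select([proc.stdout], [], [], 1.0)
            if not ready:
                continue
            chunk = proc.stdout.read1(4096)
            if not chunk:
                return False          # VM process exited before booting
            seen += chunk
            if BOOT_MARKER in seen:
                return True           # guest reached a usable state
        return False                  # timed out: treat as a failed boot
    finally:
        proc.kill()

if __name__ == "__main__":
    # Non-zero exit halts the deployment stage of the CI/CD pipeline.
    raise SystemExit(0 if vm_boots("image-with-new-content.qcow2") else 1)
```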
I believe the argument is that, because of the way the change was tagged, it didn't go through that process, since supposedly that kind of change shouldn't have been able to cause this kind of failure (I could be wrong, I haven't followed it that closely).
It still doesn't make sense to mass-deploy any change, though.
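If the tagging explanation is right, it argues for a gate that can't be skipped by classification. A tiny sketch; the change-type names here are invented and don't correspond to CrowdStrike's internal categories.

```python
"""Tiny sketch of a pipeline gate that never skips the boot smoke test,
regardless of how a change is classified. Change-type names are invented."""

REQUIRED_CHECKS = {"unit_tests", "boot_smoke_test"}   # run for every change, no exceptions

# Per-type extras only ever ADD checks; no classification can remove the
# required ones, because "this kind of change can't break boot" is exactly
# the assumption that failed here.
EXTRA_CHECKS_BY_TYPE = {
    "content_update": {"schema_validation"},
    "sensor_update":  {"full_integration_suite"},
}

def checks_for(change_type: str) -> set[str]:
    return REQUIRED_CHECKS | EXTRA_CHECKS_BY_TYPE.get(change_type, set())

assert "boot_smoke_test" in checks_for("content_update")
```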
Wouldn't the same problem also apply on Linux (if something crashes in kernel space it will take the whole system down)?