The global Cloudflare outage today was caused by a bad regex in a firewall rule that spiked CPU usage to 100% on all machines
Link information (scraped automatically; may be incorrect)
- Title: Cloudflare outage caused by bad software deploy
- Published: Jul 2 2019
- Word count: 121 words
it kinda owns that the entire backbone of the mainstream internet is literally held in the hands of about three services and if even one of those services fucks something up, a large portion of the mainstream internet goes out partially or completely. tech consolidation is a bitch in general, but it's especially a bitch with things like cloudflare.
It's also terrifying from a privacy perspective: such a huge portion of everyone's internet activity goes through Cloudflare, and most people don't even know. In some cases, they have access to more specific data about what a user is doing than the user's ISP would, since they're sometimes effectively the endpoint of the HTTPS connection, decrypting traffic before passing it on to the actual site.
They make claims about how little of that they're tracking/retaining, but we really have no idea.
Assuming this page is accurate, you'd probably want to monitor for (or block) any of your traffic to or from these IP ranges: https://www.cloudflare.com/ips/
As for what you can do about it, probably not much, unless you're comfortable with losing access to a huge chunk of the internet. There's no way to bypass Cloudflare and go to the "source" instead, unless you can figure out the source server's IP through some other method (which won't be possible in general).
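For the monitoring idea above, here is a minimal sketch of checking whether an address falls inside a set of CIDR ranges, using Python's standard `ipaddress` module. The three ranges are only an illustrative sample and may be stale; the current list should come from the page linked above. The names `SAMPLE_RANGES` and `is_in_ranges` are made up for this example.

```python
import ipaddress

# Illustrative sample only (an assumption, possibly out of date).
# Load the current ranges from https://www.cloudflare.com/ips/ before relying on this.
SAMPLE_RANGES = [
    "173.245.48.0/20",
    "104.16.0.0/13",
    "172.64.0.0/13",
]

NETWORKS = [ipaddress.ip_network(cidr) for cidr in SAMPLE_RANGES]


def is_in_ranges(address, networks=NETWORKS):
    """Return True if `address` belongs to any of the given networks."""
    ip = ipaddress.ip_address(address)
    # An IPv6 address is never "in" an IPv4 network (and vice versa),
    # so mixed lists are safe to check this way.
    return any(ip in net for net in networks)


if __name__ == "__main__":
    for candidate in ("104.16.0.1", "8.8.8.8"):
        print(candidate, is_in_ranges(candidate))
```

This only tells you that a connection's remote address is in the listed ranges; it says nothing about which site behind the proxy you were talking to.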
I wonder if it has to do with backreferences and other “fun” ways to get exponential time with regexps.
Probably—an accidental "ReDoS".
When I was building AutoModerator, I specifically used Google's re2 engine instead of a standard one, because it's non-backtracking. Since people would be able to define arbitrary regular expressions, I didn't want them to be able to take out the bot with a complex one intended to cause that "catastrophic backtracking" case. It's not able to support some regex features like lookahead and lookbehind, but it still allows almost all of the common uses.
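To make "catastrophic backtracking" concrete, here is a small sketch using Python's built-in `re` module, which is a backtracking engine. The pattern `(a+)+$` is a textbook pathological case, not the rule from the Cloudflare incident, and `time_failed_match` is a name invented for this example. Each extra `a` in the input should roughly double the time taken to report "no match".

```python
import re
import time

# Textbook pathological pattern (NOT the Cloudflare rule): the nested
# quantifiers give a backtracking engine exponentially many ways to split
# a run of "a"s before it can conclude that the match fails.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Seconds spent discovering that 'a' * n + 'b' does not match."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "b" makes a match impossible
    return elapsed


if __name__ == "__main__":
    # Keep n small: the running time grows exponentially with n.
    for n in (12, 16, 20):
        print(f"n={n:2d}  {time_failed_match(n):.4f}s")
```

A non-backtracking engine such as re2 handles the same pattern and input in time linear in the input length, which is why it is a safer choice when users can supply arbitrary expressions.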
One of re2's authors, Russ Cox, also created the Go standard library's regexp implementation. His series of articles about regexps is the reason I (and I assume a lot of other developers) even know about catastrophic backtracking.
Probably a great choice, because despite my extensive reliance on AutoModerator when I was a mod on a large subreddit, I don't believe I once encountered a need to use any regexp features that weren't supported.
Deploying changes like this in a dry-run / simulated mode is a common best practice, but this gives a reminder that it's not enough.
It's tempting to think that because the rules are in simulated mode, they're low risk, and can be deployed worldwide rather than rolled out region by region, and don't need canary deployments even within a single region.
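As a rough illustration of the alternative, here is a sketch of a staged rollout that stops and rolls back as soon as a stage looks unhealthy. Everything here is hypothetical: `staged_rollout`, the `deploy`/`cpu_usage`/`rollback` callbacks, the host names, and the 80% threshold are assumptions for the example, not a description of how Cloudflare deploys rules.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    cpu_usage: Callable[[str], float],
    rollback: Callable[[str], None],
    cpu_limit: float = 0.80,  # assumed threshold, purely illustrative
) -> bool:
    """Deploy one stage at a time; roll everything back if a stage runs hot.

    Returns True if every stage was deployed, False if the rollout was halted.
    """
    deployed = []
    for stage in stages:
        for host in stage:
            deploy(host)
            deployed.append(host)
        # A real system would wait for a soak period before checking health.
        if any(cpu_usage(host) > cpu_limit for host in stage):
            for host in reversed(deployed):
                rollback(host)
            return False
    return True


if __name__ == "__main__":
    # Simulated fleet where the new rule pegs the CPU on the canary host.
    cpu = {"canary-1": 0.99}
    ok = staged_rollout(
        stages=[["canary-1"], ["eu-1", "eu-2"], ["us-1"]],
        deploy=lambda host: print("deploy", host),
        cpu_usage=lambda host: cpu.get(host, 0.20),
        rollback=lambda host: print("rollback", host),
    )
    print("rollout completed:", ok)
```

The point of the sketch is that the health check applies even to rules in simulated mode: a rule that takes no action on traffic still has to be evaluated, so it can still burn CPU.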
I noticed that the login service for the NYT was disrupted for most of the day today--is this the cause?
In this case I don't think so, or if it was the cause it was indirect. CF's postmortem puts the outage as happening between 13:42 and 14:02 UTC. Downdetector shows some login issues for NYT today, but not really spiking up until around 16:00 UTC.