23 votes

The global Cloudflare outage today was caused by a bad regex in a firewall rule that spiked CPU usage to 100% on all machines

11 comments

  1. [4]
    alyaza
    it kinda owns that the entire backbone of the mainstream internet is literally held in the hands of about three services and if even one of those services fucks something up, a large portion of the mainstream internet goes out partially or completely. tech consolidation is a bitch in general, but it's especially a bitch with things like cloudflare.

    16 votes
    1. [3]
      Deimos
      It's also terrifying from a privacy perspective, where such a huge portion of everyone's internet activity is going through Cloudflare, and most people don't even know. In some cases, they even have access to more specific data about what a user is doing than their ISP would, since they're sometimes effectively intercepting HTTPS connections.

      They make claims about how little of that they're tracking/retaining, but we really have no idea.

      15 votes
      1. [2]
        suspended
        "a huge portion of everyone's internet activity is going through Cloudflare"

        How would someone find out and what could be done about it?

        3 votes
        1. Deimos
          Assuming this page is accurate, you'd probably want to monitor for (or block) all of your traffic with any of these IP ranges: https://www.cloudflare.com/ips/

          As for what you can do about it, probably not much, unless you're comfortable with losing access to a huge chunk of the internet. There's no way to bypass Cloudflare and go to the "source" instead, unless you can figure out the source server's IP through some other method (which won't be possible in general).
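For the monitoring half of that, here is a minimal sketch of checking whether an address falls inside a set of CIDR ranges, using Python's standard `ipaddress` module. The three ranges are placeholders that I believe appear on the linked page, but treat them as assumptions and load the current list from https://www.cloudflare.com/ips/ instead. The names `SAMPLE_CLOUDFLARE_RANGES`, `load_networks`, and `is_in_ranges` are made up for this example.

```python
import ipaddress

# Placeholder ranges for illustration only (assumed, not authoritative).
# The current list is whatever https://www.cloudflare.com/ips/ publishes.
SAMPLE_CLOUDFLARE_RANGES = [
    "173.245.48.0/20",
    "104.16.0.0/13",
    "2606:4700::/32",
]


def load_networks(cidrs):
    """Parse CIDR strings into IPv4Network / IPv6Network objects."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_in_ranges(address, networks):
    """Return True if the address belongs to any of the given networks."""
    ip = ipaddress.ip_address(address)
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in networks)


if __name__ == "__main__":
    networks = load_networks(SAMPLE_CLOUDFLARE_RANGES)
    for candidate in ["104.16.1.1", "8.8.8.8", "2606:4700::1111"]:
        print(candidate, is_in_ranges(candidate, networks))
```

This only classifies destination addresses you already have, for example from a firewall or proxy log. It does not tell you which origin server sits behind a Cloudflare address.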

          8 votes
  2. [4]
    ainar-g
    I wonder if it has to do with backreferences and other “fun” ways to get exponential, O(2^n), time with regexps.

    5 votes
    1. [3]
      Deimos
      Probably—an accidental "ReDoS".

      When I was building AutoModerator, I specifically used Google's re2 engine instead of a standard one, because it's non-backtracking. Since people would be able to define arbitrary regular expressions, I didn't want them to be able to take out the bot with a complex one intended to cause that "catastrophic backtracking" case. re2 can't support some regex features like lookahead and lookbehind, but it still allows almost all of the common uses.
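To make "catastrophic backtracking" concrete, here is a toy sketch. It uses the textbook pattern `(a+)+$`, which is an illustrative assumption and not the regex from the outage, and a hand-written counter (`count_attempts`, a hypothetical name) that models how a naive backtracking matcher splits a run of `a` characters into groups. It is a simplified model, so the counts are not the exact step counts of any real engine, but the growth rate is the point.

```python
import re

# Textbook catastrophic-backtracking pattern (illustrative assumption only).
PATTERN = r"(a+)+$"


def count_attempts(text):
    """Toy model of a backtracking matcher for (a+)+$.

    Returns (matched, attempts), where attempts is the number of group
    boundaries tried before the matcher succeeds or gives up.
    """
    attempts = 0

    def match_groups(pos):
        nonlocal attempts
        # Find how far the run of 'a' characters extends from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy: try the longest group first, then back off one char at a time.
        for stop in range(end, pos, -1):
            attempts += 1
            if stop == len(text):  # the $ anchor is satisfied
                return True
            if match_groups(stop):  # otherwise try another (a+) group
                return True
        return False

    matched = match_groups(0)
    return matched, attempts


if __name__ == "__main__":
    print(count_attempts("a" * 10))        # matches on the first attempt
    print(count_attempts("a" * 10 + "b"))  # fails after 2**10 - 1 attempts
    # The standard library's backtracking engine agrees on the result.
    print(re.match(PATTERN, "a" * 10 + "b"))
```

In this model a failing input with n leading `a` characters costs 2^n - 1 attempts, so every extra character doubles the work. A non-backtracking engine such as re2 avoids this by tracking all possible states in one pass over the input.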

      9 votes
      1. ainar-g
        "When I was building AutoModerator, I specifically used Google's re2 engine instead of a standard one, because it's non-backtracking."

        One of the authors of re2, Russ Cox, also created the Go stdlib's regexp implementation. His series of articles about regexps is the reason I (and I assume a lot of other developers) even know about catastrophic backtracking.

        8 votes
      2. emdash
        Probably a great choice, because despite my extensive reliance on AutoModerator when I was a mod on a large subreddit, I don't believe I once encountered a need to use any regexp features which weren't supported.

        4 votes
  3. spit-evil-olive-tips
    "These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production."

    Deploying changes like this in a dry-run / simulated mode is a common best practice, but this is a reminder that it's not enough.

    "Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage."

    It's tempting to think that because the rules are in simulated mode, they're low risk, and can be deployed worldwide rather than rolled out region by region, and don't need canary deployments even within a single region.
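As a rough illustration of the alternative, here is a minimal sketch of a staged rollout that stops at the first stage where CPU usage crosses a limit. Everything in it is hypothetical: the stage names, the `CPU_LIMIT` threshold, and the `apply_rule` / `cpu_usage` callbacks are stand-ins, and this is not a description of Cloudflare's actual deployment tooling.

```python
from typing import Callable, List, Optional, Tuple

CPU_LIMIT = 0.80  # hypothetical abort threshold (fraction of CPU in use)


def staged_rollout(
    stages: List[str],
    apply_rule: Callable[[str], None],
    cpu_usage: Callable[[str], float],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change stage by stage, halting at the first unhealthy stage.

    Returns (healthy_stages, halted_at). halted_at is None if every stage
    stayed under CPU_LIMIT after the change was applied.
    """
    healthy: List[str] = []
    for stage in stages:
        apply_rule(stage)
        # Even a "simulated" rule still executes, so it can still burn CPU.
        if cpu_usage(stage) > CPU_LIMIT:
            return healthy, stage
        healthy.append(stage)
    return healthy, None


if __name__ == "__main__":
    touched: List[str] = []
    result = staged_rollout(
        ["canary", "region-a", "region-b", "global"],
        apply_rule=touched.append,
        cpu_usage=lambda stage: 1.0,  # pretend the rule pegs the CPU
    )
    print(result)   # halts at the canary stage
    print(touched)  # only the canary ever received the rule
```

With a rule that pegs the CPU, only the canary stage is affected and the later stages never see the change, which is the blast-radius limit that a single global deploy gives up.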

    5 votes
  4. [2]
    SovietJugernaut
    I noticed that the login service for the NYT was disrupted for most of the day today--is this the cause?

    3 votes
    1. spit-evil-olive-tips
      In this case I don't think so, or if it was the cause it was indirect. CF's postmortem puts the outage as happening between 13:42 and 14:02 UTC. Downdetector shows some login issues for NYT today, but not really spiking up until around 16:00 UTC.

      4 votes