51 votes

Cloudflare is down causing multiple services to break

39 comments

  1. [11]
    patience_limited
    (edited )
    And this is why hospitals shouldn't migrate their workloads to GCP. This is the second major incident in the last 30 days, and both times we've had end users calling our Help Desk asking why our app is down, instead of contacting their own Help Desks because the wait time for hospital IT was too long. 🤔

    Healthcare is way too vulnerable to shady IT consultants bragging about how much they can cut costs. We've found that critical apps being moved to GCP is a Bad Sign™.

    23 votes
    1. [10]
      okiyama
      I take it you put stock in AWS instead? I do, as well (literally), but this whole putting critical infrastructure that determines if people live or die into the hands of private companies is not a good path forward.

      7 votes
      1. [4]
        patience_limited
        I learned my trade in the days of company-owned infrastructure. You had at least some on-premises servers, and the "cloud" was an off-premises datacenter (or several geographically scattered ones) where you still owned the physical equipment and rented a private cage.

        Downtime did happen. But it was usually mitigated by diversity of systems and software - catastrophes didn't tend to cascade through reliance on single universal tools. There was investment in reliability and redundancy of critical systems (and people!). Patches got staged so all the Citrix gateways didn't go down at once; a subtle Brocade bug only slowed down the most current generation of storage, and so on.

        I don't know if the consolidation and homogenization of outsourced AWS, Azure, or Google Cloud services is as widespread outside the U.S. But our corporate tax laws made compute as an operating expense much more favorable than compute as a capital expense. So everything-as-a-service, with subscription costs instead of ownership costs, has taken over. Even though these services are opaque, less reliable, harder to manage properly, less secure...

        [Insert old person yells at Clouds here.]

        11 votes
        1. [3]
          Greg
          I think there's a healthy middle ground - I've worked for plenty of organisations that can't justify a dedicated sysadmin of any kind, let alone at the level Google or Amazon can afford to hire, so it's very much made sense to treat cloud pricing basically like sysadmin-as-a-service. On demand scaling to load is also one I'd be reluctant to give up!

          But it's always gonna hinge on:

          "There was investment in reliability and redundancy of critical systems (and people!)"

          Because yeah, you can take pride in a bulletproof multi-cloud terraform setup that'll seamlessly load balance from one to the next, and won't even notice an outage beyond some of those allocation numbers shifting down to zero while others ramp up to compensate.

          But you'll only get that after a lot of long conversations with bizdev about how and why we're sticking to nice, clean, industry standard software inside those cloud VMs and not shifting everything to whichever proprietary do-everything service the sales rep was pitching today.
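
A rough sketch of the "allocation numbers shifting down to zero while others ramp up" idea described above. This is not Terraform and not any provider's API; the provider names, the `rebalance` function, and the weights are hypothetical, and a real setup would drive this from health checks in a load balancer or DNS layer.

```python
def rebalance(weights, healthy):
    """Redistribute traffic shares across the providers that are still healthy.

    weights: dict of provider name -> configured relative weight (hypothetical).
    healthy: set of provider names currently passing health checks.
    Returns a dict of provider name -> fraction of traffic (sums to 1.0).
    """
    # Keep only the configured weight of providers that are up.
    live = {name: w for name, w in weights.items() if name in healthy}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy provider left to take traffic")
    # Unhealthy providers drop to zero; the rest scale up proportionally.
    return {name: live.get(name, 0) / total for name in weights}
```

With hypothetical weights of 50/30/20 and the 30 share failing, the remaining two providers end up with 50/70 and 20/70 of the traffic.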

          3 votes
          1. [2]
            patience_limited
            My analytical lens is based in the world of healthcare, which is a vast conglomeration of edge cases. Everything from bog-standard Windows/MS-SQL applications, to weird hybrid PACS [Windows app servers and custom Linux container appliances with fancy storage fabrics and high-capacity optical switches], to ancient VAX/VMS and old VB6 and MS-DOS programs that no one has figured out how to replace, to the endlessly proprietary universes of Epic and other EMRs.

            I've worked for a couple of organizations that went gung-ho in trying to get everything off-prem, and laid off most of their in-house expertise. I'll acknowledge that's made me (perhaps excessively) cynical about the results, but I think there's been some willful blindness to the needs of different application and network architectures.

            5 votes
            1. Greg
              I feel that! It’s so often “let’s uncritically accept the vendor promises and save on paying those pesky salaries” and so rarely “let’s listen to the people we’re paying those salaries to and see how we can make things more efficient in the long term”.

              2 votes
      2. [5]
        teaearlgraycold
        Would it be better to have each hospital with its own data center? Or maybe each town?

        2 votes
        1. patience_limited
          That's a hard call. I've worked with enough different U.S. health systems to have observed wildly varying capacity for managing technology services. Some systems are writing their own software and developing new technologies down to the firmware and chips (usually very well funded private university research institutions). Some (usually rural regional hospitals) can barely fund outsourced IT managed from Chennai.

          I wouldn't mind seeing an open-source "Datacenter in a Box", the conceptual equivalent of the old "Internet in a Box". This would consist of a modular, standardized shipping container with a given amount of vCPU, vRAM, bandwidth, etc. You could run whatever loads on it you wanted to keep local, but mesh that container with a national network so that spare capacity can be distributed and data stored in geographically redundant locations.

          Keep the operating systems minimal, roll updates to fractional portions of the mesh, have regional and national publicly funded management teams... A socialist can dream.
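
For illustration only, "roll updates to fractional portions of the mesh" could look something like the sketch below. The function names, the 5% / 25% / 100% wave sizes, and the `apply_update` / `is_healthy` callbacks are all assumptions made up for this example, not part of any real tool.

```python
def rollout_waves(nodes, cumulative_percents=(5, 25, 100)):
    """Split nodes into waves; each percent is the cumulative share updated so far."""
    waves, done = [], 0
    for pct in cumulative_percents:
        # Ceiling division so that even a small mesh gets a non-empty first wave.
        target = min(len(nodes), -(-len(nodes) * pct // 100))
        if target > done:
            waves.append(nodes[done:target])
            done = target
    return waves


def staged_rollout(nodes, apply_update, is_healthy, cumulative_percents=(5, 25, 100)):
    """Update one wave at a time and stop as soon as a wave looks unhealthy.

    Returns (updated_nodes, completed) so the caller knows what to roll back.
    """
    updated = []
    for wave in rollout_waves(nodes, cumulative_percents):
        for node in wave:
            apply_update(node)
            updated.append(node)
        # Check the wave before touching the rest of the mesh.
        if not all(is_healthy(node) for node in wave):
            return updated, False
    return updated, True
```

The point of the design is that a bad update only ever reaches the waves already started; the rest of the mesh keeps running the old version.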

          8 votes
        2. [3]
          okiyama
          Each hospital should be able to run its essential functions without the internet. It's not happening, but the proper solution would be a socialized basic network.

          7 votes
          1. [2]
            Minori
            Are they not already able to practice without the internet? Charting is important, but it's secondary in life-or-death situations.

            Without internet, any kind of digital charting can't be trusted since it can't sync across locations. For example, if a patient gets a prescription from a doctor then ends up in the ER, incoherent due to a side effect, the hospital can't trust the medical records have up-to-date prescription information.

            3 votes
            1. patience_limited
              It's not just electronic medical records.

              I spend a substantial chunk of my time these days on interfaces, so I've had to work with nurse call vendors on hospital-mandated cloud migration projects.

              Consider the hubris involved in assuming that your on-premises nurse call system will communicate just fine with its cloud-hosted servers, given the latencies of geographic distance, and firewall policies for all of the (dozens of!) TCP port allows, with various layers of NAT. Guess how reliably those old Windows apps behave with elastic CPU, RAM, and storage provisioning. [Some nurse call vendors are in the process of releasing versions that are designed for PaaS, but the cost of ownership is substantially higher, so it might be a decade or more before they're the norm.]

              That's just one safety-critical infrastructure system that's really not well suited for AWS, Azure, or Google Cloud hosting. One project I'm aware of had to roll back to on-prem infrastructure three times, and hasn't migrated successfully yet.

              8 votes
  2. [4]
    Wolf_359
    Pretty sure it was me guys.

    I just built a flashcard app using react and did a pretty shit job overall. It took me about two hours and I used AI (heavily) to build the prototype version.

    It was pretty soon after that firebase went down.

    But my Sp.Ed. students are going to fail the social studies final slightly less than they would have otherwise, so I think it was worth a bit of a worldwide outage.

    27 votes
    1. rogue_cricket
      I actually thought it was me, because I just started a new job as a website reliability engineer...

      9 votes
    2. [2]
      Eji1700
      Can I give you a list of dates and times I'd rather not work too hard, so you can attempt this project again then?

      7 votes
      1. Wolf_359
        Absolutely. I'll be off this summer and my plan is to sink so much time and energy into coding educational apps that my students almost don't fail next year. Win win!

        3 votes
  3. [2]
    hamstergeddon
    Claude and Gemini (AI chatbot agent buzzword things) are down as a result. I almost had to write code like a caveman, but thank the code gods chatGPT still works. (mostly /s, but they are down)

    11 votes
    1. jonah
      What am I supposed to do, write my unit tests manually?

      14 votes
  4. [5]
    Luna
    Cloudflare's postmortem: https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/

    GCP says a full post-mortem for their outage is forthcoming in the next few days. I suspect that will be an interesting read.

    10 votes
    1. Minori
      "Workers KV today relies on a central data store to provide a source of truth for data. A failure of that store caused a complete outage for cold reads and writes to the KV namespaces used by services across Cloudflare."

      "Workers KV is in the process of being transitioned to significantly more resilient infrastructure for its central store: regrettably, we had a gap in coverage which was exposed during this incident."

      It makes sense that they depend on a key value data store somewhere considering the ultra-low latencies they're targeting. Key value data stores are theoretically simple to implement, but they're devilishly complicated to scale and maintain. The complicated design trade-offs are not obvious until you scale to billions of records with global reads and writes.

      I can't blame them for using a tried-and-true offering from a well-known cloud platform. Though after this outage, they're certainly going to be speeding up that migration to an internal solution.

      4 votes
    2. skybrian
      Here is Google’s postmortem:

      https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW

      Null pointer exception caused by unexpected blank fields in a configuration for a quota system, combined with a lack of other safety features that would normally mitigate it: apparently the config came from a Spanner database that was instantly replicated, instead of normal config that would be rolled out gradually. The functionality wasn’t hidden behind a feature flag. A code change was rolled out within 40 minutes, but a lack of exponential backoff caused a thundering herd problem, delaying recovery.

      I’m sure all of that will be fixed, and the next time something like this happens it will be in some unrelated system where best practices were somehow ignored.
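
As a minimal sketch of the exponential backoff mentioned in the summary above: clients that retry immediately and in lockstep are what create a thundering herd, so each retry waits a randomized, growing delay. This is a generic illustration, not Google's code; the function names, default values, and the choice of `ConnectionError` as the retryable error are assumptions.

```python
import random
import time


def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=None):
    """Yield sleep durations: capped exponential backoff with full jitter."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_retries):
        # The ceiling doubles on each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so clients spread out
        # instead of all retrying at the same instant.
        yield rng.uniform(0.0, ceiling)


def call_with_backoff(operation, max_retries=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call operation(), retrying on ConnectionError with jittered backoff."""
    delays = backoff_delays(max_retries, base, cap, rng)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; surface the last error
            sleep(delay)
```

The `sleep` parameter is injectable only so the retry loop can be exercised without actually waiting.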

      4 votes
    3. Greg
      Huh, that is interesting - like you said above, I'm very surprised that Cloudflare had a dependency on an external provider backing critical services rather than just as a failover for their own in-house systems.

      2 votes
    4. Wafik
      Thanks for posting this.

      1 vote
  5. [8]
    smores
    (edited )
    GCP is also experiencing outages (huge ones):

    https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW

    Nearly every product in maybe every region? I believe this is why Gemini is down, @hamstergeddon. I think this is unrelated to the Cloudflare outage (or at least, these two service providers don't depend on each other in any way that I'm aware of).

    Found this out because both Expo and NPM became unresponsive at the same time. Guess I'm all done with side projects for the day!

    Edit: actually, the GCP outage appears to be the root cause of the Claude outage as well! https://status.anthropic.com/

    9 votes
    1. Akir
      The double outage actually caused most of the internal applications at my work to go down today! It's crazy.

      4 votes
    2. [3]
      hamstergeddon
      oh wow double outages! I just assumed cloudflare was sitting in front of Claude, although I guess it would be odd if Gemini did the same given that it's google's service. Thanks for the info!

      3 votes
      1. [2]
        Greg
        "oh wow double outages!"

        I’m wondering if it was some kind of cascading failure? Presumably a significant Google outage could shift an absolute ton of traffic to Cloudflare (or vice versa) if companies are using both for redundancy.

        7 votes
        1. hamstergeddon
          Ooo maybe. Looking forward to reading the postmortem on this for sure.

          5 votes
    3. [2]
      Luna
      Talk about a stressful afternoon! I first noticed when our integration tests started failing which I initially brushed off as a transient error, at least until I retried it multiple times and kept getting the same errors from Google Cloud Storage. Then I got pinged that our Apigee portal was offline...

      Thankfully, the Apigee portal was our only outage (the proxying continued working as normal), but it was quite nerve-racking since if anything did go down, we were powerless. We couldn't even SSH into our nodes to manually read our logs, much less remediate anything.

      Perhaps the most surprising revelation from all of this is that Cloudflare relies on GCP. Granted, it's not as surprising as it would be if AWS or Azure did (Cloudflare's offerings are nowhere near as extensive, so it makes some sense that they would outsource some tasks), but I always imagined each major cloud provider as being its own silo for whatever reason.

      3 votes
      1. smores
        Yeah, that was news to me as well! I was similarly surprised to learn that Cloudflare wasn't managing all of its own hardware. I wonder if that will continue to be true going forward — this was a pretty bad one!

    4. Greg
      Yup, I had an unproductive half hour trying to figure out how I’d broken a test script before figuring out that the GCP artefact registry itself was mostly returning 500 errors.

      2 votes
  6. [4]
    cfabbro
    Mirror: https://archive.is/aauSV
    6 votes
    1. [3]
      Wafik
      Thanks for the mirror. Was my link throwing up pay walls?

      2 votes
      1. [2]
        cfabbro
        I did manage to read it, but it also popped up with:

        "This is your last free article.
        Subscribe today to keep reading. For only $1.50/week, unlock a world of unlimited insights and benefits to help you connect, grow and make an impact."

        So I assume that means it is actually paywalled and some people might not be able to access it.

        3 votes
        1. Wafik
          Great, appreciate you!

          1 vote
  7. Eji1700
    (edited )
    I seem to either have totally missed this, or be totally insulated. Maaaybe some weird powerbi behavior? I'm sorta shocked because this does seem major.

    5 votes
  8. JCPhoenix
    OK I was wondering if some big outage was going on. Was driving home from work and Spotify wasn't working on my phone. At first I thought it was my personal VPN, but turns out I didn't even have it on. Then I thought it was my cell reception, so I switched carriers. Nope, was still having problems even on the other carrier. Luckily I have tons of Spotify songs downloaded to my phone.

    4 votes
  9. artvandelay
    I was affected by this outage earlier today at work. It was an interesting few hours not knowing if the alerts we were seeing were caused by issues on our end or not. I haven't read too much about this, but I read that the cause was a GCP outage, which is also why many of Google's services went down. It's always interesting seeing just how interconnected web services are. AWS is known for basically powering half the internet, but GCP and other cloud providers are also pretty widely used.

    4 votes
  10. MimicSquid
    Yeah, my payroll company was affected. Fortunately someone else was able to log in to run payroll this afternoon, but it kept on endlessly loading urgent pages. It's wild how much modern civilization depends on these things that seem held together with chewing gum and baling wire.

    4 votes
  11. tape
    Ah that explains my discord not logging in on desktop but fine on mobile I bet. TY

    2 votes