Cloudflare is down causing multiple services to break
Link information (scraped automatically; may be incorrect)
- Title: Internet Outage: Google, Amazon And More Experience Connection Issues
- Authors: Antonio Pequeño IV
- Word count: 73 words
And this is why hospitals shouldn't migrate their workloads to GCP. This is the second major incident in the last 30 days, and both times we've had end users calling our Help Desk asking why our app is down, instead of contacting their own Help Desks because the wait time for hospital IT was too long. 🤔
Healthcare is way too vulnerable to shady IT consultants bragging about how much they can cut costs. We've found that critical apps being moved to GCP is a Bad Sign™.
I take it you put stock in AWS instead? I do, as well (literally), but this whole putting critical infrastructure that determines if people live or die into the hands of private companies is not a good path forward.
I learned my trade in the days of company-owned infrastructure. You had at least some on-premises servers, and the "cloud" was an off-premises datacenter (or several geographically scattered ones) where you still owned the physical equipment and rented a private cage.
Downtime did happen. But it was usually mitigated by diversity of systems and software - catastrophes didn't tend to cascade through reliance on single universal tools. There was investment in reliability and redundancy of critical systems (and people!). Patches got staged so all the Citrix gateways didn't go down at once; a subtle Brocade bug only slowed down the most current generation of storage, and so on.
I don't know if the consolidation and homogenization of outsourced AWS, Azure, or Google Cloud services is as widespread outside the U.S. But our corporate tax laws made compute as an operating expense much more favorable than compute as a capital expense. So everything-as-a-service, with subscription costs instead of ownership costs, has taken over. Even though these services are opaque, less reliable, harder to manage properly, less secure...
[Insert old person yells at Clouds here.]
I think there's a healthy middle ground - I've worked for plenty of organisations that can't justify a dedicated sysadmin of any kind, let alone at the level Google or Amazon can afford to hire, so it's very much made sense to treat cloud pricing basically like sysadmin-as-a-service. On demand scaling to load is also one I'd be reluctant to give up!
But it's always gonna hinge on:
Because yeah, you can take pride in a bulletproof multi-cloud terraform setup that'll seamlessly load balance from one to the next, and won't even notice an outage beyond some of those allocation numbers shifting down to zero while others ramp up to compensate.
But you'll only get that after a lot of long conversations with bizdev about how and why we're sticking to nice, clean, industry standard software inside those cloud VMs and not shifting everything to whichever proprietary do-everything service the sales rep was pitching today.
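To make the weight-shifting idea concrete, here's a toy sketch in Python rather than Terraform. The provider names and health-check URLs are made up, and a real setup would do this through DNS weights or a load balancer rather than an in-process loop; this only shows the "allocation drops to zero during an outage" behavior.

```python
# Toy illustration of shifting traffic away from an unhealthy provider.
# Hypothetical health-check endpoints; not anyone's real setup.
import urllib.request

PROVIDERS = {
    "gcp": "https://gcp-ingress.example.com/healthz",  # hypothetical
    "aws": "https://aws-ingress.example.com/healthz",  # hypothetical
}

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def traffic_weights() -> dict[str, float]:
    """Split traffic evenly across healthy providers; unhealthy ones drop to 0."""
    healthy = [name for name, url in PROVIDERS.items() if is_healthy(url)]
    if not healthy:
        # Everything is down: keep last-known weights and page a human.
        return {name: 0.0 for name in PROVIDERS}
    share = 1.0 / len(healthy)
    return {name: (share if name in healthy else 0.0) for name in PROVIDERS}

print(traffic_weights())  # e.g. {'gcp': 0.0, 'aws': 1.0} during a GCP outage
```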
My analytical lens is based in the world of healthcare, which is a vast conglomeration of edge cases. Everything from bog-standard Windows/MS-SQL applications, to weird hybrid PACS [Windows app servers and custom Linux container appliances with fancy storage fabrics and high-capacity optical switches], to ancient VAX/VMS and old VB6 and MS-DOS programs that no one has figured out how to replace, to the endlessly proprietary universes of Epic and other EMRs.
I've worked for a couple of organizations that went gung-ho in trying to get everything off-prem, and laid off most of their in-house expertise. I'll acknowledge that's made me (perhaps excessively) cynical about the results, but I think there's been some willful blindness to the needs of different application and network architectures.
I feel that! It’s so often “let’s uncritically accept the vendor promises and save on paying those pesky salaries” and so rarely “let’s listen to the people we’re paying those salaries to and see how we can make things more efficient in the long term”.
Would it be better to have each hospital with its own data center? Or maybe each town?
That's a hard call. I've worked with enough different U.S. health systems to have observed wildly varying capacity for managing technology services. Some systems are writing their own software and developing new technologies down to the firmware and chips (usually very well funded private university research institutions). Some (usually rural regional hospitals) can barely fund outsourced IT managed from Chennai.
I wouldn't mind seeing an open-source "Datacenter in a Box", the conceptual equivalent of the old "Internet in a Box". This would consist of a modular, standardized shipping container with a given amount of vCPU, vRAM, bandwidth, etc. You could run whatever loads on it you wanted to keep local, but mesh that container with a national network so that spare capacity can be distributed and data stored in geographically redundant locations.
Keep the operating systems minimal, roll updates to fractional portions of the mesh, have regional and national publicly funded management teams... A socialist can dream.
Each hospital should be able to run its essential functions without the internet. It's not happening, but the proper solution would be a socialized basic network.
Are they not already able to practice without the internet? Charting is important, but it's secondary in life-or-death situations.
Without internet, any kind of digital charting can't be trusted since it can't sync across locations. For example, if a patient gets a prescription from a doctor then ends up in the ER, incoherent due to a side effect, the hospital can't trust the medical records have up-to-date prescription information.
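To put the trust problem in concrete terms, here's a toy sketch of the staleness check. This is nothing like how a real EMR syncs (HL7/FHIR feeds, conflict resolution, audit trails); the function and field names are invented purely to illustrate why an offline copy gets flagged.

```python
from datetime import datetime, timedelta, timezone

def meds_list_trustworthy(last_successful_sync: datetime,
                          connectivity_lost_at: datetime | None) -> bool:
    """An offline copy is only as current as its last sync before the outage."""
    if connectivity_lost_at is None:
        return True  # still online; records assumed current
    return last_successful_sync >= connectivity_lost_at

now = datetime.now(timezone.utc)
# Last sync six hours ago, connectivity lost an hour ago: the ER can't assume
# the list reflects a prescription written at another location after that sync.
print(meds_list_trustworthy(now - timedelta(hours=6), now - timedelta(hours=1)))  # False
```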
It's not just electronic medical records.
I spend a substantial chunk of my time these days on interfaces, so I've had to work with nurse call vendors on hospital-mandated cloud migration projects.
Consider the hubris involved in assuming that your on-premises nurse call system will communicate just fine with its cloud-hosted servers, given the latencies of geographic distance, and firewall policies for the dozens (!) of TCP port allows, with various layers of NAT. Guess how reliably those old Windows apps behave with elastic CPU, RAM, and storage provisioning. [Some nurse call vendors are in the process of releasing versions that are designed for PaaS, but the cost of ownership is substantially higher, so it might be a decade or more before they're the norm.]
That's just one safety-critical infrastructure system that's really not well suited for AWS, Azure, or Google Cloud hosting. One project I'm aware of had to roll back to on-prem infrastructure three times, and hasn't migrated successfully yet.
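For a sense of what "dozens of TCP port allows" means in practice, a rough pre-migration check might just walk the vendor's port list and test reachability and latency against a budget. The host, port list, and latency budget below are purely illustrative, not any vendor's actual requirements, and this ignores NAT traversal entirely.

```python
import socket
import time

CLOUD_HOST = "nursecall.example-vendor.com"  # hypothetical
REQUIRED_PORTS = [443, 1433, 5671, 8883]     # illustrative, not a real spec
LATENCY_BUDGET_MS = 150

def check_port(host: str, port: int, timeout: float = 3.0):
    """Return (reachable, connect latency in ms) for a single TCP port."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000.0
    except OSError:
        return False, None

for port in REQUIRED_PORTS:
    ok, latency_ms = check_port(CLOUD_HOST, port)
    if not ok:
        print(f"port {port}: blocked or unreachable (check firewall/NAT)")
    elif latency_ms > LATENCY_BUDGET_MS:
        print(f"port {port}: reachable but {latency_ms:.0f} ms > {LATENCY_BUDGET_MS} ms budget")
    else:
        print(f"port {port}: ok ({latency_ms:.0f} ms)")
```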
Pretty sure it was me guys.
I just built a flashcard app using React and did a pretty shit job overall. It took me about two hours and I used AI (heavily) to build the prototype version.
It was pretty soon after that firebase went down.
But my Sp.Ed. students are going to fail the social studies final slightly less than they would have otherwise, so I think it was worth a bit of a worldwide outage.
I actually thought it was me, because I just started a new job as a website reliability engineer...
Can I give you a list of dates and times when I'd rather not work too hard, so you can attempt this project again?
Absolutely. I'll be off this summer and my plan is to sink so much time and energy into coding educational apps that my students almost don't fail next year. Win win!
Claude and Gemini (AI chatbot agent buzzword things) are down as a result. I almost had to write code like a caveman, but thank the code gods ChatGPT still works. (mostly /s, but they are down)
What am I supposed to do, write my unit tests manually?
Cloudflare's postmortem: https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/
GCP says a full post-mortem for their outage is forthcoming in the next few days. I suspect that will be an interesting read.
It makes sense that they depend on a key value data store somewhere considering the ultra-low latencies they're targeting. Key value data stores are theoretically simple to implement, but they're devilishly complicated to scale and maintain. The complicated design trade-offs are not obvious until you scale to billions of records with global reads and writes.
I can't blame them for using a tried-and-true offering from a well-known cloud platform. Though after this outage, they're certainly going to be speeding up that migration to an internal solution.
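The "simple to implement, hard to scale" point is easy to see in code: the entire single-node API fits in a handful of lines. This is a toy sketch, nothing like Workers KV internals; all of the actual difficulty lives in what it leaves out once reads and writes go global.

```python
# A single-node key-value store is almost trivially small...
class ToyKV:
    def __init__(self):
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes | None:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

# ...and none of this touches the hard parts a global service actually needs:
# cross-region replication, consistency trade-offs (read-your-writes vs. eventual),
# hot-key mitigation, and surviving an entire provider going down.
kv = ToyKV()
kv.put("session:abc", b"token")
print(kv.get("session:abc"))  # b'token'
```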
Here is Google’s postmortem:
https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW
Null pointer exception caused by unexpected blank fields in a configuration for a quota system, combined with a lack of other safety features that would normally mitigate it: apparently the config came from a Spanner database that was instantly replicated, instead of normal config that would be rolled out gradually. The functionality wasn’t hidden behind a feature flag. A code change was rolled out within 40 minutes, but a lack of exponential backoff caused a thundering herd problem, delaying recovery.
I’m sure all of that will be fixed, and the next time something like this happens it will be in some unrelated system where best practices were somehow ignored.
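On the thundering-herd point specifically, the usual mitigation is boring retry discipline on the client side. A minimal, generic sketch of exponential backoff with full jitter (not Google's code; the quota_service call in the usage comment is hypothetical):

```python
import random
import time

def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry fn() with exponential backoff plus full jitter, so a fleet of
    clients recovering from an outage doesn't hammer the service in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at max_delay, fully randomized so
            # retries from many clients spread out instead of synchronizing.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Usage: call_with_backoff(lambda: quota_service.check(request))  # hypothetical service
```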
Huh, that is interesting - like you said above, I'm very surprised that Cloudflare had a dependency on an external provider backing critical services rather than just as a failover for their own in-house systems.
Thanks for posting this.
GCP is also experiencing outages (huge ones):
https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW
Nearly every product in maybe every region? I believe this is why Gemini is down, @hamstergeddon. I think this is unrelated to the Cloudflare outage (or at least, these two service providers don't depend on each other in any way that I'm aware of).
Found this out because both Expo and NPM became unresponsive at the same time. Guess I'm all done with side projects for the day!
Edit: actually, the GCP outage appears to be the root cause of the Claude outage as well! https://status.anthropic.com/
The double outage actually caused most of the internal applications at my work to go down today! It's crazy.
Oh wow, double outages! I just assumed Cloudflare was sitting in front of Claude, although I guess it would be odd if Gemini did the same given that it's Google's service. Thanks for the info!
I’m wondering if it was some kind of cascading failure? Presumably a significant Google outage could shift an absolute ton of traffic to Cloudflare (or vice versa) if companies are using both for redundancy.
Ooo maybe. Looking forward to reading the postmortem on this for sure.
Talk about a stressful afternoon! I first noticed when our integration tests started failing which I initially brushed off as a transient error, at least until I retried it multiple times and kept getting the same errors from Google Cloud Storage. Then I got pinged that our Apigee portal was offline...
Thankfully, the Apigee portal was our only outage (the proxying continued working as normal), but it was quite nerve-racking since if anything did go down, we were powerless. We couldn't even SSH into our nodes to manually read our logs, much less remediate anything.
Perhaps the most surprising revelation from all of this is that Cloudflare relies on GCP. Granted, it's not as surprising as it would be for AWS or Azure (Cloudflare's offerings are nowhere near as extensive, so it makes some sense that they would outsource some tasks), but I always imagined each major cloud provider as being its own silo, for whatever reason.
Yeah, that was news to me as well! I was similarly surprised to learn that Cloudflare wasn't managing all of its own hardware. I wonder if that will continue to be true going forward — this was a pretty bad one!
Yup, I had an unproductive half hour trying to figure out how I’d broken a test script before figuring out that the GCP Artifact Registry itself was mostly returning 500 errors.
Mirror: https://archive.is/aauSV
Thanks for the mirror. Was my link throwing up paywalls?
I did manage to read it, but it also popped up with:
So I assume that means it is actually paywalled and some people might not be able to access it.
Great, appreciate you!
I seem to have either totally missed this or been totally insulated. Maaaybe some weird Power BI behavior? I'm sorta shocked, because this does seem major.
OK, I was wondering if some big outage was going on. I was driving home from work and Spotify wasn't working on my phone. At first I thought it was my personal VPN, but it turns out I didn't even have it on. Then I thought it was my cell reception, so I switched carriers. Nope, still having problems even on the other carrier. Luckily I have tons of Spotify songs downloaded to my phone.
I was affected by this outage earlier today at work. It was an interesting few hours not knowing whether the alerts we were seeing were caused by issues on our end or not. I haven't read too much about this, but I read that the reason for this outage was a GCP outage, which is also why many of Google's services faced outages. It's always interesting seeing just how interconnected web services are. AWS is known for basically powering half the internet, but GCP, among other cloud providers, is also pretty widely used.
Yeah, my payroll company was affected. Fortunately someone else was able to log in to run payroll this afternoon, but it kept endlessly loading on urgent pages. It's wild how much modern civilization depends on these things that seem held together with chewing gum and baling wire.
Ah, that explains my Discord not logging in on desktop but working fine on mobile, I bet. TY