Cloudflare is down causing multiple services to break
Link information (scraped automatically; may be incorrect)
- Title: Internet Outage: Google, Amazon And More Experience Connection Issues
- Authors: Antonio Pequeño IV
- Word count: 73 words
And this is why hospitals shouldn't migrate their workloads to GCP. This is the second major incident in the last 30 days, and both times we've had end users calling our Help Desk asking why our app is down, instead of contacting their own Help Desks because the wait time for hospital IT was too long. 🤔
Healthcare is way too vulnerable to shady IT consultants bragging about how much they can cut costs. We've found that critical apps being moved to GCP is a Bad Sign™.
I take it you put stock in AWS instead? I do, as well (literally), but this whole putting critical infrastructure that determines if people live or die into the hands of private companies is not a good path forward.
I learned my trade in the days of company-owned infrastructure. You had at least some on-premises servers, and the "cloud" was an off-premises datacenter (or several geographically scattered ones) where you still owned the physical equipment and rented a private cage.
Downtime did happen. But it was usually mitigated by diversity of systems and software - catastrophes didn't tend to cascade through reliance on single universal tools. There was investment in reliability and redundancy of critical systems (and people!). Patches got staged so all the Citrix gateways didn't go down at once; a subtle Brocade bug only slowed down the most current generation of storage, and so on.
I don't know if the consolidation and homogenization of outsourced AWS, Azure, or Google Cloud services is as widespread outside the U.S. But our corporate tax laws made compute as an operating expense much more favorable than compute as a capital expense. So everything-as-a-service, with subscription costs instead of ownership costs, has taken over. Even though these services are opaque, less reliable, harder to manage properly, less secure...
[Insert old person yells at Clouds here.]
I think there's a healthy middle ground - I've worked for plenty of organisations that can't justify a dedicated sysadmin of any kind, let alone at the level Google or Amazon can afford to hire, so it's very much made sense to treat cloud pricing basically like sysadmin-as-a-service. On demand scaling to load is also one I'd be reluctant to give up!
But it's always gonna hinge on:
Because yeah, you can take pride in a bulletproof multi-cloud terraform setup that'll seamlessly load balance from one to the next, and won't even notice an outage beyond some of those allocation numbers shifting down to zero while others ramp up to compensate.
But you'll only get that after a lot of long conversations with bizdev about how and why we're sticking to nice, clean, industry standard software inside those cloud VMs and not shifting everything to whichever proprietary do-everything service the sales rep was pitching today.
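To make the weight-shifting idea concrete, here's a toy sketch in Python rather than Terraform. The provider names and health-check URLs are made up, and a real setup would do this through DNS weights or a load balancer rather than an in-process loop; this only shows the "allocation drops to zero during an outage" behavior.

```python
# Toy illustration of shifting traffic away from an unhealthy provider.
# Hypothetical health-check endpoints; not anyone's real setup.
import urllib.request

PROVIDERS = {
    "gcp": "https://gcp-ingress.example.com/healthz",  # hypothetical
    "aws": "https://aws-ingress.example.com/healthz",  # hypothetical
}

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def traffic_weights() -> dict[str, float]:
    """Split traffic evenly across healthy providers; unhealthy ones drop to 0."""
    healthy = [name for name, url in PROVIDERS.items() if is_healthy(url)]
    if not healthy:
        # Everything is down: keep last-known weights and page a human.
        return {name: 0.0 for name in PROVIDERS}
    share = 1.0 / len(healthy)
    return {name: (share if name in healthy else 0.0) for name in PROVIDERS}

print(traffic_weights())  # e.g. {'gcp': 0.0, 'aws': 1.0} during a GCP outage
```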
My analytical lens is based in the world of healthcare, which is a vast conglomeration of edge cases. Everything from bog-standard Windows/MS-SQL applications, to weird hybrid PACS [Windows app servers and custom Linux container appliances with fancy storage fabrics and high-capacity optical switches], to ancient VAX/VMS and old VB6 and MS-DOS programs that no one has figured out how to replace, to the endlessly proprietary universes of Epic and other EMRs.
I've worked for a couple of organizations that went gung-ho in trying to get everything off-prem, and laid off most of their in-house expertise. I'll acknowledge that's made me (perhaps excessively) cynical about the results, but I think there's been some willful blindness to the needs of different application and network architectures.
I feel that! It’s so often “let’s uncritically accept the vendor promises and save on paying those pesky salaries” and so rarely “let’s listen to the people we’re paying those salaries to and see how we can make things more efficient in the long term”.
Would it be better to have each hospital with its own data center? Or maybe each town?
That's a hard call. I've worked with enough different U.S. health systems to have observed wildly varying capacity for managing technology services. Some systems are writing their own software and developing new technologies down to the firmware and chips (usually very well funded private university research institutions). Some (usually rural regional hospitals) can barely fund outsourced IT managed from Chennai.
I wouldn't mind seeing an open-source "Datacenter in a Box", the conceptual equivalent of the old "Internet in a Box". This would consist of a modular, standardized shipping container with a given amount of vCPU, vRAM, bandwidth, etc. You could run whatever loads on it you wanted to keep local, but mesh that container with a national network so that spare capacity can be distributed and data stored in geographically redundant locations.
Keep the operating systems minimal, roll updates to fractional portions of the mesh, have regional and national publicly funded management teams... A socialist can dream.
Each hospital should be able to run its essential functions without the internet. It's not happening, but the proper solution would be a socialized basic network.
Are they not already able to practice without the internet? Charting is important, but it's secondary in life-or-death situations.
Without internet, any kind of digital charting can't be trusted since it can't sync across locations. For example, if a patient gets a prescription from a doctor then ends up in the ER, incoherent due to a side effect, the hospital can't trust the medical records have up-to-date prescription information.
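To put the trust problem in concrete terms, here's a toy sketch of the staleness check. This is nothing like how a real EMR syncs (HL7/FHIR feeds, conflict resolution, audit trails); the function and field names are invented purely to illustrate why an offline copy gets flagged.

```python
from datetime import datetime, timedelta, timezone

def meds_list_trustworthy(last_successful_sync: datetime,
                          connectivity_lost_at: datetime | None) -> bool:
    """An offline copy is only as current as its last sync before the outage."""
    if connectivity_lost_at is None:
        return True  # still online; records assumed current
    return last_successful_sync >= connectivity_lost_at

now = datetime.now(timezone.utc)
# Last sync six hours ago, connectivity lost an hour ago: the ER can't assume
# the list reflects a prescription written at another location after that sync.
print(meds_list_trustworthy(now - timedelta(hours=6), now - timedelta(hours=1)))  # False
```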
It's not just electronic medical records.
I spend a substantial chunk of my time these days on interfaces, so I've had to work with nurse call vendors on hospital-mandated cloud migration projects.
Consider the hubris involved in assuming that your on-premises nurse call system will communicate just fine with its cloud-hosted servers, given the latencies of geographic distance, and firewall policies for the dozens (!) of TCP port allows, with various layers of NAT. Guess how reliably those old Windows apps behave with elastic CPU, RAM, and storage provisioning. [Some nurse call vendors are in the process of releasing versions that are designed for PaaS, but the cost of ownership is substantially higher, so it might be a decade or more before they're the norm.]
That's just one safety-critical infrastructure system that's really not well suited for AWS, Azure, or Google Cloud hosting. One project I'm aware of had to roll back to on-prem infrastructure three times, and hasn't migrated successfully yet.
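For a sense of what "dozens of TCP port allows" means in practice, a rough pre-migration check might just walk the vendor's port list and test reachability and latency against a budget. The host, port list, and latency budget below are purely illustrative, not any vendor's actual requirements, and this ignores NAT traversal entirely.

```python
import socket
import time

CLOUD_HOST = "nursecall.example-vendor.com"  # hypothetical
REQUIRED_PORTS = [443, 1433, 5671, 8883]     # illustrative, not a real spec
LATENCY_BUDGET_MS = 150

def check_port(host: str, port: int, timeout: float = 3.0):
    """Return (reachable, connect latency in ms) for a single TCP port."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.monotonic() - start) * 1000.0
    except OSError:
        return False, None

for port in REQUIRED_PORTS:
    ok, latency_ms = check_port(CLOUD_HOST, port)
    if not ok:
        print(f"port {port}: blocked or unreachable (check firewall/NAT)")
    elif latency_ms > LATENCY_BUDGET_MS:
        print(f"port {port}: reachable but {latency_ms:.0f} ms > {LATENCY_BUDGET_MS} ms budget")
    else:
        print(f"port {port}: ok ({latency_ms:.0f} ms)")
```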
Pretty sure it was me guys.
I just built a flashcard app using React and did a pretty shit job overall. It took me about two hours and I used AI (heavily) to build the prototype version.
It was pretty soon after that firebase went down.
But my Sp.Ed. students are going to fail the social studies final slightly less than they would have otherwise, so I think it was worth a bit of a worldwide outage.
I actually thought it was me, because I just started a new job as a website reliability engineer...
Can I give you a list of dates and times when I'd rather not work too hard, so you can attempt this project again?
Absolutely. I'll be off this summer and my plan is to sink so much time and energy into coding educational apps that my students almost don't fail next year. Win win!
Claude and Gemini (AI chatbot agent buzzword things) are down as a result. I almost had to write code like a caveman, but thank the code gods ChatGPT still works. (mostly /s, but they are down)
What am I supposed to do, write my unit tests manually?
Cloudflare's postmortem: https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/
GCP says a full post-mortem for their outage is forthcoming in the next few days. I suspect that will be an interesting read.
It makes sense that they depend on a key value data store somewhere considering the ultra-low latencies they're targeting. Key value data stores are theoretically simple to implement, but they're devilishly complicated to scale and maintain. The complicated design trade-offs are not obvious until you scale to billions of records with global reads and writes.
I can't blame them for using a tried-and-true offering from a well-known cloud platform. Though after this outage, they're certainly going to be speeding up that migration to an internal solution.
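The "simple to implement, hard to scale" point is easy to see in code: the entire single-node API fits in a handful of lines. This is a toy sketch, nothing like Workers KV internals; all of the actual difficulty lives in what it leaves out once reads and writes go global.

```python
# A single-node key-value store is almost trivially small...
class ToyKV:
    def __init__(self):
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes | None:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

# ...and none of this touches the hard parts a global service actually needs:
# cross-region replication, consistency trade-offs (read-your-writes vs. eventual),
# hot-key mitigation, and surviving an entire provider going down.
kv = ToyKV()
kv.put("session:abc", b"token")
print(kv.get("session:abc"))  # b'token'
```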
Here is Google’s postmortem:
https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW
Null pointer exception caused by unexpected blank fields in a configuration for a quota system, combined with a lack of other safety features that would normally mitigate it: apparently the config came from a Spanner database that was instantly replicated, instead of normal config that would be rolled out gradually. The functionality wasn’t hidden behind a feature flag. A code change was rolled out within 40 minutes, but a lack of exponential backoff caused a thundering herd problem, delaying recovery.
I’m sure all of that will be fixed, and the next time something like this happens it will be in some unrelated system where best practices were somehow ignored.
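On the thundering-herd point specifically, the usual mitigation is boring retry discipline on the client side. A minimal, generic sketch of exponential backoff with full jitter (not Google's code; the quota_service call in the usage comment is hypothetical):

```python
import random
import time

def call_with_backoff(fn, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry fn() with exponential backoff plus full jitter, so a fleet of
    clients recovering from an outage doesn't hammer the service in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at max_delay, fully randomized so
            # retries from many clients spread out instead of synchronizing.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Usage: call_with_backoff(lambda: quota_service.check(request))  # hypothetical service
```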
Huh, that is interesting - like you said above, I'm very surprised that Cloudflare had a dependency on an external provider backing critical services rather than just as a failover for their own in-house systems.
Thanks for posting this.
GCP is also experiencing outages (huge ones):
https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW
Nearly every product in maybe every region? I believe this is why Gemini is down, @hamstergeddon. I think this is unrelated to the Cloudflare outage (or at least, these two service providers don't depend on each other in any way that I'm aware of).
Found this out because both Expo and NPM became unresponsive at the same time. Guess I'm all done with side projects for the day!
Edit: actually, the GCP outage appears to be the root cause of the Claude outage as well! https://status.anthropic.com/
The double outage actually caused most of the internal applications at my work to go down today! It's crazy.
Oh wow, double outages! I just assumed Cloudflare was sitting in front of Claude, although I guess it would be odd if Gemini did the same given that it's Google's service. Thanks for the info!
I’m wondering if it was some kind of cascading failure? Presumably a significant Google outage could shift an absolute ton of traffic to Cloudflare (or vice versa) if companies are using both for redundancy.
Ooo maybe. Looking forward to reading the postmortem on this for sure.
Talk about a stressful afternoon! I first noticed when our integration tests started failing which I initially brushed off as a transient error, at least until I retried it multiple times and kept getting the same errors from Google Cloud Storage. Then I got pinged that our Apigee portal was offline...
Thankfully, the Apigee portal was our only outage (the proxying continued working as normal), but it was quite nerve-racking since if anything did go down, we were powerless. We couldn't even SSH into our nodes to manually read our logs, much less remediate anything.
Perhaps the most surprising revelation from all of this is that Cloudflare relies on GCP. Granted, it's not as surprising as it would be for AWS or Azure (Cloudflare's offerings are nowhere near as extensive, so it makes some sense that they would outsource some tasks), but I always imagined each major cloud provider as being its own silo, for whatever reason.
Yeah, that was news to me as well! I was similarly surprised to learn that Cloudflare wasn't managing all of its own hardware. I wonder if that will continue to be true going forward — this was a pretty bad one!
Yup, I had an unproductive half hour trying to figure out how I’d broken a test script before figuring out that the GCP Artifact Registry itself was mostly returning 500 errors.
Mirror: https://archive.is/aauSV
Thanks for the mirror. Was my link throwing up paywalls?
I did manage to read it, but it also popped up with:
So I assume that means it is actually paywalled and some people might not be able to access it.
Great, appreciate you!
I seem to have either totally missed this or been totally insulated. Maaaybe some weird Power BI behavior? I'm sorta shocked, because this does seem major.
OK, I was wondering if some big outage was going on. I was driving home from work and Spotify wasn't working on my phone. At first I thought it was my personal VPN, but it turns out I didn't even have it on. Then I thought it was my cell reception, so I switched carriers. Nope, still having problems even on the other carrier. Luckily I have tons of Spotify songs downloaded to my phone.
I was affected by this outage earlier today at work. It was an interesting few hours not knowing whether the alerts we were seeing were caused by issues on our end or not. I haven't read too much about this, but I read that the reason for this outage was a GCP outage, which is also why many of Google's services faced outages. It's always interesting seeing just how interconnected web services are. AWS is known for basically powering half the internet, but GCP, among other cloud providers, is also pretty widely used.
Yeah, my payroll company was affected. Fortunately someone else was able to log in to run payroll this afternoon, but it kept endlessly loading on urgent pages. It's wild how much modern civilization depends on these things that seem held together with chewing gum and baling wire.
Ah, that explains my Discord not logging in on desktop but working fine on mobile, I bet. TY