53 votes

AWS outage impacts

30 comments

  1. zestier
    (edited )
    Link

    People talk a lot about the over-reliance on AWS and how it's crazy how much of the internet goes down when AWS does. What I want to talk about is AWS's over-reliance on itself.

    There are a few core services that seem somewhat unavoidable to depend on. For example, if you're making a service you probably need compute, and compute is allocated by EC2. That's not great, but it's fine as long as those cases remain limited to the lowest-level hardware abstractions that are going to be supported, because having multiple independent hardware coordinators sounds like it would just make a bigger mess.

    The bit that's gone insane, though, is higher-level interdependencies, which are often circular. Not only does this make bootstrapping new data centers and resolving outages harder, it makes an outage cascade further. These dependencies should be broken even if it means a bit more operational cost. At the very least they should be broken for lower-level services, to reduce the number of services that are capable of cascading their failures all the way down to core services.

    To frame it from the perspective of this outage, EC2 and CloudWatch depending on DDB is insane. Depending on the DDB code is fine, but depending on the same instances is insane because it means they cascade-fail. At the very least they should have independent deployments so that DDB going down doesn't take down core services like EC2 (a service DDB itself likely depends on for its own compute!). The fact that a DDB outage kills the monitoring tools in CloudWatch is ridiculous. Monitoring tools need to be independent so that you can, you know, monitor the impact of the outage on your infrastructure.
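
    To make that last point concrete, here's a minimal sketch (purely illustrative, not AWS's actual design; the table name, item schema, and spool path are all made up) of a metrics writer that degrades to a local buffer instead of failing outright when its DynamoDB dependency is unreachable:

    ```python
    # Illustrative only: a metrics writer that spools locally when its primary
    # store (DynamoDB) is unreachable, so the monitoring path keeps working
    # during an outage of the dependency instead of cascading the failure.
    import json
    import time

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    ddb = boto3.client("dynamodb", region_name="us-east-1")
    LOCAL_SPOOL = "/var/spool/metrics-fallback.jsonl"  # hypothetical path


    def record_metric(name: str, value: float) -> None:
        item = {
            "metric": {"S": name},
            "ts": {"N": str(int(time.time()))},
            "value": {"N": str(value)},
        }
        try:
            ddb.put_item(TableName="metrics", Item=item)  # hypothetical table
        except (BotoCoreError, ClientError):
            # Primary store is down: buffer locally and let dashboards read
            # from the buffer rather than going dark.
            with open(LOCAL_SPOOL, "a") as f:
                f.write(json.dumps(item) + "\n")
    ```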


    Also, you're welcome, anyone whose stuff started working right around 2:30 PST. I was up late last night waiting to monitor an annoyingly timed scheduled deployment until I gave up and went to bed around 2:20. According to communications, I went to bed just a few minutes too early to see my stuff come up, which obviously means my overnight YOLO deploy during an outage (since I didn't bother to roll it back) was important to recovery.

    18 votes
  2. [16]
    TheMediumJon
    Link

    So apparently the US-East AWS servers are having issues....

    In the spirit of discussion: how's this impacting you, if at all?

    Personally, Slack as well as a bunch of other company services are barely functional or not functional at all. Great fun.

    17 votes
    1. [3]
      Ozzy
      Link Parent

      Can’t access Signal, can’t access my bank account, can’t access a couple more things. I’m in Europe, damn you, AWS.

      14 votes
      1. Greg
        Link Parent

        Ohhhh, that's why Signal says I'm offline! Glad I don't have to go digging into firewall settings or anything.

        10 votes
      2. TheMediumJon
        Link Parent

        Oh man, I've not tried anything personal at all yet, actually.

        5 votes
    2. redwall_hp
      Link Parent

      I've barely slept. I got paged over and over all night from the metric monitors on my team's apps freaking out from zero traffic in our east region. Our product is chugging along, though, because we did a region failover.

      Basically a cycle of trying to nap for a few minutes, acknowledging a page, confirming it's the same thing, playing mute whack-a-mole on the monitor, resolving the page, catching up on what other teams are doing for the mitigation, repeat.

      And Slack is slow.

      12 votes
    3. [3]
      Macha
      Link Parent

      No Docker, no npm, Slack is a mess.

      IAM is causing problems at work even though we're in EU regions.

      11 votes
      1. [2]
        Crestwave
        Link Parent

        It's quite illuminating seeing how much infrastructure all over the world was affected by the outage of a single us-east-1 service. Hopefully some companies will consider lessening their dependency on a single region.

        16 votes
    4. shrike
      Link Parent

      Docker is down, Slack is slow.

      I'm on a long lunch break.

      7 votes
    5. blackforest
      Link Parent

      I can’t send or receive pictures on Snapchat since this morning. Messages on group chats work, but every sender is now marked as “Unknown Snapchatter” rather than their username, so I can’t follow the conversation.

      6 votes
    6. [3]
      Fal
      Link Parent

      Canvas is down I HAVE HOMEWORK TO SUBMIT AAAAAAAAAA

      6 votes
      1. [2]
        CannibalisticApple
        Link Parent

        I feel like that has to be forgivable by the instructors if it's not back up before the deadline.

        9 votes
        1. redwall_hp
          Link Parent
          "Amazon ate my homework."

          "Amazon ate my homework."

          17 votes
    7. erithaea
      Link Parent

      Atlassian auth services are/were down, so I couldn't access Jira at work for the entire morning. It's working again now (or at least it worked for long enough to allow me to log in) but since we're currently in a QA period where we have to create and review a lot of tickets, it was a pretty big showstopper.

      4 votes
    8. ras
      Link Parent

      really wild how long this has gone on. i just got back from taking my son to the ENT and their payment processing was down due to the outage. getting close to 11 hours since the first reports.

      1 vote
    9. lou
      Link Parent

      Mercado Livre, Brazil's eBay, cannot show when my delivery is arriving.

  3. trim
    Link

    I work in tech deploying software to customers through AWS. It's been a morning. Fortunately not too much impact on eu-west, but global services, our access to the console, and monitoring were affected. Kind of like being in the dark whilst fielding questions we can't answer about metrics we can't see.

    15 votes
  4. TaylorSwiftsPickles
    Link

    Almost no issues on my end. Everything's worked normally and I wouldn't even have noticed if 3 people hadn't tried to contact me on Signal when it was down. Works normally now.

    6 votes
  5. hamstergeddon
    (edited )
    Link

    It's times like these that I'm glad I never segued into devops. But then again, I'd have a bit more money to cry into if I had.

    Edit: as it stands I can't access my Jira board, which is going to make for an interesting stand-up meeting in about 15 minutes :)

    5 votes
  6. redwall_hp
    Link

    Fun take from The Register: Today is when the Amazon brain drain finally sent AWS down the spout

    Good to call out the three solid years of layoffs and forced attrition through return-to-office policies at Amazon...

    5 votes
  7. zestier
    Link

    Since it is marked as resolved, here's probably the final update until they release a full writeup.

    Oct 20 3:53 PM PDT Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time. At 12:26 AM on October 20, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints. After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB. As we continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch. We recovered the Network Load Balancer health checks at 9:38 AM. As part of the recovery effort, we temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations. Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered. By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours. We will share a detailed AWS post-event summary.
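
    For anyone curious what "DNS resolution issues for the regional DynamoDB service endpoints" looks like from the outside, a plain resolution check like this (nothing AWS-specific, just the public endpoint name) is enough to tell a DNS failure apart from an API-level error:

    ```python
    # Resolve the regional DynamoDB endpoint directly to distinguish
    # "DNS is broken" from "the API is returning errors".
    import socket

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)})
        print(f"{ENDPOINT} resolves to: {addrs}")
    except socket.gaierror as exc:
        # During the outage window, callers would have hit something like this.
        print(f"DNS resolution failed for {ENDPOINT}: {exc}")
    ```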

    3 votes
  8. [3]
    Eji1700
    Link

    As a shop that uses Azure instead of AWS, bullet dodged internally, BUT naturally our 3rd-party vendors use AWS, so it's still a mess from upstream data.

    2 votes
    1. [2]
      tanglisha
      Link Parent
      Here’s an article from 2019 by someone who tried to completely block Amazon from their life.
      5 votes
      1. CannibalisticApple
        Link Parent

        Wow. That was a fascinating read, and probably worthy of its own individual topic. It highlights that Amazon really is too powerful, given how many services it can impact.

        3 votes
  9. [5]
    TheFireTheft
    Link

    I couldn't find anything in the article about the root cause, but let me guess:

    Config related?

    Did somebody mess up a YAML file?

    2 votes
    1. [4]
      zestier
      (edited )
      Link Parent

      https://health.aws.amazon.com/health/status has a handful of updates. The one that most directly answers your question is:

      Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery. This issue also affects other AWS Services in the US-EAST-1 Region. Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues. During this time, customers may be unable to create or update Support Cases. We recommend customers continue to retry any failed requests. We will continue to provide updates as we have more information to share, or by 2:45 AM.

      I'm reasonably confident that the teams in charge of DDB are in Seattle, so it's very unlikely they were doing any nighttime deployments of any kind (especially on a Sunday night).
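
      Also, since the update tells customers to "continue to retry any failed requests", the boring client-side answer is retries with capped exponential backoff and jitter; a minimal sketch (the wrapped call is just a placeholder, not a specific AWS API):

      ```python
      # Minimal retry helper for the "keep retrying failed requests" advice.
      # The operation being wrapped is a placeholder, not a specific AWS call.
      import random
      import time


      def retry_with_backoff(op, attempts=5, base=0.5, cap=30.0):
          for attempt in range(attempts):
              try:
                  return op()
              except Exception:
                  if attempt == attempts - 1:
                      raise
                  # Full jitter: sleep a random amount up to the capped exponential.
                  time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


      # Usage: retry_with_backoff(lambda: some_failing_sdk_call())
      ```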

      3 votes
      1. [3]
        first-must-burn
        Link Parent

        It's always DNS.

        11 votes
        1. [2]
          trim
          Link Parent

          BGP enters stage left, glancing furtively.

          4 votes
          1. first-must-burn
            Link Parent

            DNS stands on BGP's shoulders in a trench coat, and they try to pass as an application level error.

            3 votes