53 votes

AWS outage impacts

30 comments

  1. zestier
    (edited )
    Link

    People talk a lot about the over-reliance on AWS and how it's crazy how much of the internet goes down when AWS does. What I want to talk about is AWS's over-reliance on itself.

    There are a few core services that seem somewhat unavoidable to depend on. For example, if you're making a service you probably need compute, and compute is allocated by EC2. That's not great, but it's fine as long as those cases remain limited to the lowest-level hardware abstractions that are going to be supported, because having multiple independent hardware coordinators sounds like it would just make a bigger mess.

    The bit that's gone insane, though, is higher-level interdependencies, which are often circular. Not only does this make bootstrapping new data centers and resolving outages harder, it makes an outage cascade further. These dependencies should be broken even if it means a bit more operational cost. At the very least they should be broken for lower-level services, to reduce the number of services that are capable of cascading their failures all the way down to core services.

    To frame it from the perspective of this outage, EC2 and CloudWatch depending on DDB is insane. Depending on the DDB code is fine, but depending on the same instances is insane because it means they cascade-fail. At the very least they should have independent deployments so that DDB going down doesn't take down core services like EC2 (a service DDB itself likely depends on for its own compute!). The fact that a DDB outage kills the monitoring tools in CloudWatch is ridiculous. Monitoring tools need to be independent so that you can, you know, monitor the impact of the outage on your infrastructure.
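
    To make that last point concrete, here's a minimal sketch (purely illustrative, not AWS's actual design; the table name, item schema, and spool path are all made up) of a metrics writer that degrades to a local buffer instead of failing outright when its DynamoDB dependency is unreachable:

    ```python
    # Illustrative only: a metrics writer that spools locally when its primary
    # store (DynamoDB) is unreachable, so the monitoring path keeps working
    # during an outage of the dependency instead of cascading the failure.
    import json
    import time

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    ddb = boto3.client("dynamodb", region_name="us-east-1")
    LOCAL_SPOOL = "/var/spool/metrics-fallback.jsonl"  # hypothetical path


    def record_metric(name: str, value: float) -> None:
        item = {
            "metric": {"S": name},
            "ts": {"N": str(int(time.time()))},
            "value": {"N": str(value)},
        }
        try:
            ddb.put_item(TableName="metrics", Item=item)  # hypothetical table
        except (BotoCoreError, ClientError):
            # Primary store is down: buffer locally and let dashboards read
            # from the buffer rather than going dark.
            with open(LOCAL_SPOOL, "a") as f:
                f.write(json.dumps(item) + "\n")
    ```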


    Also, you're welcome, anyone whose stuff started working right around 2:30 PST. I was up late last night waiting to monitor an annoyingly timed scheduled deployment until I gave up and went to bed around 2:20. According to communications, I went to bed just a few minutes too early to see my stuff come up, which obviously means my overnight YOLO deploy during an outage (since I didn't bother to roll it back) was important to recovery.

    18 votes
  2. [16]
    TheMediumJon
    Link

    So apparently the US-East AWS servers are having issues....

    In the spirit of discussion: how's this impacting you, if at all?

    Personally, Slack as well as a bunch of other company services are barely functional or not functional at all. Great fun.

    17 votes
    1. [3]
      Ozzy
      Link Parent

      Can’t access Signal, can’t access my bank account, can’t access a couple more things. I’m in Europe, damn you, AWS.

      14 votes
      1. Greg
        Link Parent

        Ohhhh, that's why Signal says I'm offline! Glad I don't have to go digging into firewall settings or anything.

        10 votes
      2. TheMediumJon
        Link Parent

        Oh man, I've not tried anything personal at all yet, actually.

        5 votes
    2. redwall_hp
      Link Parent

      I've barely slept. I got paged over and over all night from the metric monitors on my team's apps freaking out from zero traffic in our east region. Our product is chugging along, though, because we did a region failover.

      Basically a cycle of trying to nap for a few minutes, acknowledging a page, confirming it's the same thing, playing mute whack-a-mole on the monitor, resolving the page, catching up on what other teams are doing for the mitigation, repeat.

      And Slack is slow.

      12 votes
    3. [3]
      Macha
      Link Parent

      No Docker, no npm, Slack is a mess.

      IAM is causing problems at work even though we're in EU regions.

      11 votes
      1. [2]
        Crestwave
        Link Parent

        It's quite illuminating seeing how much infrastructure all over the world was affected by the outage of a single us-east-1 service. Hopefully some companies will consider lessening their dependency on a single region.

        16 votes
    4. shrike
      Link Parent

      Docker is down, Slack is slow.

      I'm on a long lunch break.

      7 votes
    5. blackforest
      Link Parent

      I can’t send or receive pictures on Snapchat since this morning. Messages on group chats work, but every sender is now marked as “Unknown Snapchatter” rather than their username, so I can’t follow the conversation.

      6 votes
    6. [3]
      Fal
      Link Parent

      Canvas is down I HAVE HOMEWORK TO SUBMIT AAAAAAAAAA

      6 votes
      1. [2]
        CannibalisticApple
        Link Parent

        I feel like that has to be forgivable by the instructors if it's not back up before the deadline.

        9 votes
        1. redwall_hp
          Link Parent
          "Amazon ate my homework."

          "Amazon ate my homework."

          17 votes
    7. erithaea
      Link Parent

      Atlassian auth services are/were down, so I couldn't access Jira at work for the entire morning. It's working again now (or at least it worked for long enough to allow me to log in) but since we're currently in a QA period where we have to create and review a lot of tickets, it was a pretty big showstopper.

      4 votes
    8. ras
      Link Parent

      really wild how long this has gone on. i just got back from taking my son to the ENT and their payment processing was down due to the outage. getting close to 11 hours since the first reports.

      1 vote
    9. lou
      Link Parent

      Mercado Livre, Brazil's eBay, cannot show when my delivery is arriving.

  3. trim
    Link

    I work in tech deploying software to customers through AWS. It's been a morning. Fortunately not too much impact on eu-west, but global services, our access to the console, and monitoring were affected. Kind of like being in the dark whilst fielding questions we can't answer about metrics we can't see.

    15 votes
  4. TaylorSwiftsPickles
    Link

    Almost no issues on my end. Everything's worked normally and I wouldn't even have noticed if 3 people hadn't tried to contact me on Signal when it was down. Works normally now.

    6 votes
  5. hamstergeddon
    (edited )
    Link

    It's times like these that I'm glad I never segued into devops. But then again, I'd have a bit more money to cry into if I had.

    Edit: as it stands I can't access my Jira board, which is going to make for an interesting stand-up meeting in about 15 minutes :)

    5 votes
  6. redwall_hp
    Link

    Fun take from The Register: Today is when the Amazon brain drain finally sent AWS down the spout

    Good to call out the three solid years of layoffs and forced attrition through return-to-office policies at Amazon...

    5 votes
  7. zestier
    Link

    Since it is marked as resolved, here's probably the final update until they release a full writeup.

    Oct 20 3:53 PM PDT Between 11:49 PM PDT on October 19 and 2:24 AM PDT on October 20, we experienced increased error rates and latencies for AWS Services in the US-EAST-1 Region. Additionally, services or features that rely on US-EAST-1 endpoints such as IAM and DynamoDB Global Tables also experienced issues during this time. At 12:26 AM on October 20, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints. After resolving the DynamoDB DNS issue at 2:24 AM, services began recovering but we had a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB. As we continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch. We recovered the Network Load Balancer health checks at 9:38 AM. As part of the recovery effort, we temporarily throttled some operations such as EC2 instance launches, processing of SQS queues via Lambda Event Source Mappings, and asynchronous Lambda invocations. Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered. By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours. We will share a detailed AWS post-event summary.
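
    For anyone curious what "DNS resolution issues for the regional DynamoDB service endpoints" looks like from the outside, a plain resolution check like this (nothing AWS-specific, just the public endpoint name) is enough to tell a DNS failure apart from an API-level error:

    ```python
    # Resolve the regional DynamoDB endpoint directly to distinguish
    # "DNS is broken" from "the API is returning errors".
    import socket

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)})
        print(f"{ENDPOINT} resolves to: {addrs}")
    except socket.gaierror as exc:
        # During the outage window, callers would have hit something like this.
        print(f"DNS resolution failed for {ENDPOINT}: {exc}")
    ```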

    3 votes
  8. [3]
    Eji1700
    Link

    As a shop that uses Azure instead of AWS, bullet dodged internally, BUT naturally our 3rd-party vendors use AWS, so it's still a mess from upstream data.

    2 votes
    1. [2]
      tanglisha
      Link Parent
      Here’s an article from 2019 by someone who tried to completely block Amazon from their life.
      5 votes
      1. CannibalisticApple
        Link Parent

        Wow. That was a fascinating read, and probably worthy of its own individual topic. It highlights that Amazon really is too powerful, given how many services it can impact.

        3 votes
  9. [5]
    TheFireTheft
    Link

    I couldn't find anything in the article about the root cause, but let me guess:

    Config related?

    Did somebody mess up a YAML file?

    2 votes
    1. [4]
      zestier
      (edited )
      Link Parent

      https://health.aws.amazon.com/health/status has a handful of updates. The one that most directly answers your question is:

      Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery. This issue also affects other AWS Services in the US-EAST-1 Region. Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues. During this time, customers may be unable to create or update Support Cases. We recommend customers continue to retry any failed requests. We will continue to provide updates as we have more information to share, or by 2:45 AM.

      I'm reasonably confident that the teams in charge of DDB are in Seattle, so it's very unlikely they were doing any nighttime deployments of any kind (especially on a Sunday night).
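
      Also, since the update tells customers to "continue to retry any failed requests", the boring client-side answer is retries with capped exponential backoff and jitter; a minimal sketch (the wrapped call is just a placeholder, not a specific AWS API):

      ```python
      # Minimal retry helper for the "keep retrying failed requests" advice.
      # The operation being wrapped is a placeholder, not a specific AWS call.
      import random
      import time


      def retry_with_backoff(op, attempts=5, base=0.5, cap=30.0):
          for attempt in range(attempts):
              try:
                  return op()
              except Exception:
                  if attempt == attempts - 1:
                      raise
                  # Full jitter: sleep a random amount up to the capped exponential.
                  time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


      # Usage: retry_with_backoff(lambda: some_failing_sdk_call())
      ```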

      3 votes
      1. [3]
        first-must-burn
        Link Parent

        It's always DNS.

        11 votes
        1. [2]
          trim
          Link Parent

          BGP enters stage left, glancing furtively.

          4 votes
          1. first-must-burn
            Link Parent

            DNS stands on BGP's shoulders in a trench coat, and they try to pass as an application level error.

            3 votes