45 votes

Who’s liable when your AI agent burns down production? How Amazon’s Kiro took down AWS for thirteen hours and why the ‘human error’ label tells you everything wrong about the agentic AI era.

14 comments

  1. [6]
    patience_limited
    Link

    From the article:

    *"The delete heard in a datacentre in China"*

    Somewhere in a datacentre in mainland China in December 2025, an AI coding agent called Kiro made a decision. A very bold one at that. The task in front of it (we don’t know exactly what it was) apparently had a solution. And the solution, Kiro determined, was to delete the cloud environment it was working in and rebuild it from scratch.

    This is, to be fair, sometimes a valid engineering move. Anyone who has spent three hours debugging a corrupted local environment before realising the fastest fix is to nuke it and start fresh has made this calculation.

    The difference is that when a human engineer makes that call, they’ve thought about it, checked that nothing critical lives in the environment, maybe groaned a little, and then typed the commands with full awareness of what they’re doing. (Bonus: in some cases, possibly even let the immediate team know about the impending move.)

    Kiro didn’t groan. Kiro just friggin’ did it.

    The result was a 13-hour AWS outage in mainland China. The Financial Times broke the story. Amazon responded with a statement that is, depending on your professional background, either entirely reasonable or one of the most extraordinary deflections in recent corporate history.

    The statement was approximately: this was human error, an operator granted Kiro excessive permissions, and here’s the part that lodges in the brain: “the same issue could occur with any developer tool or manual action.”

    We’ll come back to that sentence. It deserves its own section, a cuppa tea, and possibly a long walk afterwards.

    What makes this incident more than a standard cloud outage postmortem is not the 13 hours, or even the autonomous deletion. It’s the pattern it fits into: automation fails, the company reaches for the “human error” label, and documentation warning against the very thing that caused the failure turns out to have existed the whole time.

    This is the story of automated systems for the last fifty years, now wearing new clothes.

    The clothes are just considerably more capable (and ostensibly more dangerous) than they used to be.
    ...

    Amazon is both the company whose cloud infrastructure documentation says “don’t grant more permissions than necessary” and the company whose AI product was deployed with more permissions than necessary, causing an outage.

    To be precise: these are different teams at Amazon. The IAM best practices documentation is written by AWS’s security teams. Kiro is an Amazon product. The operator who configured the permissions is a customer (or a customer’s employee). None of these people necessarily talked to each other.

    That’s not a defence. It’s actually the more damning version of the story.

    The question isn’t whether the operator violated best practices (they did, apparently). The question is: did Kiro’s onboarding process make it straightforward to grant minimal, scoped permissions? Or did getting the tool to work in a real environment effectively require broad access?

    We don’t have the full setup logs. But based on what we know about how enterprises configure AI agents, and based on a senior AWS employee’s description of the incident as “small but entirely foreseeable”, the answer looks like the latter.

    Giving a new AI agent the appropriate minimum permissions to function is non-trivial. It requires understanding what the agent will actually do, which is harder to predict with an autonomous system than with a tool that only does what you explicitly tell it to do. If Kiro’s setup process nudged operators towards broad permissions because scoped permissions were too difficult to get right, that’s a product design failure, full stop.
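    To make the "scoped versus broad" distinction concrete, here is a minimal sketch of what least-privilege scoping for an agent might look like as an AWS IAM policy document. The action list and resource names are purely illustrative assumptions, not Kiro's actual requirements; the structural idea is real IAM semantics: enumerate only what the agent needs, and add an explicit Deny for destructive operations, which in IAM overrides any Allow.

    ```python
    import json

    # Hypothetical least-privilege policy for a coding agent. The specific
    # actions and ARNs are illustrative placeholders -- a real deployment
    # would enumerate what the agent actually calls. The explicit Deny
    # statement overrides any Allow under IAM's evaluation rules.
    scoped_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAgentWorkspaceAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": "arn:aws:s3:::example-agent-workspace/*",
            },
            {
                "Sid": "DenyEnvironmentDeletion",
                "Effect": "Deny",
                "Action": [
                    "cloudformation:DeleteStack",
                    "ec2:TerminateInstances",
                    "rds:DeleteDBInstance",
                ],
                "Resource": "*",
            },
        ],
    }

    print(json.dumps(scoped_policy, indent=2))
    ```

    Writing the Deny list is the easy half; the hard half, as above, is predicting the Allow list for a tool whose behaviour you can't fully anticipate, which is exactly why setup flows drift towards broad grants.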
    ...

    Ponder what that means for an agentic AI coding tool from Amazon (or for any other agentic tool like Cursor or GitHub Copilot for that matter). It doesn’t just sound capable; it comes with the implicit endorsement of a company that runs roughly 30% of the world’s cloud infrastructure and has invested $4 billion in Anthropic.

    When an engineer sets up Kiro, they’re not thinking “I should carefully scope every permission this thing might need.” They’re thinking “this is Amazon’s product, built for AWS, presumably designed to behave sensibly in AWS environments.”

    That assumption is automation bias in action. It’s not stupidity. It’s a documented cognitive pattern that affects everyone who works with automated systems, and it gets stronger as the systems get more sophisticated and more articulate.

    This is a good essay about the hazards implicit in agentic AI processes, broadly applicable outside Amazon. It discusses the human tendency to automation bias. It also highlights the way Amazon has deflected responsibility for design failures in a way that U.S. laws frown upon, citing a couple of examples (Knight Capital's $440 million loss due to a test algorithm accidentally left running, and Boeing's 737 Max crashes).

    30 votes
    1. [5]
      Minori
      Link Parent

      I will delete this in the future.

      I know an Amazon employee, and I can confidently say there have been at least five major outages caused by LLM tools deleting a production stack or otherwise fucking up a configuration. These are documented in internal Correction of Error reports (called COEs internally).

      The default permissions are expansive, and there are no good methods for gating certain tools or commands. The internal guidance encourages highly autonomous agents with too many permissions. They've ratcheted down any access to external tooling, and the internal ones end up like this.

      7 votes
      1. [4]
        nukeman
        Link Parent

        Why delete?

        1. patience_limited
          Link Parent

          Understandable in the context of potentially revealing a connection to someone who could lose their job, be prosecuted, and/or sued into penury by a giga-corporation for violating an NDA?

          3 votes
        2. [2]
          Minori
          Link Parent

          Heavily redact at the very least.

          1 vote
          1. uvt
            (edited )
            Link Parent

            Don’t worry, I’ll write my own Gen Z reinterpretation of your comment.

            Jeffy AI oof… big oof… AI bot… 1, 2, 3, 4, 5… not 6… 1, 2, 3, 4, 5, no server… 404… remove the server… outage… customer infrastructure… remove the server…

            Preventative measures… 🙅‍♂️ impossible… permissions… all of them… outage… 1, 2, 3, 4, 5…

  2. [7]
    skybrian
    Link

    Mature tech companies often have a "blameless postmortem" process where they create a full history of an incident and make recommendations at multiple zoom levels about how to fix the system so that nothing like that happens again. As the name indicates, the focus is on fixing the system, not blaming workers for mistakes. (That's reserved for situations where there is clear malice.)

    I expect that process will continue to work well with more AI automation. "Someone accidentally deleted the production database" isn't a new problem and the safeguards you need are similar.
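    One familiar safeguard from that playbook translates directly to agents: a confirmation gate in front of destructive operations. A minimal sketch, assuming a generic tool-call wrapper (the command names, prefixes, and approval mechanism are hypothetical, not any real framework's API):

    ```python
    # Hypothetical guardrail: intercept an agent's tool calls and refuse
    # destructive ones unless a human has explicitly signed off. This is
    # the agent-era equivalent of "are you sure?" before dropping a table.
    DESTRUCTIVE_PREFIXES = ("delete", "terminate", "destroy", "drop")

    class ApprovalRequired(Exception):
        """Raised when a destructive command lacks human approval."""

    def gated_call(command: str, *, approved: bool = False) -> str:
        """Run a tool command; destructive ones need explicit approval."""
        verb = command.split()[0].lower()
        if verb.startswith(DESTRUCTIVE_PREFIXES) and not approved:
            raise ApprovalRequired(f"refusing {command!r} without human sign-off")
        return f"executed: {command}"

    # A read-only command passes; a destructive one is blocked by default.
    print(gated_call("describe-stacks"))
    try:
        gated_call("delete-stack prod-env")
    except ApprovalRequired as err:
        print(err)
    ```

    The point isn't the prefix list (trivially incomplete) but the default: destructive actions fail closed until a human opts in, rather than succeeding unless someone remembered to forbid them.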

    Legal liability isn't a good lens to use when fixing the system.

    14 votes
    1. [4]
      patience_limited
      (edited )
      Link Parent

      Except that the outage was blamed on a customer's employee, who would have had neither access to Amazon's "blameless postmortem" process nor any internal discussion of how to apply best practices and make systemic changes.

      Kiro is a product that was released for public use without appropriate guardrails, and a liability framework is applicable.

      22 votes
      1. [2]
        skybrian
        Link Parent

        the outage was blamed on a customer's employee

        I don't have access to this article, but looking at the Financial Times article (archive) that it's apparently based on, that doesn't seem right? It sounds like the "user error" was that of an Amazon employee, so this is basically an Amazon self-own.

        Also, if a customer has any way to take out an Amazon service, using AI or not, that is definitely an Amazon bug and the outage would be worth doing a post-mortem on.

        7 votes
        1. patience_limited
          Link Parent

          Thank you for the correction. Amazon posted its own rebuttal to the FT article, which doesn't really exculpate them on the underlying point.

          Even if the risks for agents are identical to those of user error, there are changes in governance and systems design needed to mitigate and contain the blast radius of known failure modes. I ran across a playbook (paid content) that provides a good framework for what should have been done before turning the Kiro agent to act without supervision.

          9 votes
    2. [2]
      tanglisha
      Link Parent

      Amazon has very good blameless postmortems, I was quite surprised.

      Unfortunately, the company pits teams against each other, which can really hurt communication.

      11 votes
      1. Minori
        Link Parent

        They're encouraged to push post-mortems on other teams. It's a key part of the politics, especially at the senior level and up. Even the best orgs are subtly toxic due to the perverse incentives with stack ranking etc.

        5 votes
  3. Narry
    Link

    I'm just going to reiterate my concern from this comment that I'm worried my main marketable skill is going to be "legal meat shield for the AI", and point out that the "from when I said it to when it happened" time was <10 days. (Of course, I was talking there about being the "head" of an AI-driven company, but I was also concerned I'd wind up doing PLEASE from How I Met Your Mother.)

    4 votes