This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
Oof, that's gotta feel rough. Wonder if it was the tooling or the admin.
It's a little on the technical side of things, but I think this is the best write-up I've seen so far explaining exactly what happened in the outage of Facebook services yesterday.
Definitely even further on the technical end, but Julia Evans also made a good blog post showing how you can look into BGP: Tools to explore BGP
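If you want to poke at this yourself, here's a rough sketch of the sort of query those tools make easy: asking the public RIPEstat "announced-prefixes" endpoint what Facebook's AS32934 is currently announcing. This is my own example rather than one from her post, so treat the exact response fields as an assumption to double-check against the RIPEstat docs.

```python
# Rough sketch: list prefixes currently announced by Facebook's AS32934
# via the public RIPEstat Data API. Field names ("data" -> "prefixes" ->
# "prefix") are per my reading of the RIPEstat docs; adjust if they differ.
import requests

resp = requests.get(
    "https://stat.ripe.net/data/announced-prefixes/data.json",
    params={"resource": "AS32934"},  # Facebook's autonomous system number
    timeout=30,
)
resp.raise_for_status()

prefixes = resp.json()["data"]["prefixes"]
print(f"AS32934 currently announces {len(prefixes)} prefixes, e.g.:")
for entry in prefixes[:10]:
    print(" ", entry["prefix"])
```

During the outage that list of announcements is exactly what shrank as the routes were withdrawn, which is what the Cloudflare post walks through from their vantage point.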
It's linked in the Cloudflare post, but Facebook now has some details as well.
Well, both? It sounds like a two-layer Swiss cheese model and both layers screwed up.