10 votes

More details about Facebook's October 4th outage

5 comments

  1. [4]
    vord

    our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.

    This paragraph, right here, is why medium-large organizations should own their own infrastructure.

    When you're purely on "the cloud", you are 100% at the mercy of your provider to get everything fixed for you. If you're not the biggest fish in their pond, you're gonna be a lower priority. For the bigger providers, the biggest fish is their own internal services.

    With centralized providers come centralized failures.

    Let's suppose an AWS outage happens on a similar scale for a similar timeframe. It could spell outright disaster for many that rely on them.

    Decentralizing the net means if a cloud provider goes down, the rest of the net doesn't go down with it.

    4 votes
    1. mat

      This paragraph, right here, is why medium-large organizations should own their own infrastructure.

      But Facebook do own their own infrastructure. They own more of it than some cloud providers! That everything was running on their own network was how this problem could happen at all. Also, this was a perfect (shit)storm situation - I doubt we'll see this kind of failure again because I know for a fact that FB are right now reviewing their protocols to make sure this is impossible next time. As is anyone else with a clue (or financial interest) who operates at that level - but most people don't need to go near BGP ever.

      With centralized providers come centralized failures.

      Yup. And that's why cloud services are so attractive. They're not centralised. Personally I'd much rather mitigate my risk by having my hosting and IT services in virtual machines spread over multiple DCs with lots of dark capacity - like AWS or Akamai or whatever - than run my own hardware. Hardware is a pain, there's a good reason it's been so successfully turned into a service.

      It could spell outright disaster for many that rely on them.

      If your availability requirements are of the "outright disaster after a couple of hours" kind, then you're already spread over multiple providers, and possibly your own hardware as well, as a final final panic backup.

      8 votes
    2. [2]
      Adys

      I agree in theory. But in practice, most medium-large orgs don't have the technical capacity to do this, and if they were to try, it'd be a security nightmare.

      6 votes
      1. vord

        Honestly, I consider that mostly a vendor lie to mask how easy this stuff is on a small-medium scale.

        In 2000ish, I helped run a small datacenter with 2 other techs, also doing tech support, for a company with 200 something employees. I was just an intern, but we managed 2 racks, a few standalone units, a small dedicated AC system, and some networking hardware.

        Automated tools were not nearly as powerful back then, and desktops were less robust and more expensive. We often hacked together systems using leftover parts and built our own custom OS images.

        You learn to set up firewalls, keep things patched, and take backups, and 95% of the problems are immediately solved.

        Large orgs can easily figure this stuff out. Somehow the big tech companies did. There's no magic secret sauce that makes them different, other than the non-giants not being able to figure out the value of these things on their own.

        4 votes
  2. Adys

    Some discussion on HN here (~250 comments).

    Highlights: Thread 1, Thread 2

    1 vote