24 votes

Twitch, Pinterest, Reddit and more go down in Fastly CDN outage

32 comments

  1. [19]
    ainar-g
    Link
    Centralisation strikes again. If this comment on HN is to be believed, even gov.uk was down, which is particularly concerning. I could understand private enterprises using other companies' servers to store assets, but governments? Call my views outdated, but I think that that's just a disgrace.

    7 votes
    1. [12]
      mat
      Link Parent
      I'm genuinely curious as to what you think is disgraceful about it?

      AWS etc are, in many cases, cheaper and more reliable than running one's own servers and probably more secure as well, given the amount of budget state projects generally don't have for IT. Why shouldn't a government project use industry standard practices which offer better value for money than other methods?

      I'd much rather gov.uk spent my money on their world-leading user interface work than wasting it on hosting and systems management.

      12 votes
      1. [3]
        ainar-g
        Link Parent
        Why shouldn't a government project use industry standard practices which offer better value for money than other methods?

        Minimisation of dependencies (simpler, more understandable, and thus more reliable infrastructure), maximisation of control over your data (is it easy to migrate all of your data to the competitors' service? or analyse large chunks of it? or delete a part of it right now?), actual support for decentralisation of the web (and thus decrease of power of the big players).

        As I see it, most websites—commercial or not—fall into one of two groups:

        • Those with a user base small enough that they can basically host their own data on one or two machines.

        • Those that are bigger, and so probably already have a dedicated sysadmin team anyway, so having your own small CDN isn't that much of an issue. Here, the additional benefit is that when somebody notices an issue, reporting it to the sysadmin team is sometimes as easy as walking to the other side of the open space.

        And based on my personal experience with web development in both big (≥ 100,000 people) and small (≤ 100 people) companies, the only real reason I've seen so far for moving to external infrastructure is way more banal: the company just can't find enough sysadmins.

        9 votes
        1. Adys
          Link Parent
          Minimisation of dependencies (simpler, more understandable, and thus more reliable infrastructure)

          Just because the infrastructure is simpler doesn't make it more reliable.

          Governments can deal with ridiculously high workloads. COVID announcement platforms for example which everyone checks at once during a press release.

          Governments aren't in the business of creating and providing their own CDNs. Fastly does it fine.

          I don't really know what you're trying to claim. This post is honestly a bit reminiscent of the low quality HN trash of "oh but this is so simple I could do it in a weekend". There are valid reasons why you might not want the UK govt to use fastly, for example privacy and visibility over what British citizens are accessing. But simplicity?

          9 votes
        2. mat
          Link Parent
          Dependencies aren't an issue if you abstract everything properly. It's relatively trivial to migrate between backends provided you've done your job right during setup.
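A minimal sketch of the kind of abstraction described above, in Python. Every name here (`BlobStore`, `InMemoryStore`, `PrefixedStore`, `migrate`) is hypothetical and illustrative, not a real provider SDK. It only shows that when application code talks to a narrow interface, moving data between backends can reduce to a copy loop.

```python
from typing import Dict, List, Protocol


class BlobStore(Protocol):
    """Hypothetical interface: the only surface the application touches."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def keys(self) -> List[str]: ...


class InMemoryStore:
    """Stand-in for one backend (assumption: a simple key/value store)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def keys(self) -> List[str]:
        return list(self._blobs)


class PrefixedStore:
    """Stand-in for a second backend with a different internal key layout."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[self._prefix + key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[self._prefix + key]

    def keys(self) -> List[str]:
        # Strip the internal prefix so callers see the same keys everywhere.
        return [k[len(self._prefix):] for k in self._blobs]


def migrate(src: BlobStore, dst: BlobStore) -> int:
    """Copy every blob from src to dst; return how many were copied."""
    count = 0
    for key in src.keys():
        dst.put(key, src.get(key))
        count += 1
    return count


old = InMemoryStore()
old.put("index.html", b"<h1>hello</h1>")
new = PrefixedStore("v2/")
copied = migrate(old, new)
```

This covers only the data-copy step. It says nothing about credentials, DNS, or coordinating many systems at once.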

          Making things understandable is best done by doing what most other people are doing. Which just on the numbers means doing AWS. But y'know, write clear code and document stuff and employ people with a clue and that's not an issue anyway.

          You'd be amazed how many large sites don't have a full-time sysadmin, let alone a team of them. I personally handled the hosting/backend for the websites of several large multinational companies, in my spare time between developing front end stuff for them and other clients. Once everything's set up it just goes most of the time. If it doesn't you probably need a better sysadmin. Places like reddit and facebook and stuff who do have teams of people doing this stuff are so far out of the league of almost everyone else, there's no real comparison.

          If you need someone to actually walk into your IT department to tell them there's a problem then your IT department aren't doing their jobs properly. They should be the first to know because their monitor board should be lighting up, phones beeping, emails getting fired off and so on.

          When you get to really large scale, hosting is just a pain in the arse which any decent devops person will absolutely advise offloading to one of the big cloud providers. At gov.uk scale the kind of distributed and elastic services provided by AWS are an absolute godsend, compared to what a shitshow hosting was in the bad old days. Need more grunt? It's already happened by itself. Got an outage? Someone's fixing it almost before your monitoring has even picked up the problem, and it didn't affect everyone anyway because the whole system is distributed.

          the company just can't find enough sysadmins.

          Or to put it another way - the company has realised they don't need to spend money paying someone to do a job which doesn't need to exist.

          6 votes
      2. Octofox
        Link Parent
        I think people are forgetting what these modern services are replacing. The traditional government services model involved very limited opening hours. A government page going down for an hour once every few years is still an outstanding improvement.

        And as you say, self managed services fail even more.

        8 votes
      3. [7]
        archevel
        Link Parent
        I wouldn't call it disgraceful, but I do think there are plenty of valid reasons why we would want the government to control its own infrastructure.

        1. Privacy concerns. If e.g. taxes were handled on AWS then Amazon could potentially know a lot more about not only people but their competitors too. Similarly for healthcare systems and a whole host of sensitive information the government processes about individuals. Maybe you are fine with that, maybe you could argue Amazon would be better at managing that than the government. Still I think it's a valid concern.
        2. National security concerns. In the extreme case, should the nuke launch interface be hosted at Amazon? There are of course degrees here, and maybe nukes shouldn't be accessible from the web with an open api and a nice swagger definition. But lesser things such as logistics systems should they depend on Amazon?
        3. Building competence. A former colleague had a nice saying: if it hurts do it more often. By that he meant that in order to learn something and become good at it you need to be exposed to it. That's the best way of figuring out what things you need to change in order to improve. So if government is bad at hosting and security and XYZ they should either stop doing it completely or, in case it is something government should do, do it more.
        4. Economics. Government is big and likely has enough needs to justify building a clone of AWS to host its services. Running AWS services isn't cheap compared to what you could buy if you had your own server halls. AWS is a lot more convenient, but maybe a gov clone could make it almost as easy?
        5. Lock-in in choice of vendor. Sure you could have a lengthy procurement process picking either AWS, GCP or Azure or someone else. But then you're basically locked in. Migrating from one provider to another would be a massive undertaking. Might be better to avoid that mess altogether.

        There are also a bunch of reasons why using a cloud provider might be preferable...

        7 votes
        1. [4]
          mat
          Link Parent
          1. If you're not encrypting that sort of sensitive data then you're not doing your job at all right.
          2. That is not a thing. Nobody is saying let's do that. We're talking about a simple public service website here. The worst that happens is someone can't buy their car tax for a few hours. Obviously security and critical infrastructure should be on more robust systems (although even then, AWS/etc manage seriously good availability - which is why it's news when they do go down)
          3. Yeah, that's not very convincing, sorry. Doing things the easy way is fine. If it works, it works. In the case of government, they should stop wasting time and taxpayers' money doing things the hard way just to prove a point. It doesn't make you better at sysops to run your own iron, same way that running an AWS/Azure/etc instance doesn't magically make you better at security (although it does provide some protections self-hosting doesn't). fwiw, one of the ways I know someone understands how to do hosting right is that above a certain, very small scale, they recommend doing it on a distributed service.
          4. Economics is exactly why AWS/etc is a better choice. Why pay to build, connect, populate, staff and maintain datacentres which you mostly don't need, so that the few times you do, they're there? That's wasting money whether you're a government or a company or an individual. AWS is far more economical than running a DC. That's exactly why those services are so popular. As an aside, I have a friend in High Performance Computing who used to get to manage a minor supercomputer for a large biotech firm - he got to work in a big room full of racks and ducts and stuff, sounded great. When it was decommissioned, they didn't bother replacing it because it was cheaper to just run up a big AWS compute instance to run a model on, then power it off. Now all my friend does is software stuff.
          5. Meh, it's not an issue if you set up right. Migrating between backends is trivial if you've planned for that in advance. Certainly it's not significantly more work than moving servers you outright own.

          The reasons cloud hosting is better are far more than the reasons it's not. Which is why almost everyone does it now.

          6 votes
          1. [3]
            archevel
            Link Parent
            1. You don't have control over the source of entropy so your encryption can be poisoned by AWS. Likely? No, but as a government it should hopefully be included in your threat assessment.
            2. It's a spectrum right? Military stuff might not be ok, but perhaps the app where absences are reported for school kids is fine. What about police radio systems? Judicial record keeping? 911 dispatch systems?
            3. I think you're right it isn't too convincing an argument. Looking at it now it is a bit circular in its reasoning and could be used to say that government should do almost anything. Late night comedy, government isn't great at that so should do more of it! However, if we limit it to things government should do (whatever that is defined as) learning to do it well is crucial and IMO best done through practice.
            4. Governments likely have enough systems running that it does make sense economically to run them themselves. This is just a guess on my end, but taking a look from my personal experience we did the numbers for running some metrics analytics system we have in the cloud vs on prem buying the hardware. On prem solution was cost effective in a matter of months! YMMV of course. The biggest reason for small startups to use a cloud solution is more often because, short term the investment doesn't make sense and you don't want to spend time setting up basic infrastructure as opposed to building your product. Government is in a vastly different problem setting. Their systems usually need to move much slower.
            5. Have you done this for any sizeable organisation? I think you are vastly underestimating the effort that would be required to move all gov systems from one place to another. Each system might not be too hard in and of itself, but the aggregate problem and coordination required seems staggering to me.
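The cloud versus on-prem comparison in point 4 comes down to a break-even calculation. A rough sketch in Python follows; the function name and all three input figures are made up for illustration and are not the numbers from the analytics system mentioned above. It also ignores staffing, power, and hardware refresh, which can dominate in practice.

```python
import math
from typing import Optional


def break_even_months(hardware_cost: float,
                      onprem_monthly: float,
                      cloud_monthly: float) -> Optional[int]:
    """Months until buying hardware becomes cheaper than renting.

    Returns None when on-prem running costs are not lower than the
    cloud bill, because the up-front purchase is then never recovered.
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


# Purely illustrative inputs (assumptions, not real quotes):
months = break_even_months(hardware_cost=60_000,
                           onprem_monthly=2_000,
                           cloud_monthly=12_000)
```

With a steady, predictable workload the saving term is large and the answer is small. With a bursty workload, on-prem capacity has to be sized for the peak, which raises `hardware_cost` and pushes the break-even point out.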
            3 votes
            1. TheJorro
              Link Parent
              • Exemplary

              So I've got firsthand experience with G7 governments and finding IT solutions for large-scale enterprise and public projects.

              1. Government would not look twice at a service that hasn't been rated independently for security multiple times. The US even has its own rating system with FedRAMP, where they audit any service they'll use for security and make sure it's a certain level of compliant before they go with it. The HIPAA rating is also a good indicator that the company will absolutely not be spying on your data because health records are some of the most universally protected data out there with high penalties for breaches. A service that breaks these would not only be looking at losing very lucrative government contracts, they'd also be looking down the barrel of a very angry government.

              2. Military technology is almost never mixed in with regular government systems, they're often independently operated for security reasons. This will also vary the most by country and agency but, for what it's worth, either the CIA or the NSA use AWS.

              3. Government IT needs to be blown up in general. In my experience, it has become an unregulated, hidden fiefdom of people who answer only to themselves. I may be rather jaded on this subject, but in general I agree with your general sentiment on this. Government should prioritize IT more and actually practice doing it better, the same as they do any other policy. Many governments have created Digital groups in an attempt to start addressing this more but that's an experiment that's still ongoing.

              4. This relates to 3. in that there's just not much push within governments so far to actually do this. Mine recently built a data centre that was 5 years out of date when it opened. It hasn't changed at all since opening and is now further behind than ever. Further, there's an internal economy in many governments that is just flat out broken—it can often cost more to get IT products internally than it would to go outside. Getting more server space in this data centre costs thousands, takes two weeks, and requires three or four levels of approval. AWS provides it for pennies immediately. Something like building their own cloud servers would take so long to happen that they would be outdated as soon as they're done. It would make sense for G7 governments to have their own internal cloud storage systems... but by the time they build them, it will probably be years too late and many dollars short. Government just doesn't prioritize or innovate in tech like Amazon, Google, Microsoft, and IBM do and they certainly don't have the resources to compete.

              5. Government has been outsourcing many things to external IT vendors since they first started implementing IT at all. Vendor lock-in has always been around, especially with IBM and Oracle at the helm of many contracted builds. People would be shocked to find out how many government systems are running on ancient, very insecure versions of Java. It's difficult to transfer data between systems already with government, and can easily blow up. Cloud providers like Amazon, Microsoft, and Google are actually miles ahead of the curve in this regard.

              Basically, the issue lies beyond just looking at the technology itself. There are many other issues and factors at play when it comes to government IT, starting with how IT is even seen within government in the first place.

              11 votes
            2. mat
              Link Parent
              1. Well, you do because you don't have to do everything on an AWS system, you could offload entropy to a machine you have full control over but at that point I think you might be drifting a little out of the realm of security conscious and into paranoia. Can we really trust the private companies who own the backbone the data is transiting over? How about the SSD manufacturers? etc. etc.
              2. At the moment we are talking about a website. It's mostly static information pages. You can buy your car tax and stuff. There's some mildly sensitive information involved. Slippery slope arguments rarely stand up: "AWS is bad for handling real-time nuclear systems therefore we shouldn't use it to host a page of advice about your house's sewerage rights"
              3. I think I mentioned this before but industry best practice for website hosting usually is AWS/etc. It's a very attractive compromise of availability, scalability, value for money and so on.
              4. You could probably host gov.uk for a decade for the price of building a single small DC, let alone provisioning and running the thing. Look at it like this, gov.uk is a considerably smaller project than reddit and reddit don't have their own DCs - because AWS is cheaper for them. The UK government does have its own datacentres, but they're for more serious stuff. GCHQ (roughly the UK equivalent of the NSA) has some absolutely bonkers data storage and processing infrastructure, all done in-house, by top notch staff and behind insane security. I don't doubt the military and police have stuff too, not to mention the NHS (which is chronically underfunded and out of date but still).
              5. But we're not talking about all government systems. It's just a website. One which contains some sensitive information and has a few slightly unusual requirements - but so do plenty of websites. I don't know exactly how big gov.uk is, but it's mostly flat text-only pages, some PDFs and a few databases. It's not going to be huge. I can guarantee I've worked on bigger projects in terms of data storage requirements and no, moving them isn't that big of a deal, even in the days before everything was neatly virtualised and/or containered and changing cloud backends became just a matter of adjusting a few variables in a config file (it's not 100% that simple but it's not far off).
              6 votes
        2. [2]
          Micycle_the_Bichael
          Link Parent
          Privacy concerns. If e.g taxes were handled on AWS then Amazon could potentially know a lot more about not only people but their competitors too. Similarly for healthcare systems and a whole host of sensitive information the government processes about individuals. Maybe you are fine with that, maybe you could argue Amazon would be better at managing that then the government. Still I think it's a valid concern.

          If you're working with PII data and it isn't always encrypted and Amazon can access that data then you've already failed basic PII data compliance rules, doesn't matter if you're in the cloud or on-prem. That has nothing to do with who is hosting and has to do with basic security.

          3 votes
          1. archevel
            Link Parent
            Well, if you are on Amazon they are your source for random values, which is to say they could poison any keys you generate. You can encrypt all you want, but if the attacker controls your key gen you've basically lost already. But the likelihood of that being the case is obviously fairly low.

            2 votes
    2. [6]
      Greg
      Link Parent
      Really? I can see where you're coming from, but I don't see a major issue with governments using standard off the shelf services rather than reinventing the wheel. In most cases, I'd actively prefer it.

      I'd rather they use what works, and works cheaply and reliably. This one newsworthy outage doesn't change that - there's every chance that their own government run systems could drop for a few hours, it just wouldn't be nearly as widely reported because this affects an appreciable portion of the internet.

      Obviously that centralisation is an issue, and it raises important questions about overall system resilience and the value of diverse fallback options for critical systems, but for non-critical public service websites I see this as a gentle nudge to look at their failover strategy, nothing more than that.

      6 votes
      1. ainar-g
        Link Parent
        […] there's every chance that their own government run systems could drop for a few hours, it just wouldn't be nearly as widely reported because this affects an appreciable portion of the internet.

        Exactly. The point is not having juggernauts whose one failure makes a large portion of the Internet, including government services, go boom at the same time. And I think that governments should in general strive towards that goal, including by setting an example.

        7 votes
      2. Greg
        Link Parent
        Interesting side note: both gov.uk and the BBC did have redundant systems in place with a second provider.

        The BBC's worked, the government's didn't (or at least didn't immediately) because the switchover needed manual intervention.
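The difference between a switchover that needs manual intervention and one that does not can be sketched in a few lines of Python. This is an illustration only, not how gov.uk or the BBC actually do it: the provider names and the `pick_provider` helper are hypothetical, and the health check is injected as a plain function so the example stays self-contained.

```python
from typing import Callable, Optional, Sequence


def pick_provider(providers: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy provider in priority order, or None.

    Running this on every health-check tick is what makes failover
    automatic: nobody has to notice the outage and flip a switch.
    """
    for name in providers:
        if is_healthy(name):
            return name
    return None


# Hypothetical state during an outage of the primary CDN.
status = {"primary-cdn": False, "secondary-cdn": True}
active = pick_provider(["primary-cdn", "secondary-cdn"],
                       lambda name: status[name])
```

In a real deployment the health check would probe an endpoint and the result would drive a DNS or load-balancer change, both of which add their own propagation delay.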

        6 votes
      3. [3]
        vord
        Link Parent
        This one newsworthy outage doesn't change that

        But at what point should it? "Small configuration error wipes out large quantities of the internet." is rapidly becoming a trope. The only answer is proper decentralization. That way when someone does screw up, it doesn't wipe out half the internet.

        1 vote
        1. [2]
          Greg
          Link Parent
          As I see it, reliability and centralisation are orthogonal: fastly is reliable, and remains so as long as it maintains some arbitrary but reasonable number of nines uptime. Even with this outage, they're highly likely to be better than 99.95% over the year.
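For a sense of scale, an availability percentage converts to a yearly downtime budget with simple arithmetic. A quick Python sketch, assuming a 365-day year and a hypothetical helper name:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Hours per year a service may be down and still hit the target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.95, 99.99):
    print(f"{pct}% uptime allows {downtime_budget_hours(pct):.2f} h/year down")
```

At 99.95% the budget works out to a little under four and a half hours per year, so a single short outage can fit inside it.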

          Centralisation isn't about reliability per se, it's about the impact of failure when it does happen. Obviously if people were centralising on a less reliable service for some reason that'd be an issue in and of itself, but what we're talking about here is more the issue of very reliable services having a huge impact even from the small outages they do have.

          On that, I'll throw the question back to you: at what point are we too centralised? Is it enough to have 5-10 major players and pick any two for redundancy? What about only 3 major players? Or does everyone need to run a CDN? If so, how concerned are we by the centralisation of the underlying connectivity between those CDN edges?

          I don't have answers, and to some extent there are no absolute right or wrongs here, but for now I'm less worried about the centralisation itself given that most companies aren't taking advantage of the very real option to use a second redundant provider. If they don't value uptime enough to do that (perhaps reasonably so, depending on the impact of downtime), I wouldn't expect them to do a better job of running their own infrastructure.

          3 votes
          1. vord
            Link Parent
            On that, I'll throw the question back to you: at what point are we too centralised? Is it enough to have 5-10 major players and pick any two for redundancy? What about only 3 major players? Or does everyone need to run a CDN? If so, how concerned are we by the centralisation of the underlying connectivity between those CDN edges?

            I'm a full proponent of decentralization, so IMO if your company has more than 100 people, you should be running some sort of server closet with at least 3-4 IT generalists. 4 people in an on-call rotation can easily manage 100+ physical servers and accompanying software. If you're a large enough company that needs rapid, global distribution of content, then you should be purchasing and managing hardware accordingly. Perhaps just doing traditional co-location so you don't need to worry about acquiring buildings, but still actually owning and managing your own hardware and software stacks.

            Not owning hardware means your company won't understand hardware. You'll be taking vendors at their word because there won't be any organizational knowledge to sniff out the bullshit. And vendors lie. They lie a lot, and somehow they're turning a massive profit off their customers, and I promise you economies of scale are not the primary reason.

            There is a cyclical nature to centralized/decentralized, ownership/renting in computing. We might have hit a point where there's no coming back from centralized renting, just due to the tremendous migration to it in the last decade. I hope not though because the reverse is much more interesting.

            2 votes
  2. [2]
    Deimos
    Link
    Fastly's official blog post about the incident: Summary of June 8 outage

    7 votes
    1. AugustusFerdinand
      Link Parent
      Pertinent part:

      Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.

      Single user changed a setting and brought the whole thing down. That's one hell of a bug.

      4 votes
  3. [2]
    HoolaBoola
    Link
    I thought I was going crazy when emojis suddenly stopped working correctly on Twitter. Now I see it's not an isolated problem. GitHub also seems to not show profile pictures.

    5 votes
    1. raze2012
      Link Parent
      it was surprisingly widespread. I got a new laptop and didn't think much of reddit being down (happens semi-often). But I couldn't even access servers needed for me to download Git (not github, the VCS itself) and Python, so I was wondering for a moment if something on my network was broken.

      1 vote
  4. [4]
    ali
    (edited )
    Link
    If there's a better news source, maybe someone can (or I will later) update the link.

    It's been down for like half an hour now I think. Pretty crazy how many big services go down at once, when amazon has issues

    edit: thanks @mycketforvirrad

    3 votes
    1. [2]
      bhrgunatha
      (edited )
      Link Parent
      It's fastly's cdn causing websites to fail.

      Out of interest can you elaborate on why you say amazon? I mean maybe amazon is the root cause of fastly's failure because I don't know.

      Edit: OK I think I found an explanation in that thread

      3 votes
      1. ali
        Link Parent
        I didn't see that it was a fault with fastly, I just assumed it was AWS issues

        2 votes
    2. mycketforvirrad
      Link Parent
      No worries. I'm following the fun on Twitter at the moment. Sure beats the data entry I was half heartedly doing...

      1 vote
  5. Eric_the_Cerise
    Link
    I noticed it when both The Guardian and CNN went down. It took me about 15 minutes to figure out that it was actually Fastly, by which time there was already a sourced, historical reference to the incident in Wikipedia.

    Ironically, IsItDownRightNow was also down.

    2 votes
  6. [2]
    cloud_loud
    Link
    What does it mean if I didn't notice this? Was I just not affected?

    2 votes
    1. Wes
      Link Parent
      The outage was sweeping, but not very long. You could have easily been walking the dog or taking a shower while it occurred.

      5 votes
  7. knocklessmonster
    Link
    I was surfing the internet late last night and it damn near felt like half of the internet was gone. I quickly put together that it was Fastly and figured I'd read about it today (I have).

    I was trying to get quick information about some games before I turned in and the usual sites were dead.