39 votes

FOSS infrastructure is under attack by AI companies

8 comments

  1. redshift
    It's a great post. The only part I disagree with is saying this:

    "Over Mastodon, one GNOME sysadmin, Bart Piotrowski, kindly shared some numbers to let people fully understand the scope of the problem. According to him, in around two hours and a half they received 81k total requests, and out of those only 3% passed Anubis's proof of work, hinting at 97% of the traffic being bots – an insane number!"

    ...while also saying:

    "there's one user reporting one minute delay, and another - from his phone - having to wait around two minutes."

    I can easily imagine the majority of humans not waiting 1-2 minutes for a website to load. I doubt it's 97% bots.

    That said, I would do the same thing, regardless of the real percentage - it's ludicrous the hosting costs they racked up because of irresponsible bots crawling their pages with no respect for robots.txt.
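
    To put rough numbers on the 97% question above, here is a minimal sketch. It assumes each request is one visitor attempt, that no bot ever passes the proof of work, and that some fraction of humans gives up before it finishes. The function name and the abandonment rates are made up for illustration; only the 81k and 3% figures come from the post.

```python
# Figures quoted in the post: about 81k requests, of which 3% passed the proof of work.
TOTAL_REQUESTS = 81_000
PASSED = 2_430  # 3% of 81,000


def bot_share(total, passed, human_abandon_rate):
    """Estimate the bot share of traffic.

    Assumptions (not from the post): every bot fails the challenge, and
    human_abandon_rate is the fraction of humans who leave before passing.
    """
    # If only (1 - abandon rate) of humans pass, scale the passes back up
    # to estimate how many humans arrived in the first place.
    humans = passed / (1.0 - human_abandon_rate)
    return (total - humans) / total


if __name__ == "__main__":
    # Hypothetical abandonment rates, chosen only to show the trend.
    for rate in (0.0, 0.5, 0.8):
        share = bot_share(TOTAL_REQUESTS, PASSED, rate)
        print(f"human abandon rate {rate:.0%}: estimated bot share {share:.1%}")
```

    Under these assumptions the estimate only drops from 97% to 85% even if four out of five humans give up, so the real figure is probably below 97% but still the large majority of traffic.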

    16 votes
  2. creesch
    The article touches on something that is becoming more and more of a problem. I have seen several posts and news articles about it in the past weeks.

    A bit annoying that all sources are screenshots rather than links to the actual sources. The habit of using screenshots rather than quotes is also something I have an opinion about.

    13 votes
    1. PuddleOfKittens
      Screenshots are just caching the source - linking is unreliable because links can go dead at any time, articles can be silently rewritten, tweets deleted, and e.g. Discord messages might be impossible to link in the first place.

      Archive.org might be a better solution, except 1) it isn't easily embeddable like a screenshot is, and 2) it's more legally dubious - if you cache an entire NYTimes article and lots of people read the cache on archive.org, then NYTimes could C&D archive.org for piracy, potentially.

      EDIT: it's still a huge fucking pain in the ass though; ideally they should do both, since a GitHub/GitLab issue tracker link is unlikely to change.

      5 votes
      1. creesch
        Your edit is exactly what my main issue with it is.

        1 vote
  3. DawnPaladin
    This is bad. Looks like scraping websites for LLMs is the new email spam: a parasitic way to make money by externalizing costs. We "solved" the email spam problem mostly by centralizing around a few big email services that bear the costs of spam filtering. These days it's very difficult to run your own small email server. I hope that doesn't happen to hosting.

    It's not clear to me whether OpenAI and Anthropic are actually behind this massive surge. Both of them claim to respect robots.txt. They could be lying, but the user-agent strings identifying these scrapers are commonly faked. Browsers do it all the time--Chrome starts its user-agent string with "Mozilla/5.0" to this day for historical reasons. It would be very easy for smaller companies to impersonate OpenAI and Anthropic here. But we don't know.

    Crossposting this to OpenAI's developer forum. Maybe it will get some traction there.
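
    As a small illustration of why the user-agent point matters, here is a sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the bot name "ExampleAIBot", and the URL are all made up; the only claim is that robots.txt rules are matched against whatever name the client chooses to send.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical URL on a documentation-reserved example domain.
URL = "https://example.org/repo/commits"


def allowed(user_agent: str, url: str) -> bool:
    """Return what robots.txt says for a client that reports this user agent."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The same crawler gets a different answer just by changing the name it sends.
    print("ExampleAIBot:", allowed("ExampleAIBot", URL))
    print("Browser-like:", allowed("Mozilla/5.0 (X11; Linux x86_64)", URL))
```

    Nothing in this check is enforced on the server side: a crawler that never consults robots.txt, or that sends a browser-like string, is not stopped by it. That is also why a log entry carrying a well-known company's bot name does not prove who sent the request.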

    7 votes
    1. TheMeerkat
      "It's not clear to me whether OpenAI and Anthropic are actually behind this massive surge. Both of them claim to respect robots.txt."

      They're not, and they do. The typical Western companies you would think of first are handling their scraping… well, as ethically as your personal viewpoints on AI allow.

      See this screenshot from the article: https://thelibre.news/content/images/2025/03/image-30a208ec17842ce4.png

      2 votes
  4. SteeeveTheSteve
    I was just wondering why an increasing number of websites have botcheck pages! Can these AI companies be sued for this? Looked like some were trackable and there's no doubt they're harmful. These companies should pay for all the manpower and resources they've taken up, as well as for the interrupted services.

    I wonder if an AI can be used to adapt and block them on the fly? Feed it logs and have it spot patterns.
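
    For what "feed it logs and have it spot patterns" could look like at its simplest, here is a sketch that is not AI at all, just a per-client request count. The log records, the function name, and the threshold are invented for illustration; real access logs have more fields, and a crawler that rotates addresses would slip past a per-IP count like this.

```python
from collections import Counter

# Hypothetical, simplified log records: (client_ip, path).
# The addresses come from ranges reserved for documentation.
SAMPLE_LOG = [("203.0.113.7", f"/repo/commit/{i}") for i in range(50)] + [
    ("198.51.100.2", "/"),
    ("198.51.100.2", "/about"),
]


def flag_heavy_clients(records, max_requests):
    """Return the sorted list of client IPs with more than max_requests requests."""
    counts = Counter(ip for ip, _path in records)
    return sorted(ip for ip, n in counts.items() if n > max_requests)


if __name__ == "__main__":
    # The threshold of 10 is an arbitrary value for this toy data.
    print(flag_heavy_clients(SAMPLE_LOG, max_requests=10))
```

    Anything smarter (a model trained on request timing, paths, or headers) would sit on the same kind of input, but whether it could keep up with scrapers that adapt is an open question.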

    Off topic: What's with the guy complaining his girlfriend would have an issue with him having that Anubis anime girl on screen, is anime that misunderstood? It's a chibi character, which means it's designed to be fun/cute. I'm not sure I'd want to be with someone who saw that and got upset.

    5 votes
    1. ButteredToast
      "Off topic: What's with the guy complaining his girlfriend would have an issue with him having that Anubis anime girl on screen, is anime that misunderstood? It's a chibi character, which means it's designed to be fun/cute. I'm not sure I'd want to be with someone who saw that and got upset."

      Some people seem to roughly equate anime with hentai or otherwise sexual content, no matter how innocent it is, so maybe that’s what’s going on. Some people also have extremely insecure/controlling/jealous partners, which can lead to disastrous results if said partner holds the anime ≈ hentai misunderstanding.

      1 vote