24 votes

The great LLM scrape

4 comments

  1. 0x29A
    (edited)
    Yeah, I lost my ability to see the number of RSS subscribers I have on bearblog because they had to disable the feature after bots made it useless. Just another one of a thousand problems with LLMs and the hostile powers and capital that own them.

    Even granting a world where these things exist and are hostile in plenty of other ways, they can't manage the bare minimum of respecting anything about the web, not even its traffic: they think they have the right to put this kind of pressure on site owners while also vacuuming up all the content. The hostility is the point.

    edit: here's some news of Perplexity's underhanded tactics in particular

    12 votes
  2. skybrian
    The legitimate search engine crawlers publish the IP addresses they crawl from, which provides a way to filter out their imitators. But it’s yet another thing to implement in the cat-and-mouse game.
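
    For what it's worth, Google publishes Googlebot's ranges as a JSON file, so the check itself can be fairly small. Here is a minimal Python sketch of the idea; the URL and the "prefixes"/"ipv4Prefix"/"ipv6Prefix" field names are recalled from Google's crawler docs, so treat them as assumptions to verify against the current documentation:

    ```python
    # Minimal sketch: confirm that a visitor claiming to be Googlebot actually
    # comes from one of Google's published crawler IP ranges. The URL and the
    # JSON field names below are assumptions based on Google's crawler docs.
    import ipaddress
    import json
    import urllib.request

    GOOGLEBOT_RANGES_URL = (
        "https://developers.google.com/static/search/apis/ipranges/googlebot.json"
    )

    def load_googlebot_networks():
        """Fetch the published Googlebot ranges and parse them into networks."""
        with urllib.request.urlopen(GOOGLEBOT_RANGES_URL) as resp:
            data = json.load(resp)
        networks = []
        for prefix in data.get("prefixes", []):
            cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
            if cidr:
                networks.append(ipaddress.ip_network(cidr))
        return networks

    def is_real_googlebot(client_ip, networks):
        """True if client_ip falls inside any published Googlebot range."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in networks)

    if __name__ == "__main__":
        networks = load_googlebot_networks()
        # A request with a Googlebot user agent but an address outside these
        # ranges is one of the imitators the comment is talking about.
        print(is_real_googlebot("66.249.66.1", networks))
    ```

    The same approach works for any crawler that publishes its ranges, but that's the cat-and-mouse part: each crawler has its own file to fetch, cache, and keep current.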

    9 votes
  3. lou
    I'm not versed in programming but I found this post interesting because it talks about some practical consequences of LLM companies scraping everything.

    7 votes
  4. gco
    I've personally felt the impact of this, and I'm sure I'm not the only one. A few years ago, when GDPR legislation came into force requiring sites to let users manage their cookies, every site would flash a pop-up making you choose (many still do). Now, visiting most websites means spending a second or so doing nothing while they verify you; thankfully this verification is not as hostile as reCAPTCHA, but I dread it might head that way.

    The other way I've been impacted is in my ability to run small scripts or automations that depend on information published on websites. For example, I love going to concerts, but demand is so high that I sometimes miss out on tickets, so I built a small workflow that checks the marketplace for secondary tickets and alerts me when some become available. That no longer works thanks to scraping protections.
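
    The workflow was nothing exotic: roughly a "fetch the page, look for a marker, alert me" loop. A hypothetical Python sketch of that shape is below; the URL and the "Sold out" marker are made-up placeholders rather than the real site, and this is exactly the kind of script that anti-bot verification now breaks.

    ```python
    # Hypothetical sketch of a "watch a page and alert me" workflow like the one
    # described above. The URL and the availability marker are placeholders;
    # real sites differ and increasingly block this kind of polling outright.
    import time
    import urllib.request

    TICKETS_URL = "https://example.com/event/secondary-tickets"  # placeholder
    CHECK_EVERY_SECONDS = 300

    def tickets_available(html):
        # Placeholder heuristic: assume the page says "Sold out" when nothing
        # is listed, and anything else means tickets might be up.
        return "Sold out" not in html

    def main():
        request = urllib.request.Request(
            TICKETS_URL, headers={"User-Agent": "Mozilla/5.0"}
        )
        while True:
            try:
                with urllib.request.urlopen(request) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
                if tickets_available(html):
                    print("Tickets may be available:", TICKETS_URL)
            except Exception as exc:
                # Scraping protections typically surface here as 403s or
                # challenge pages instead of the real content.
                print("Check failed:", exc)
            time.sleep(CHECK_EVERY_SECONDS)

    if __name__ == "__main__":
        main()
    ```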

    4 votes