24 votes

The great LLM scrape

4 comments

  1. 0x29A
    (edited)
    Yeah, I lost my ability to see the number of RSS subscribers I have on bearblog because they had to disable the feature after bots made it useless. Just another one of a thousand problems with LLMs and the hostile powers and capital that own them.

    Even granting a world where these things exist and are hostile in plenty of other ways, they can't manage the bare minimum of respecting anything about the web, not even its traffic: they think they have the right to put this kind of pressure on site owners while also vacuuming up all the content. The hostility is the point.

    edit: here's some news of Perplexity's underhanded tactics in particular

    12 votes
  2. skybrian
    The legitimate search engine crawlers publish the IP addresses they crawl from, which provides a way to filter out their imitators. But it’s yet another thing to implement in the cat-and-mouse game.
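
    For what it's worth, Google publishes Googlebot's ranges as a JSON file, so the check itself can be fairly small. Here is a minimal Python sketch of the idea; the URL and the "prefixes"/"ipv4Prefix"/"ipv6Prefix" field names are recalled from Google's crawler docs, so treat them as assumptions to verify against the current documentation:

    ```python
    # Minimal sketch: confirm that a visitor claiming to be Googlebot actually
    # comes from one of Google's published crawler IP ranges. The URL and the
    # JSON field names below are assumptions based on Google's crawler docs.
    import ipaddress
    import json
    import urllib.request

    GOOGLEBOT_RANGES_URL = (
        "https://developers.google.com/static/search/apis/ipranges/googlebot.json"
    )

    def load_googlebot_networks():
        """Fetch the published Googlebot ranges and parse them into networks."""
        with urllib.request.urlopen(GOOGLEBOT_RANGES_URL) as resp:
            data = json.load(resp)
        networks = []
        for prefix in data.get("prefixes", []):
            cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
            if cidr:
                networks.append(ipaddress.ip_network(cidr))
        return networks

    def is_real_googlebot(client_ip, networks):
        """True if client_ip falls inside any published Googlebot range."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in networks)

    if __name__ == "__main__":
        networks = load_googlebot_networks()
        # A request with a Googlebot user agent but an address outside these
        # ranges is one of the imitators the comment is talking about.
        print(is_real_googlebot("66.249.66.1", networks))
    ```

    The same approach works for any crawler that publishes its ranges, but that's the cat-and-mouse part: each crawler has its own file to fetch, cache, and keep current.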

    9 votes
  3. lou
    I'm not versed in programming but I found this post interesting because it talks about some practical consequences of LLM companies scraping everything.

    7 votes
  4. gco
    I've personally felt the impact of this, and I'm sure I'm not the only one. A few years ago, when GDPR legislation came into force requiring sites to let users manage their cookies, every site would flash a pop-up making you choose (many still do). Now, visiting most websites means spending a second or so doing nothing while they verify you; thankfully this verification is not as hostile as reCAPTCHA, but I dread it might head that way.

    The other way I've been impacted is in my ability to run small scripts or automations that depend on information published on websites. For example, I love going to concerts, but demand is so high that I sometimes miss out on tickets, so I built a small workflow that checks the marketplace for secondary tickets and alerts me when some become available. That no longer works thanks to scraping protections.
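
    The workflow was nothing exotic: roughly a "fetch the page, look for a marker, alert me" loop. A hypothetical Python sketch of that shape is below; the URL and the "Sold out" marker are made-up placeholders rather than the real site, and this is exactly the kind of script that anti-bot verification now breaks.

    ```python
    # Hypothetical sketch of a "watch a page and alert me" workflow like the one
    # described above. The URL and the availability marker are placeholders;
    # real sites differ and increasingly block this kind of polling outright.
    import time
    import urllib.request

    TICKETS_URL = "https://example.com/event/secondary-tickets"  # placeholder
    CHECK_EVERY_SECONDS = 300

    def tickets_available(html):
        # Placeholder heuristic: assume the page says "Sold out" when nothing
        # is listed, and anything else means tickets might be up.
        return "Sold out" not in html

    def main():
        request = urllib.request.Request(
            TICKETS_URL, headers={"User-Agent": "Mozilla/5.0"}
        )
        while True:
            try:
                with urllib.request.urlopen(request) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
                if tickets_available(html):
                    print("Tickets may be available:", TICKETS_URL)
            except Exception as exc:
                # Scraping protections typically surface here as 403s or
                # challenge pages instead of the real content.
                print("Check failed:", exc)
            time.sleep(CHECK_EVERY_SECONDS)

    if __name__ == "__main__":
        main()
    ```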

    4 votes