Robots.txt governed the behavior of web crawlers for over thirty years; AI vendors are ignoring it or proliferating too fast to block
Link information
- Title: The rise and fall of robots.txt
- Authors: David Pierce
- Published: Feb 14, 2024
- Word count: 3,069 words
Honestly robots.txt has always been a voluntary, non-binding “gentlemen’s agreement” anyway. It doesn’t stop bad actors from acting badly. If the file is written thoughtlessly it can actually point said bad actors to areas of interest they might want to target.
I’m frankly shocked the idea wasn’t DOA like the Do-Not-Track header was. They’re birds of a feather.
robots.txt was for their convenience; Do Not Track is for ours.
It's for the convenience of both the website owner and the crawler. Editing a text file is easier than implementing a way to block crawlers, and web crawlers don't need heuristics to figure out when they're not welcome.
(This is true of implementing most Internet standards; people use them because they're mutually beneficial.)
I'm not sure about it not being a legal document. Has any court ruled on that? Seems like if an emoji can be a legal contract under some circumstances, maybe robots.txt could be too?
One argument that robots.txt doesn't matter legally is that no human read it. (Seems like that could be fixed with legislation that requires bots to abide by it?)
Anyway, there have always been smaller crawlers that ignore robots.txt. If you want to block them, you need to use other means. robots.txt just makes it easier to get cooperating crawlers to go away; you don't have to block them.
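That cooperative check is simple enough that it's in Python's standard library. A minimal sketch, using a hypothetical robots.txt that asks OpenAI's GPTBot to stay away while leaving the rest of the site open to other crawlers (the domain and paths here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a request, not an enforcement mechanism.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A cooperating crawler checks before fetching:
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))  # False
```

Nothing in the protocol makes a crawler run this check; a bot that skips it sees exactly the same pages, which is the whole "gentlemen's agreement" problem.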
I don't think this convention is breaking down much more than usual, considering that OpenAI started respecting it, and the other big crawlers do too. Will that change?
For someone who wants to be indexed by Google's search engine but not have their web pages used to train Google's AI, it's a tough choice. But then again, this has always been true. Google's been using machine learning for a long time.
It's fascinating watching this new wave of AIs open up every crack in an internet that was already creaking under the weight of platform decay. A lot like how COVID exposed every weakness in healthcare and welfare systems.
This really seems like the moment that will eventually bookend this phase of the Internet; what the next one will look like, I have no idea.
"The old web is dying, the new one struggles to be born. Now is the time of nonsense."
From the article: