41 votes

Robots.txt governed the behavior of web crawlers for over thirty years; AI vendors are ignoring it or proliferating too fast to block

6 comments

  1.
    balooga

    Honestly robots.txt has always been a voluntary, non-binding “gentlemen’s agreement” anyway. It doesn’t stop bad actors from acting badly. If the file is written thoughtlessly it can actually point said bad actors to areas of interest they might want to target.
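
    For example, a file like this (hypothetical paths) is readable by anyone, compliant or not, and quietly advertises exactly where the sensitive stuff lives:

        User-agent: *
        Disallow: /admin/
        Disallow: /backups/
        Disallow: /internal-reports/

    A well-behaved crawler skips those directories; a hostile one treats the list as a map.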

    I’m frankly shocked the idea wasn’t DOA like the Do-Not-Track header was. They’re birds of a feather.

    29 votes
    1.
      rish

      Robots.txt was for their convenience. Do Not Track is for ours.

      4 votes
      1. skybrian

        It's for the convenience of both the website owner and the crawler. Editing a text file is easier than implementing a way to block crawlers, and web crawlers don't need heuristics to figure out when they're not welcome.

        (This is true of implementing most Internet standards; people use them because they're mutually beneficial.)
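
        For instance, a cooperating crawler needs only a few lines to honor the file. A minimal sketch using Python's standard-library parser (the URL and user-agent string here are placeholders):

            from urllib.robotparser import RobotFileParser

            # Fetch and parse the site's robots.txt once.
            rp = RobotFileParser("https://example.com/robots.txt")
            rp.read()

            # Before requesting a page, check whether this crawler is welcome.
            if rp.can_fetch("ExampleBot/1.0", "https://example.com/private/page"):
                ...  # go ahead and crawl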

        15 votes
  2. skybrian

    I'm not sure about it not being a legal document. Has any court ruled on that? Seems like if an emoji can be a legal contract under some circumstances, maybe robots.txt could be too?

    One argument that robots.txt doesn't matter legally is that no human reads it. (Seems like that could be fixed with legislation requiring bots to abide by it?)

    Anyway, there have always been smaller crawlers that ignore robots.txt. If you want to block them, you need to use other means. robots.txt just makes it easier to get cooperating crawlers to go away; you don't have to block them.
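
    Those other means usually live at the web server layer. A sketch of one common approach, assuming nginx and made-up bot names (this goes inside a server block, and only catches bots that send an honest User-Agent header):

        # Refuse crawlers that identify themselves but ignore robots.txt.
        if ($http_user_agent ~* "BadBot|SomeScraper") {
            return 403;
        }

    Determined scrapers just spoof the header, of course, which is why robots.txt was always about cooperation rather than enforcement.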

    I don't think this convention is breaking down much more than usual, considering that OpenAI has started respecting it and the other big crawlers do too. Will that change?

    For someone who wants to be indexed by Google's search engine but not have their web pages used to train Google's AI, it's a tough choice. But then again, that tension has always existed; Google has been using machine learning on crawled pages for a long time.
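
    For what it's worth, Google's answer to this is a separate robots.txt token: blocking Google-Extended is supposed to opt your pages out of training Google's generative AI products without affecting Googlebot or search indexing. A sketch:

        # Stay in the search index.
        User-agent: Googlebot
        Allow: /

        # Opt out of use as AI training data.
        User-agent: Google-Extended
        Disallow: /

    How much of Google's training pipeline that token actually covers is, of course, up to Google.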

    17 votes
  3. Fiachra

    It's fascinating watching this new wave of AIs open up every crack in an internet that was already creaking under the weight of platform decay. A lot like how COVID exposed every weakness in healthcare and welfare systems.

    This really seems like the moment that will eventually bookend this phase of the Internet; what the next one will look like, I have no idea.

    "The old web is dying, the new one struggles to be born. Now is the time of nonsense."

    16 votes
  4. patience_limited

    From the article:

    For three decades, a tiny text file has kept the internet from chaos. This text file has no particular legal or technical authority, and it’s not even particularly complicated. It represents a handshake deal between some of the earliest pioneers of the internet to respect each other’s wishes and build the internet in a way that benefitted everybody. It’s a mini constitution for the internet, written in code.

    It’s called robots.txt and is usually located at yourwebsite.com/robots.txt. That file allows anyone who runs a website — big or small, cooking blog or multinational corporation — to tell the web who’s allowed in and who isn’t. Which search engines can index your site? What archival projects can grab a version of your page and save it? Can competitors keep tabs on your pages for their own files? You get to decide and declare that to the web.

    It’s not a perfect system, but it works. Used to, anyway. For decades, the main focus of robots.txt was on search engines; you’d let them scrape your site and in exchange they’d promise to send people back to you. Now AI has changed the equation: companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all.

    The robots.txt file governs a give and take; AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.

    8 votes