This will also lock out users who have JavaScript disabled, prevent your server from being indexed in search engines, require users to have HTTP cookies enabled, and require users to spend time solving the proof-of-work challenge.
The most hilarious part about how Anubis is implemented is that it triggers challenges for every request with a User-Agent containing "Mozilla".
This does mean that users using text-only browsers or older machines where they are unable to update their browser will be locked out of services protected by Anubis. This is a tradeoff that I am not happy about, but it is the world we live in now.
This is what I expected and also ridiculous in my opinion. If you're going to go this far out of your way to prevent people from accessing your stuff, just write it in a notebook and skip the website.
I think your contempt is misplaced. Anubis is designed to prevent malicious, infrastructure-crippling abuse by AI crawlers, like what SourceHut is experiencing right now. For some services, these abusive crawlers can inflict such a burden that they make hosting costs untenable. They can quite literally be an existential threat to some sites.
Anubis prevents this kind of abuse, and is being adopted in place of other, more annoying and restrictive measures, such as Cloudflare's "verify you are human" check or requiring users to make accounts and log in before being given access.
Freedesktop.org and Gnome have been running Anubis for their GitLab instances for a while now, and it has largely prevented the abusive AI crawling they were experiencing. Before they implemented Anubis, there was talk of placing these instances behind a login wall, where they would be much less accessible than they are now.
So again, I think your contempt here is misplaced. Anubis is a valid solution to a problem caused by bad actors on the web and I think that's who you should direct your contempt at, not the indie developer who built a lightweight, free, and open-source solution, on their own time.
Anubis isn't about preventing access to data, it's about preventing abuse of web norms by AI crawlers.
And just for reference, Anubis takes less than one second to complete when I visit the test link on my Dell Chromebook 13 from 2016. I only have to pass the test once, as long as the site's cookies remain in my browser. In terms of practical access, Anubis is the lesser evil.
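As a rough illustration of how that "pass once, then keep a cookie" flow tends to work in general: the server hands out a signed, expiring token after the challenge is solved and only re-checks the signature afterwards. This is a generic sketch with an assumed HMAC-signed token; Anubis's actual token format and lifetime may differ.

```python
# Generic sketch of a "solved the challenge once, skip it for a week" token:
# an HMAC-signed expiry timestamp stored in a cookie. The secret, lifetime,
# and token format are assumptions for illustration.
import hashlib, hmac, time

SECRET = b"per-deployment-secret"   # would be generated server-side
LIFETIME = 7 * 24 * 60 * 60         # e.g. one week

def issue_token(now=None):
    """Called once, after the proof-of-work challenge has been solved."""
    expires = str(int((now or time.time()) + LIFETIME))
    sig = hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"       # stored in the browser as a cookie

def token_valid(token, now=None):
    """Cheap check on every later request; no new challenge if it passes."""
    try:
        expires, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expires) > (now or time.time())

cookie = issue_token()
assert token_valid(cookie)          # subsequent requests skip the proof of work
```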
Can I ask why you find it ridiculous? For regular users there is actually no active action involved; it is basically just a cryptographic puzzle the browser will automatically solve before you get to see the data. You can argue that it takes up unnecessary compute power on your computer, but, to be honest, if you have JavaScript enabled there are many other ways that is the case, many of them for much more ridiculous reasons.
It also isn't a new principle; here is a post from a year ago on Tildes using hashcash to deal with signup spam.
Can I ask why you find it ridiculous?
I copied the relevant parts in my original post.
By implementing this, you're willingly throwing out users who have underpowered machines or are in situations where they can't (or won't, for any number of reasons) use a modern browser or allow arbitrary JavaScript to run. By going this route you are already accepting that you won't be found by web crawlers, so you may as well use a lower-tech solution, like a content-sharing token or something else that doesn't lock out those users.
It's a solution looking for a problem, not the other way around.
You can’t allow search crawlers and also block AI scrapers. If one gets in, they both do — and the inverse. If blocking scrapers is such a high priority for someone, they’re going to have to weigh that against showing up on Google. I assume most will land in favor of discoverability, but not everyone. For those outliers, the nuclear option remains on the table.
Of course, if this sort of thing becomes widespread, the scrapers will just start executing JavaScript. The proof-of-work check is clever but if bots start executing it, Anubis will be little more than a speed bump. It’s yet another digital arms race.
I get the purpose of Anubis for personal sites, but blocking all search engines is a death knell for discoverability. I guess it further empowers social media and other link trees?
prevent your server from being indexed in search engines
Actually, it doesn't seem like (all) search engines are blocked (anymore):
https://github.com/TecharoHQ/anubis/blob/main/cmd/anubis/botPolicies.json
And here's a pull request that added an allowance for Kagi's crawler:
https://github.com/TecharoHQ/anubis/pull/44
Seems like that whitelist is just a regex on the user agent, which doesn’t seem like it’ll scale. Scraping bots will just start calling themselves Kagi’s or Google’s crawlers.
Seems like at minimum you’d need some kind of signature system to ensure the other party really is Google or Kagi, but it’s unknown whether search engines would care enough to play ball.
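For what it's worth, there is already a low-tech version of that handshake for some crawlers: Google documents forward-confirmed reverse DNS as the way to verify that a request claiming to be Googlebot really comes from them. A rough Python sketch of that kind of check (the domain suffixes and the sample IP are illustrative, and other crawlers would each need their own published records):

```python
# Forward-confirmed reverse DNS: reverse-resolve the client IP, check the
# domain, then resolve that hostname forward again and confirm it maps back
# to the same IP. Suffixes and sample IP below are illustrative.
import socket

def is_verified_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        if not hostname.endswith(allowed_suffixes):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]    # forward confirmation
    except OSError:
        return False

# A scraper that merely puts "Googlebot" in its User-Agent string from some
# random IP fails this check, because it can't control Google's DNS records.
print(is_verified_crawler("66.249.66.1"))
```

The catch is the one raised above: every search engine has to publish and maintain records like this, and site operators have to bother checking them.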
Does this mean Kagi/small search engines might have a leg up on the big players if this becomes widespread?
Fantasizing now, but could this mean Google might actually push for an end to the insane LLM crawling levels if enough sites were to implement something (that also remains effective in blocking) like Anubis?
Would this also effectively block archival websites like the Wayback Machine on archive.org from working properly?
Absolutely, yes. Archive.org includes Mozilla in their user agent and would trigger this.
This makes me wonder if there's a way to force those bots, at least the ones with JS enabled, to do something meaningful. Like if rather than giving them a meaningless challenge they were fed something to execute from one of those distributed compute research projects. I assume it wouldn't work, but sounds fun.
Like if rather than giving them a meaningless challenge they were fed something to execute from one of those distributed compute research projects. I assume it wouldn't work, but sounds fun.
This has been discussed before in the context of cryptocurrencies. The theoretical problem is that, if the work isn't inherently worthless, then someone could stand to benefit from it, and so it would have less value as a deterrent.
I'm not sure I completely buy that line of reasoning. The work being of value to someone doesn't make it of value to the entity executing it. In the most extreme example, if the challenge involved mining Bitcoin to my wallet, these LLM scraping companies really wouldn't want to spend their compute power doing that.
To me it seems like the issue is finding a problem that fulfills all of:
Doesn't create controversial incentives like mining crypto would
Is something the scraping companies don't actually want to be paying for out of their own bottom line, which is probably just about anything they don't directly profit from
Meets the technical requirements of being the right level of difficulty to solve while also being trivial to verify
In theory, yes. In practice, the problem is that you can't really tell if it is actually a bot or a regular user. And you really don't want to be caught putting normal users through this.
Although the whole point of projects like this one is that they already are? Like it's already making users solve problems, just meaningless ones. It's just declaring that to be fine because no individual is hammering the server so heavily that the small proof-of-work checks add up to a noticeable impact on their device.
I had actually originally considered writing my original comment about crypto mining to try to recoup some of the expense of serving all these useless bot requests, but since random users would get sucked in I figured that is likely to become a controversy in a way that contributing to cancer research or something wouldn't.
Like it's already making users solve problems, just meaningless ones.
Well yeah, a one-time operation which, importantly, can be validated almost instantly, so the time the user is waiting for the data is as short as possible. That is quite different from continuously asking for compute power in the way you'd need to for the sort of thing you propose.
I think it's actually practical. The idea would be to make the computer perform the Bitcoin hash, but with a smaller number of required zeros, so that it could be performed in a predictable period of time, but still with a chance of also satisfying the current Bitcoin requirement.
For example, assume Bitcoin currently needs 16 zeros -- the browser has to provide a solution with, say, 8 zeros. That's probably tolerable for a browser to calculate in a reasonable time period, is also instantly verifiable, and has a 1/256 chance of hitting a 16-zero solution as well (each additional required zero bit halves the odds, so 8 extra zeros works out to 1 in 2^8 = 256).
However, I firmly believe that crypto in every one of its forms is basically the worst thing to ever happen to technology, so I definitely wouldn't support this.
Just to be clear, when you say "crypto", you're saying cryptocurrency, and not cryptography, right?
Yes, cryptocurrency.
Might some of that just be implementation detail? It already seems that the thing that is preventing recompute for normal users is that the client stores a token that's valid for a week so that they don't get new challenges. Surely these "temp auth"-style tokens could be issued in a variety of other ways that make it problem-agnostic.
Admittedly I don't know much about the validation side of distributed research projects so maybe the validation is far too heavy. Ideally the validation is trivial and the difficulty scalable. Those are both properties of crypto mining, but I was trying to come up with a less controversial substitute even though it probably wouldn't work as well in practice.
https://github.com/Xe/x/tree/master/cmd/anubis
See also: Nepenthes: a tarpit intended to catch web crawlers
See here for a similar discussion.
Can someone ELI5 or maybe 10? Probably the whole article but mostly this part:
A majority of the AI scrapers are not well-behaved, and they will ignore your robots.txt, ignore your User-Agent blocks, and ignore your X-Robots-Tag headers. They will scrape your site until it falls over, and then they will scrape it some more. They will click every link on every link on every link viewing the same pages over and over and over and over. Some of them will even click on the same link multiple times in the same second. It's madness and unsustainable.
I want to understand, but my tech literacy is not high enough. Thank you.
It is based on similar principles as hashcash. What is hashcash, you ask? Great question, I was equally confused a year ago when it came up in a different article and wrote out my understanding of it. For your convenience, here is that comment as a quote.
It took me a little while to fully understand what is being done, as the blog post, repo, and even the wiki page all seem to suffer from a bit of a "curse of knowledge" and skip over the basics.
Basically, when a user fills in their data and hits the "signup" button something along the following lines happens:
The user data isn't actually sent to the server straight away.
Instead the client is requested to first hash (part of) the user data.
Normally, that isn't as much of an issue. But the client has gotten an extra challenge: "The hash needs to have X amount of leading zeros".
Hashing with the same data will always result in the same hash.
So to get a new hash you need to add other data. To get to a specific amount of leading zeros in your hash outcome, you need a specific bit of data. This is called the nonce.
The client does not know what that specific bit of data is, so it has to calculate multiple hashes adding different bits of data until it finally gets a hash with the correct amount of leading zeros.
Once the client finds a combination of user data and nonce resulting in the right amount of leading zeros it sends all three (user data, hash and nonce) to the server.
The server is provided the nonce on a silver platter, so it can very easily verify the hash.
For a normal user on a normal computer the extra compute power needed to do the hashing isn't really an issue. But for mass spammers the added computational power at their scale makes it really expensive.
I still feel like I wrote it out like "draw the rest of the owl". But hopefully it does clear up things a bit for people after me who, just like me, know just enough about this sort of thing to want to know how it works but not quite enough to make it effortless :P
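To make the steps above concrete, here is a minimal sketch of the same hashcash-style loop in Python (the payload, hash function, and difficulty are illustrative; real schemes add details like timestamps and server-chosen randomness):

```python
# Minimal proof-of-work sketch of the hashcash idea described above.
# The client grinds through nonces until the SHA-256 hash of data+nonce
# starts with the required number of zero hex digits; the server re-hashes
# once to verify. Difficulty and payload are illustrative values.
import hashlib

DIFFICULTY = 4  # number of leading zero hex digits required

def solve(data):
    nonce = 0
    while not hashlib.sha256(f"{data}{nonce}".encode()).hexdigest().startswith("0" * DIFFICULTY):
        nonce += 1          # expensive part: many attempts on the client side
    return nonce

def verify(data, nonce):
    # cheap part: the server only has to hash once
    return hashlib.sha256(f"{data}{nonce}".encode()).hexdigest().startswith("0" * DIFFICULTY)

nonce = solve("signup:alice@example.com")
print(nonce, verify("signup:alice@example.com", nonce))  # prints the nonce and True
```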
Maybe even simpler: before a client is served a page it is asked to solve a little puzzle. For normal browsers this isn't really an issue; for AI agents it is, as they often can't do more than just scrape data. Or, if they can execute code (JavaScript), it is often in a limited fashion, at least according to the author.
There are some conventions for how bots (which in practice just means HTTP requests not explicitly made by humans) can access your site. Robots.txt is a simple text file that lists directories in the site hierarchy and whether or not bots (or particular bots) can access them. The other two are variants of this.
These are just convention, though. There's nothing to stop the bot from ignoring them. The fundamental trouble is that it's very hard to distinguish an HTTP request from a human and an HTTP request from an automated program of some kind.
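For a concrete picture, robots.txt is just a short text file of path rules, and whether it does anything depends entirely on the client choosing to ask first. A small sketch using Python's standard robotparser (the rules and URLs are made up):

```python
# robots.txt is purely advisory: a polite client consults it before fetching,
# an impolite one simply doesn't. The rules and URLs here are made up.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",        # applies to every bot...
    "Disallow: /private/",  # ...please stay out of /private/
])

print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/blog/post"))     # True
# Nothing enforces this: a crawler can skip the check and request the URL anyway.
```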
I see, so it’s like having a "no soliciting" sign outside your door and someone knocking anyway? There are no rules or enforcement as it stands?
That’s correct. I’m honestly surprised it ever gained any traction because it has no teeth, but it’s become a de facto standard, leading a lot of people to assume it is an actual functioning control mechanism.
Actually, a poorly written robots.txt can steer bad actors toward sensitive content you intended to hide from them.