I'll read the article, but I want to record my current thoughts on Anubis before that.
I get stuck on Anubis more than an average user should, which reflects my choice of browsers.
It's a bad experience, especially when the rejection page never loads, which AFAIK is not the intended result. Probably a bad mix of a browser about to get rejected, and older hardware?
What it doesn't reflect is Anubis breaking, as it assumes there will be false positives and decides "worth it".
I struggle to be too frustrated by this for a few reasons.
It's aware of the shortcoming, so the frustration is more when I find Anubis running in "surprising" places: think government and educational sites which IMO have no business blocking scrapers, let alone an "acceptable" number of false positives.
Mainly though, Cloudflare is just so, so, so much more annoying, and also blocks me.
On one hand, we have open source software from a single dev, and on the other we have Cloudflare. Even if Anubis had a worse impact on me (which it might if it sees Cloudflare's adoption rate) it would still be less annoying than Cloudflare (or hey, recaptcha and friends).
Effectively:
Anubis chooses deliberate false positives
Anubis is maybe not worse than competition from largest tech companies
I mostly get annoyed at who chooses to run Anubis, rather than it existing at all, even if I personally tend to think websites shouldn't waste everyone's time with doubtful countermeasures
I find it pretty funny when very official sites choose by default to display the mascot character, so I guess that part is working as intended
I have no idea if Anubis works for the intended purpose; I suspect some sites deploying it do not even understand what it is or how it works
For bonus points, it does seem like there is activity continuously trying to improve it and reduce the rejection rate (from my user experience, not from tracking the repo).
government and educational sites which IMO have no business blocking scrapers
Anubis will never block a well-behaved scraper, and won't even run the proof of work check in the first place. Any bot that correctly identifies itself in its user agent will entirely bypass the check. By design, it only runs the check against user agents containing the substring "Mozilla", which matches every major browser since Netscape.
The goal is to be able to differentiate bots from users and rate limit them accordingly, to avoid existential issues from hosting costs going way up under LLM companies' residential botnets that evade throttling.
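(For reference, a minimal sketch of the gate being described, not Anubis's actual code: the port number and response bodies here are placeholders.)

```javascript
// Illustrative only: the challenge is gated on the User-Agent containing
// "Mozilla", so a bot that honestly identifies itself never even sees the
// proof-of-work check.
const http = require("node:http");

http.createServer((req, res) => {
  const ua = req.headers["user-agent"] || "";

  if (!ua.includes("Mozilla")) {
    // Self-identified bot or non-browser client: serve content directly,
    // optionally behind ordinary rate limiting.
    res.end("normal page\n");
    return;
  }

  // Browser-like client: this is where the challenge page would be served.
  res.end("challenge page\n");
}).listen(8080);
```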
For bonus points, it does seem like there is activity continuously trying to improve it and reduce the rejection rate (from my user experience, not from tracking the repo).
I find this observation to be the opposite of my experience. Alternatively, I am just acting a bit more botlike than usual (you know, those bots that check vaguely technical-sounding titles from clicks through bearblog, the most common bot).
It's aware of the shortcoming, so the frustration is more when I find Anubis running in "surprising" places: think government and educational sites which IMO have no business blocking scrapers, let alone an "acceptable" number of false positives.
This is probably my biggest frustration. I get it if I am using an esoteric browser that doesn't handle your chromium-specific functionality on a government website. It's annoying, but it's fine, and it's kind of a "me" problem to deal with. It's a whole different problem if I go to load the obligatory government website and get a "you are blocked, and we're not telling you why!" message with a mascot character.
Out of curiosity, what browser and version of that browser are you running? And also what sort of extensions?
The biggest surprise to me here is that LLM bots apparently aren't executing JS - the web depends on it so heavily nowadays that I'm surprised they can get a decent content scrape on a lot of sites without actually rendering the page and letting the XHRs complete.
Out of interest, @fxgn, did you verify that the actual JS execution is the blocking part? I'm wondering if they're running the JS but perhaps deliberately blocking cookies (perhaps cookies that aren't set by an HTTP request) and/or navigation events, rather than JS as a whole. Seems like a good solution for now even if that is the case, but it'd be good to understand the mechanics too!
I'm wondering if they're running the JS but perhaps deliberately blocking cookies
I initially tried a simpler solution, just setting a cookie with a Set-Cookie header, and that didn't stop the bots. It is possible that they're only blocking cookies set through JS - I didn't verify that (I don't know why they would do that though). Those bots are unfortunately a black box and it's hard to test what does and doesn't stop them, because the only way is to change the config and then wait for a while to check if any requests come through.
Cheers, that's good to know - you're absolutely right about them being a black box, so it's really helpful to piece together details like this when people are able to test them.
Anubis has been frustrating me since the beginning, and it seems to be getting worse. I have been blocked by Anubis on five different websites this week from my phone.
I stick with my original statement on this issue. If you care enough to block legitimate users from accessing your site, just take the content offline.
This very basic strategy seems pretty reasonable to me. A cookie set by JavaScript and a reload seems fine. It isn't checking to see if you are running a specific browser or anything else.
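(As an illustration of that basic strategy, here is a rough sketch; the cookie name "js_check" and the max-age are invented, and the article's actual setup may differ.)

```javascript
// Rough sketch of "a cookie set by JavaScript and a reload".
const http = require("node:http");

const INTERSTITIAL = `<!doctype html>
<p>Checking your browser...</p>
<script>
  document.cookie = "js_check=1; path=/; max-age=86400; SameSite=Lax";
  location.reload();
</script>`;

http.createServer((req, res) => {
  const cookies = req.headers.cookie || "";

  if (cookies.includes("js_check=1")) {
    // Cookie present: the client ran the script once, serve content as usual.
    res.end("actual content\n");
  } else {
    // No cookie yet: serve the tiny interstitial. A scraper that never runs
    // the script (or drops the cookie) only ever sees this page.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(INTERSTITIAL);
  }
}).listen(8080);
```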
It’s a tough battle to fight.
I actually share your opinion in principle: seemingly “low compute” users shouldn’t be excluded on the basis of being on old/slow hardware or using an “esoteric” browser or browser configuration without JS. But then I see blog posts and post-mortems from sites and services where their hosting costs alone suddenly exploded due to scraping, let alone potential downtime, and I’m far more understanding of their point of view again.
I’m definitely not an expert, but I’m guessing other solutions haven’t taken off because they’re too easily circumvented [once scraper bots have JS capabilities]?
This is a tempting solution, and I'm sure the same operation in Nginx (my reverse proxy / web server of choice) is just as simple to implement, but I do worry about the effect this will have on users of noscript-type add-ons (also clearly an issue when using Anubis).
But then, blocking all scripts is a rather extreme approach, and if you do this then maybe you should expect swathes of the web to be inaccessible to you. I entirely understand the dislike of large third-party hosted frameworks, and the concerns about the possibilities of malicious JS, but I am all for minimalist same-site custom JS solutions. Vanilla JS has become quite powerful over the years, and can do some pretty cool things.
Well, since the verification here is only done server side, the JS only needs to be run once to set the cookie. You can even adjust the redirect page to show instructions to NoJS users on how to manually set the cookie in the dev tools, that way they won't have to enable JavaScript at all.
The example in the post is not meant to be a full-featured production solution, for example, you probably also need to add an exception for search engine scrapers if you want your site to be indexed correctly.
Yes! The idea of providing instructions to set the verification cookie to noscript users, who are more likely to be technically-knowledgeable enough to perform said steps, did cross my mind and it seems a reasonable solution to me.
As for search engine scrapers, I imagine you could just pick scrapers who are well behaved / publish the IP ranges that they scrape from (I'd rather Google not scrape my sites, to be honest), and carve out some exceptions to this scheme. I believe Kagi publish enough data for this.
Yes! The idea of providing instructions to set the verification cookie to noscript users
You can probably even automate this by serving a page with a button that would send an HTML form request to the server, which would then set the cookie using just an HTTP header. Haven't tried that, but in theory it should be pretty easy to implement.
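(A rough, untested sketch of what that form-plus-Set-Cookie fallback could look like; the /verify route and the "js_check" cookie name are invented for illustration.)

```javascript
// No-JS fallback: a plain HTML form posts to the server, which sets the
// verification cookie via a Set-Cookie header and redirects back.
const http = require("node:http");

http.createServer((req, res) => {
  if (req.method === "POST" && req.url === "/verify") {
    // Set the cookie over plain HTTP and bounce the visitor back to the site.
    res.writeHead(303, {
      "Set-Cookie": "js_check=1; Path=/; Max-Age=86400; SameSite=Lax",
      Location: "/",
    });
    res.end();
    return;
  }

  const cookies = req.headers.cookie || "";
  if (!cookies.includes("js_check=1")) {
    // Unverified visitor: show a button instead of (or alongside) the JS check.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end('<form method="post" action="/verify"><button>Continue to site</button></form>');
    return;
  }

  res.end("actual content\n");
}).listen(8080);
```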
As for search engine scrapers, I imagine you could just pick scrapers who are well behaved / publish the IP ranges that they scrape from
Yeah, although this would unfortunately mean excluding smaller web crawlers from accessing your site. Search scrapers are generally more well-behaved than LLM scrapers and use a consistent User-Agent. Anubis itself actually has a bunch of User-Agent blocking presets with different levels of strictness - maybe you can just use those?
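(Sketching the kind of search-crawler exception being discussed; the patterns below are examples rather than Anubis's actual presets, and trusting the User-Agent alone is weaker than verifying the crawlers' published IP ranges.)

```javascript
// Example only: a crude User-Agent allowlist for search crawlers.
const SEARCH_CRAWLERS = [/Googlebot/i, /bingbot/i, /DuckDuckBot/i, /Kagi/i];

function isSearchCrawler(userAgent) {
  return SEARCH_CRAWLERS.some((pattern) => pattern.test(userAgent || ""));
}

// In the handlers sketched above, such requests would skip the cookie check:
//   if (isSearchCrawler(req.headers["user-agent"])) { /* serve content directly */ }
```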
Anubis rarely gives me issues tbh; when it does, it's with a VPN and/or hardened browsers.
As someone who doesn’t develop web stuff but comes across Anubis from time to time, what else is there to use to avoid LLM bots?