I'll read the article, but I want to record my current thoughts on Anubis before that.
I get stuck on Anubis more than an average user should, which reflects my choice of browsers.
It's a bad experience, especially when the rejection page never loads, which AFAIK is not the intended result. Probably a bad mix of a browser about to get rejected, and older hardware?
What it doesn't reflect is Anubis breaking, as it assumes there will be false positives and decides "worth it".
I struggle to be too frustrated by this for a few reasons.
It's aware of the shortcoming, so the frustration is more when I find Anubis running in "surprising" places: think government and educational sites which IMO have no business blocking scrapers, let alone an "acceptable" number of false positives.
Mainly though, Cloudflare is just so, so, so much more annoying, and also blocks me.
On one hand, we have open source software from a single dev, and on the other we have Cloudflare. Even if Anubis had a worse impact on me (which it might if it sees Cloudflare's adoption rate) it would still be less annoying than Cloudflare (or hey, recaptcha and friends).
Effectively:
Anubis chooses deliberate false positives
Anubis is maybe not worse than competition from largest tech companies
I mostly get annoyed at who chooses to run Anubis, rather than it existing at all, even if I personally tend to think websites shouldn't waste everyone's time with doubtful countermeasures
I find it pretty funny when very official sites choose by default to display the mascot character, so I guess that part is working as intended
I have no idea if Anubis works for the intended purpose; I suspect some sites deploying it do not even understand what it is or how it works
For bonus points, it does seem like there is activity continuously trying to improve it and reduce the rejection rate (from my user experience, not from tracking the repo).
government and educational sites which IMO have no business blocking scrapers
Anubis will never block a well-behaved scraper, and won't even run the proof of work check in the first place. Any bot that correctly identifies itself in its user agent will entirely bypass the check. By design, it only runs the check against user agents containing the substring "Mozilla", which matches every major browser since Netscape.
The goal is to be able to differentiate bots from users and rate limit them accordingly, to avoid existential issues from hosting costs going way up under LLM companies' residential botnets that evade throttling.
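(For reference, a minimal sketch of the gate being described, not Anubis's actual code: the port number and response bodies here are placeholders.)

```javascript
// Illustrative only: the challenge is gated on the User-Agent containing
// "Mozilla", so a bot that honestly identifies itself never even sees the
// proof-of-work check.
const http = require("node:http");

http.createServer((req, res) => {
  const ua = req.headers["user-agent"] || "";

  if (!ua.includes("Mozilla")) {
    // Self-identified bot or non-browser client: serve content directly,
    // optionally behind ordinary rate limiting.
    res.end("normal page\n");
    return;
  }

  // Browser-like client: this is where the challenge page would be served.
  res.end("challenge page\n");
}).listen(8080);
```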
For bonus points, it does seem like there is activity continuously trying to improve it and reduce the rejection rate (from my user experience, not from tracking the repo).
I find this observation to be the opposite of my experience. Alternatively, I am just acting a bit more botlike than usual (you know, those bots that check vaguely technical-sounding titles from clicks through bearblog, the most common bot).
It's aware of the shortcoming, so the frustration is more when I find Anubis running in "surprising" places: think government and educational sites which IMO have no business blocking scrapers, let alone an "acceptable" number of false positives.
This is probably my biggest frustration. I get it if I am using an esoteric browser that doesn't handle your chromium-specific functionality on a government website. It's annoying, but it's fine, and it's kind of a "me" problem to deal with. It's a whole different problem if I go to load the obligatory government website and get a "you are blocked, and we're not telling you why!" message with a mascot character.
Out of curiosity, what browser and version of that browser are you running? And also what sort of extensions?
The biggest surprise to me here is that LLM bots apparently aren't executing JS - the web depends on it so heavily nowadays that I'm surprised they can get a decent content scrape on a lot of sites without actually rendering the page and letting the XHRs complete.
Out of interest, @fxgn, did you verify that the actual JS execution is the blocking part? I'm wondering if they're running the JS but perhaps deliberately blocking cookies (perhaps cookies that aren't set by an HTTP request) and/or navigation events, rather than JS as a whole. Seems like a good solution for now even if that is the case, but it'd be good to understand the mechanics too!
I'm wondering if they're running the JS but perhaps deliberately blocking cookies
I initially tried a simpler solution, just setting a cookie with a Set-Cookie header, and that didn't stop the bots. It is possible that they're only blocking cookies set through JS - I didn't verify that (I don't know why they would do that though). Those bots are unfortunately a black box and it's hard to test what does and doesn't stop them, because the only way is to change the config and then wait for a while to check if any requests come through.
Cheers, that's good to know - you're absolutely right about them being a black box, so it's really helpful to piece together details like this when people are able to test them.
Anubis has been frustrating me since the beginning, and it seems to be getting worse. I have been blocked by Anubis on five different websites this week from my phone.
I stick with my original statement on this issue. If you care enough to block legitimate users from accessing your site, just take the content offline.
This very basic strategy seems pretty reasonable to me. A cookie set by JavaScript and a reload seems fine. It isn't checking to see if you are running a specific browser or anything else.
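(As an illustration of that basic strategy, here is a rough sketch; the cookie name "js_check" and the max-age are invented, and the article's actual setup may differ.)

```javascript
// Rough sketch of "a cookie set by JavaScript and a reload".
const http = require("node:http");

const INTERSTITIAL = `<!doctype html>
<p>Checking your browser...</p>
<script>
  document.cookie = "js_check=1; path=/; max-age=86400; SameSite=Lax";
  location.reload();
</script>`;

http.createServer((req, res) => {
  const cookies = req.headers.cookie || "";

  if (cookies.includes("js_check=1")) {
    // Cookie present: the client ran the script once, serve content as usual.
    res.end("actual content\n");
  } else {
    // No cookie yet: serve the tiny interstitial. A scraper that never runs
    // the script (or drops the cookie) only ever sees this page.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(INTERSTITIAL);
  }
}).listen(8080);
```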
It’s a tough battle to fight.
I actually share your opinion in principle: seemingly “low compute” users shouldn’t be excluded on the basis of being on old/slow hardware or using an “esoteric” browser or browser configuration without JS. But then I see blog posts and post-mortems from sites and services where their hosting costs alone suddenly exploded due to scraping, let alone potential downtime, and I’m far more understanding of their point of view again.
I’m definitely not an expert, but I’m guessing other solutions haven’t taken off because they’re too easily circumvented [once scraper bots have JS capabilities]?
This is a tempting solution, and I'm sure the same operation in Nginx (my reverse proxy / web server of choice) is just as simple to implement, but I do worry about the effect this will have on users of noscript-type add-ons (also clearly an issue when using Anubis).
But then, blocking all scripts is a rather extreme approach, and if you do this then maybe you should expect swathes of the web to be inaccessible to you. I entirely understand the dislike of large third-party hosted frameworks, and the concerns about the possibilities of malicious JS, but I am all for minimalist same-site custom JS solutions. Vanilla JS has become quite powerful over the years, and can do some pretty cool things.
Well, since the verification here is only done server side, the JS only needs to be run once to set the cookie. You can even adjust the redirect page to show instructions to NoJS users on how to manually set the cookie in the dev tools, that way they won't have to enable JavaScript at all.
The example in the post is not meant to be a full-featured production solution, for example, you probably also need to add an exception for search engine scrapers if you want your site to be indexed correctly.
Yes! The idea of providing instructions to set the verification cookie to noscript users, who are more likely to be technically-knowledgeable enough to perform said steps, did cross my mind and it seems a reasonable solution to me.
As for search engine scrapers, I imagine you could just pick scrapers who are well behaved / publish the IP ranges that they scrape from (I'd rather Google not scrape my sites, to be honest), and carve out some exceptions to this scheme. I believe Kagi publish enough data for this.
Yes! The idea of providing instructions to set the verification cookie to noscript users
You can probably even automate this by serving a page with a button that would send an HTML form request to the server, which would then set the cookie using just an HTTP header. Haven't tried that, but in theory it should be pretty easy to implement.
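(A rough, untested sketch of what that form-plus-Set-Cookie fallback could look like; the /verify route and the "js_check" cookie name are invented for illustration.)

```javascript
// No-JS fallback: a plain HTML form posts to the server, which sets the
// verification cookie via a Set-Cookie header and redirects back.
const http = require("node:http");

http.createServer((req, res) => {
  if (req.method === "POST" && req.url === "/verify") {
    // Set the cookie over plain HTTP and bounce the visitor back to the site.
    res.writeHead(303, {
      "Set-Cookie": "js_check=1; Path=/; Max-Age=86400; SameSite=Lax",
      Location: "/",
    });
    res.end();
    return;
  }

  const cookies = req.headers.cookie || "";
  if (!cookies.includes("js_check=1")) {
    // Unverified visitor: show a button instead of (or alongside) the JS check.
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end('<form method="post" action="/verify"><button>Continue to site</button></form>');
    return;
  }

  res.end("actual content\n");
}).listen(8080);
```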
As for search engine scrapers, I imagine you could just pick scrapers who are well behaved / publish the IP ranges that they scrape from
Yeah, although this would unfortunately mean excluding smaller web crawlers from accessing your site. Search scrapers are generally more well-behaved than LLM scrapers and use a consistent User-Agent. Anubis itself actually has a bunch of User-Agent blocking presets with different levels of strictness - maybe you can just use those?
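(Sketching the kind of search-crawler exception being discussed; the patterns below are examples rather than Anubis's actual presets, and trusting the User-Agent alone is weaker than verifying the crawlers' published IP ranges.)

```javascript
// Example only: a crude User-Agent allowlist for search crawlers.
const SEARCH_CRAWLERS = [/Googlebot/i, /bingbot/i, /DuckDuckBot/i, /Kagi/i];

function isSearchCrawler(userAgent) {
  return SEARCH_CRAWLERS.some((pattern) => pattern.test(userAgent || ""));
}

// In the handlers sketched above, such requests would skip the cookie check:
//   if (isSearchCrawler(req.headers["user-agent"])) { /* serve content directly */ }
```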
Anubis rarely gives me issues tbh; when it does, it's with a VPN and/or hardened browsers.
As someone who doesn’t develop web stuff but comes across Anubis from time to time, what else is there to use to avoid LLM bots?