I am not sure that I can go along with the title. It does showcase that the problem is real and that it is affecting a lot of wildly different organizations, organizations desperate enough to start looking for novel solutions to this mass assault by scraper bots.
But without traffic numbers it is difficult to say whether Anubis works and to what degree. That is difficult to prove anyway, as one of the issues is that a lot of these bots don't respect robots.txt and pretend to be normal browsers.
The original premise behind Anubis is the assumption that these bots have limited JavaScript capabilities and always seem to send a User-Agent containing "Mozilla". Since then there has also been an update detailing some more tuning and other detection methods.
I do think Anubis works to some degree; I am just not sure to what degree exactly. I also think that once it gains traction, a lot of these bots will be adjusted to account for the roadblocks Anubis puts in their way, which will lessen its effectiveness until further adjustments can be made. In other words, it will probably turn into an arms race.
Though Anubis has one thing going for it: the compute power these scrapers need to do the hashing asked of them makes them more expensive to run. At the very least that will deter some of these bots. Though, as I said in a previous thread about Anubis, as long as there are insane amounts of VC money involved, a lot of these companies will just throw more compute at the issue anyway.
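To make the cost concrete, here is a minimal sketch of a hashcash-style proof-of-work of the kind Anubis relies on. The difficulty value and function names are made up for illustration rather than taken from Anubis (which runs its solver in the visitor's browser via JavaScript); the point is that finding a nonce costs the client real CPU, while checking the answer costs the server a single hash.

```go
// Illustrative hashcash-style proof-of-work; parameters are hypothetical.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// solve brute-forces a nonce until SHA-256(challenge + nonce) starts with
// `difficulty` zero hex digits. Expected cost grows roughly 16x per extra digit.
func solve(challenge string, difficulty int) (nonce uint64, digest string) {
	prefix := strings.Repeat("0", difficulty)
	for {
		sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
		digest = hex.EncodeToString(sum[:])
		if strings.HasPrefix(digest, prefix) {
			return nonce, digest
		}
		nonce++
	}
}

// verify is cheap for the server: one hash, no loop.
func verify(challenge string, nonce uint64, difficulty int) bool {
	sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
	return strings.HasPrefix(hex.EncodeToString(sum[:]), strings.Repeat("0", difficulty))
}

func main() {
	nonce, digest := solve("example-challenge", 4)
	fmt.Println(nonce, digest, verify("example-challenge", nonce, 4))
}
```

A single visitor barely notices a fraction of a second of hashing, but a scraper fetching millions of pages pays that cost millions of times, which is the whole bet.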
I also immediately thought "arms race", much like the one fought by any anti-abuse organization for a major web app.
Ironic. It may then have the unintended net effect of creating an even more rarefied market for AI training, leading to even fewer players at the AI table and moving us even further toward AI monopolies than we already are.
This was the first one I could find, but a couple of the big groups that have deployed Anubis have posted bandwidth-usage graphs showing a huge drop-off with Anubis in place.
The original premise is just to increase the cost of being badly behaved enough that it's no longer worth it. Empirically, the problematic scrapers masquerade as browsers and rotate through large numbers of (largely residential¹) IP addresses. If you clearly identify yourself with a bot user agent, or maintain a consistent IP address, then you mostly bypass the expense of Anubis's checks but can easily be throttled by traditional means.
The mechanism is also not really subject to an arms race; it is, to bend the metaphor quite a lot, a nuclear option. There is no getting around doing the work required for the proof-of-work.
¹ Yes, this probably means that a bunch of these are being routed through people's hacked smart toasters. Remember, the "S" in "IoT" is for "security".
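As a rough sketch of that decision flow (not Anubis's actual code; the proof check, cookie handling, and rate limiter below are placeholder assumptions): a self-identified bot skips the expensive challenge and just gets throttled by IP, while anything claiming to be a browser has to show a solved proof-of-work first.

```go
// Hypothetical gatekeeper middleware illustrating the model described above.
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// gatekeeper wraps a handler. limiter and hasValidProof are stand-ins for a
// per-IP rate limiter and a check for a previously solved proof-of-work.
func gatekeeper(next http.Handler, limiter func(ip string) bool, hasValidProof func(*http.Request) bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.UserAgent(), "Mozilla") {
			// Self-identified bot: no proof-of-work, but traditional per-IP throttling applies.
			if !limiter(r.RemoteAddr) {
				http.Error(w, "rate limited", http.StatusTooManyRequests)
				return
			}
			next.ServeHTTP(w, r)
			return
		}
		// Claims to be a browser: require a solved challenge before serving content.
		if !hasValidProof(r) {
			http.Redirect(w, r, "/challenge", http.StatusTemporaryRedirect)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	site := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "ok") })
	allowAll := func(string) bool { return true }            // placeholder limiter
	noProofYet := func(*http.Request) bool { return false }  // placeholder proof check
	http.ListenAndServe(":8080", gatekeeper(site, allowAll, noProofYet))
}
```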
There is, or was, depending on what changes they made. If you read the original post I linked, it clearly states it only acts based on a few factors, like the user agent, meaning a simple user agent change already gets bots around it. Going by em-dash's comment here, this still seems to be the case.
I think you are possibly confusing the original premise of hashcash with the premise of Anubis.
Yes, but at that point they're giving a strong signal that they are indeed very likely a bot.
This deals with badly behaving bots that try to pretend they are users. How you deal with bots that say they're bots is an implementation detail and more of a personal choice left to you.
What Rectangle said. If a bot sends a non-browser useragent, then server operators can ban away by useragent without worrying they'll hit legitimate browser-based users.
The useragent check is also obviously incidental; it would be trivial to remove. It's a harm mitigation technique, not a fundamental part of the protection model.
I just changed my browser's user agent in response to reading this, sighing and facepalming the whole time. I expect if Anubis catches on enough to be even slightly problematic to them, they'll do the same, and then I'll have to sigh and facepalm some more while I also find a way to evade whatever happens next.
Wild! I've known Xe dating back, wow, almost 10 years, to when we were both working at the same company. She's built quite the rep for herself: routine links from the evil orange hellmouth site (something something anti-capitalism and anti-billionaire) and now this.
The association with a specific IP address has an interesting effect when on the move.
I'm on a train, so my IP is constantly changing and I have to do the proof of work on every single page load...
That is the problem with this solution: it tosses out people on the edge cases. By implementing this, the site has said that you are an acceptable loss (assuming you eventually get annoyed by waiting three seconds or so per page load), along with everyone who is on an old browser or can't/won't run JavaScript.
Interesting. I wonder why it doesn't supply you with some sort of token after the first proof, to show you've already verified yourself? This is, after all, a form of first-line authentication: identifying as human.
You could take the token and hand it off to arbitrary numbers of crawlers, defeating the purpose.
Ah. Sure. Good point.
It's always a cat and mouse game.
They'd have to all come from the same IP address, making them easy to throttle by traditional means.
This particular comment chain is discussing ways to deal with a rapidly changing IP address as a legitimate user.
Ah, you're right, I lost the context somewhere between reading and replying. Thank you for the correction.
If the server doesn't throttle by IP, then a malicious bot can pass the Anubis check once and run wild from a single IP. Extending that same throttling to tokens would provide comparable protections against bots while reducing the impact on users with changing IP addresses.
It may require significant changes in Anubis's design -- basing it on IP probably made things a lot easier for the initial version -- but the fundamental idea should be just as effective whether you identify users by IP or by token.
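A minimal sketch of that idea, assuming a token-bucket limiter keyed on whatever identity the server trusts: the solved-challenge token when one is presented, the client IP otherwise. The cookie name, rate, and structure are invented for illustration and are not Anubis's actual design.

```go
// Hypothetical per-identity throttle: rate-limit by challenge token when
// available, falling back to the client IP. Names and limits are made up.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

type throttle struct {
	mu      sync.Mutex
	buckets map[string]*rate.Limiter
}

func (t *throttle) allow(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	l, ok := t.buckets[key]
	if !ok {
		l = rate.NewLimiter(rate.Every(time.Second), 10) // 1 req/s with a burst of 10
		t.buckets[key] = l
	}
	return l.Allow()
}

// identity prefers the proof-of-work token over the IP, so a visitor whose IP
// keeps changing (train, failover) still counts as one client.
func identity(r *http.Request) string {
	if c, err := r.Cookie("challenge-token"); err == nil && c.Value != "" { // hypothetical cookie name
		return "token:" + c.Value
	}
	return "ip:" + r.RemoteAddr
}

func main() {
	t := &throttle{buckets: map[string]*rate.Limiter{}}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if !t.allow(identity(r)) {
			http.Error(w, "rate limited", http.StatusTooManyRequests)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	http.ListenAndServe(":8080", nil)
}
```

Whether this is acceptable still depends on the token being hard to share across crawlers, which is exactly the objection raised above.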
That's not what Anubis does though. You're talking about a different solution.
Sure, creating a scheme to provide a token to a user based on a proof of work to allow requests is a good idea that could be effective in slowing down crawlers, but it isn't Anubis.
Having dug into the details of how Anubis works now, I think I see your point. I didn't realize that the tokens are designed so that the challenge server is stateless and the tokens can be independently verified, with the client IP address being an important source of randomness for the challenge.
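For anyone else piecing that together: the rough idea behind a stateless challenge (simplified, with fields and key handling that are my assumptions rather than Anubis's real scheme) is to derive the challenge deterministically from request metadata plus a server secret, so any instance can recompute and check it without shared session storage.

```go
// Simplified illustration of deriving a challenge statelessly from client
// metadata; the exact inputs and key handling are assumptions, not Anubis's.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

var serverSecret = []byte("replace-with-a-real-secret")

// challengeFor is deterministic for a given client fingerprint and day, so any
// server instance can recompute it when checking a submitted proof-of-work.
func challengeFor(clientIP, userAgent string) string {
	mac := hmac.New(sha256.New, serverSecret)
	fmt.Fprintf(mac, "%s|%s|%s", clientIP, userAgent, time.Now().UTC().Format("2006-01-02"))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	// The same inputs always yield the same challenge; a new IP yields a new
	// one, which is why a constantly changing IP forces repeated proofs.
	fmt.Println(challengeFor("203.0.113.7", "Mozilla/5.0"))
	fmt.Println(challengeFor("203.0.113.8", "Mozilla/5.0"))
}
```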
It's not much slower than the average load speed of a webpage these days.
It's also probably not required if pages are behind some authentication mechanism, meaning you'll only have trouble with public pages where you have to navigate a lot. I imagine a typical flow for most people is (visit website -> navigate to article -> read article), which can probably be done quickly enough on a train, and you'd probably want to read the article, which will take a while.
If you're savvy, you could set up a tunnel/proxy to, say, your home server or a cloud VPC with a sticky IP for cases where you want to navigate gitlab.gnome.org on a train.
All in all, I think the trade-offs are worth it, and necessary, as the alternative is all those services becoming unavailable due to a scraper-locust plague.
This is our life in this post-LLMocalyptic age.
Yeah. This would also be a problem for any sort of geo-based IP failover scenario (e.g., a data center or some part of it goes offline, rerouting you to another DC via failover). That would be a lot of proofs suddenly needing to be performed, depending on scale.
Good for her; that must be really cool to see happening with something you've built yourself. I do find the premise quite strange, though. I'm not the only one who said that Anubis is - maybe not trivially, sure, but still - easily defeated, but more importantly it also stops your website from ever being indexed by the rest of the internet that does not have malicious intent. I get being disillusioned by the AI crawler scourge, but I really doubt that this is the right way to go about it.
My understanding is Anubis only acts on user agents that look like a browser, such as one with "Mozilla" in the name. A search engine, with a well-behaved crawler, will identify itself in the user agent string instead of pretending to be a browser, and Anubis would let it through.
Unless I missed something. Once I heard that, it addressed one of my major issues with the system, since it shouldn't impede Internet Archive or further enshrine Google's lead in the search space.
Edit: Yep, the blog post announcing it says it acts on user agents containing "Mozilla," which covers every major browser for historical reasons. A non-malicious bot will have a UA like "Googlebot," which can thus be allowed or blocked by that string, and might also be well-behaved and follow robots.txt.
If anyone else is wondering what Anubis does, the best explanation I could find is the blog post announcing it.