As you may know, on Glade Art we take anti-bot measures very seriously; it is one of our topmost priorities to protect our fellow users from having their art trained on. We also like to troll bots by trapping them in endless labyrinths of useless data, commonly referred to as "honeypots" or "digital tar pits." And so, after 6.8 million requests in the last 55 days at the time of writing, we have some substantial data, so stand by and let us share it with you. : )
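For readers unfamiliar with the technique: a tar pit serves endlessly generated junk pages whose links lead only to more junk pages, so a crawler that follows them never runs out of URLs. A minimal sketch in Python - the function name and the `/trap/` URL scheme are made up for illustration, not Glade Art's actual implementation:

```python
import hashlib

def tarpit_page(path: str, n_links: int = 5) -> str:
    """Deterministically generate a junk page for `path` whose links
    point only at further tar-pit pages, never at real content."""
    seed = hashlib.sha256(path.encode()).hexdigest()  # 64 hex chars
    links = [f"/trap/{seed[i * 8:(i + 1) * 8]}" for i in range(n_links)]
    body = "".join(f'<a href="{href}">{href}</a>\n' for href in links)
    return f"<html><body>\n{body}</body></html>"
```

Because each page is derived from a hash of its path, the pages cost almost nothing to generate but form an effectively infinite link graph.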
Maybe a hot take, but I really couldn't care less. My own websites and apps (like delphi.tools) all compile to static HTML and have no server component, so this isn't really impacting my resources, and all the stuff that's on there is stuff I willingly put on the internet, so it would feel weird to me if I said "oh well bots don't get to read my blog!", as if I didn't know this could happen when I uploaded to the damn web. I get that if you're running a web app you have to protect yourself from bots putting a strain on your infra, but this problem is trivial to undetectable for anyone who just puts their stuff on the web and doesn't bloat their page with analytics and tracking that can phone home.
How do you feel about the fact that these bots ignore robots.txt? Even if it's "weird" to say these pages are off-limits when they're on the internet, the fact that so many people set up their bots to ignore website owners' requests is the unethical breaking point for me.
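For context on why ignoring it is so easy: robots.txt is purely advisory. A polite crawler has to opt in to checking it, e.g. with Python's standard library:

```python
from urllib.robotparser import RobotFileParser

# robots.txt is a plain-text request; honouring it is entirely up to
# the crawler. A well-behaved one checks every URL before fetching.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("MyBot", "https://example.com/public/page"))   # True
```

Nothing at all happens to a crawler that simply skips this check, which is exactly the complaint here.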
Not to be a joyless cynic, but I always thought it was deeply naive to expect potential bad actors to respect the honour system. I can't say I'm surprised. It's not good form, certainly, but I'm not going to say that it's unethical. If ethics were a consideration, the standard would have made provisions for direct enforcement of these rules instead of asking nicely.
Counterexample as someone who is starting to put content on the internet, I do care regardless of the resource impact. I'm not running analytics or using any non-static resources (at least not currently) but I want people to interact with what I write and produce, not bots. Call it vanity perhaps, but I'm not putting things on the internet out of pure altruism - I want some amount of validation and credit and feedback. Most bots today don't provide that and more often provide the opposite in that they disintermediate between my work and potential audience. If the return (monetary or via ego boost) on putting things on the internet goes negative then people (myself included) will find alternative distribution channels, likely ones less free, widespread, or available, which would be sad. So even if bots aren't directly costing me money they are still an element of a web shifting towards more intermediaries which I'd like to avoid.
How exactly is excluding bots helping in your goal of getting more eyes on your work? Seems like that wouldn't make a difference.
Caveat up front that I don't really have any reliable data to back up the following statements and I could easily turn out to be incorrect about the direction of the internet. Prognosticating is error-prone!
Some bots are fine: search crawlers, RSS/Atom feed readers, etc. are in theory net directors of traffic, or at least wouldn't detract from it. The bots at issue on the current internet, though, have a different purpose: they're LLM training-data scrapers, RAG query-answering bots, and other ingestors of data who have no (or negative) interest in sending traffic to my website. Their aim is to provide an alternative that users have no need to leave; they are building a generic competitor to all other websites, a competitor which is well funded and desires to take your traffic. Trying to make their lives difficult is a very small, probably ineffective, but maybe collectively useful way to delay them taking the content, and it gives people who come to my website directly a benefit. I view it as attempting to prevent a future where they keep everyone in their walled gardens while the rest of us can only feed their machine.
Having said all that, I don't think excluding bots is a particularly effective approach - instead I'd rather try to find audiences who actually want human content instead.
My own websites and apps (like delphi.tools) all compile to static HTML and have no server component,
making it publicly available on the internet means someone's spinning up a server. Even if you don't care about the scraping, you will care when your website slows to a crawl because the bots are basically DDoSing your site, right?
The caveat is that if you stay small, this may not happen. But any amount of visibility (even just some facebook post linking to your site for some reason) can change that.
this problem is trivial to undetectable for anyone who just puts their stuff on the web and doesn't bloat their page with analytics and tracking that can phone home.
It's tough because the topic here involves artists. Artists need to show off their work to get more work. But they can't show off so much now that other malicious actors essentially steal their work.
I don't think most artists can spin up their own website. Even if they could, doing that goes against the visibility needed to advertise themselves.
It's not good form, certainly, but I'm not going to say that it's unethical
Ethics is often disconnected from law. I wouldn't say "it's not unethical because it's legal".
My website doesn't slow to a crawl though, I optimise my software and deliver through Cloudflare, which abstracts this away for free (and, in fact, does offer more robust bot protection should I ever need or want it).
I don't really care to concede my point about putting things on the open web and then being surprised that people steal it. That's just the game, and it's something you have to contend with. This is not an argument, I'm just saying: this happens. If you have a problem with that, that's entirely valid and I can't blame you, but it doesn't change the fact that it happens. Yes, I also wish there was public transport everywhere, but there isn't, so until that changes we're all gonna have to learn how to drive. It's not a prescription about whether or not that is good, it's just "the way it is".[1]
And to your last point, I agree. Ethics is disconnected from law. I didn't say it's legal, I make no prescriptions about that because I am not a lawyer, and a flimsy RFC doesn't decide whether ignoring robots.txt is illegal or not, and more notably it also doesn't determine whether that's ethical. I could make a point here. What if an artist dies who has armed their portfolio page to the teeth with turrets that only shoot bots, but I decide that it's worth preserving? Do I get to ignore robots.txt then, for the benefit of the fan community that would otherwise lose access to that work in a matter of months if not weeks?
[1] Although, admittedly, the status quo also happens to align with my position. You brought the stuff onto the web, you have to know that it'll not be yours for long, and I think that's one of the great strengths of the web. Again, to me, non-issue.
Yeah scrapers are a wonderful example of all sorts of holes in our current infrastructure, standards, and legislation.
"Please don't scrape my site" is basically all the teeth you've ever really had unless you're behind something like cloudflare. The way traffic is handled makes it very hard to identify bad actors, and the laws mean that these AI companies can just pay Asian nations for these vast swaths of data that are totally definitely legally acquired.
It's going to be a very tricky thing to handle because while it COULD be done right, it could also be used as yet another excuse to jam horrible practices into place for future control.
Reading the article, proof-of-work seems to be a tool that you have. I don't see why smaller actors can't use that.
It is, but I'm wary of the overhead (especially if you're already trying to do minimal or no JS), the long-term effectiveness (I suspect part of its success is due to its lack of deployment), and possible knock-on effects (not sure how much it's going to change things if EVERY site is jamming PoW)
Edit:
Coincidentally, I accidentally clicked the link again when going to see the topic on Hacker News and it gave me a difficulty-8 PoW challenge. After about a minute my phone still hadn't made any progress.
Edit 2 -
Oh, and now that I'm actually in the comments section, it seems like I'm far from the only one with this experience.
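For reference, the scheme being discussed is hashcash-style proof of work: the server hands out a challenge, and the client must grind through nonces until it finds a hash that clears a difficulty threshold. A rough sketch, assuming "difficulty" counts leading zero bits (the exact unit varies by scheme; if one difficulty unit were a hex digit, i.e. 4 bits, difficulty 8 would mean roughly 2^32 hash attempts on average, which would explain a phone making no visible progress):

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: str, difficulty: int) -> int:
    """Client side: grind nonces until the hash clears the threshold."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash checks what took the client many."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

Each extra bit of difficulty doubles the expected client work while verification stays one hash, which is why tuning it for slow mobile hardware is genuinely hard.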
Speaking of "minimal or no JS", the blog entry requires JS to read. It's mentioned in the text that this is part of the bot mitigation. I can see doing that to protect the art from being scraped, but maybe the blog part shouldn't be protected that way, because I may not want to disable the NoScript plugin just to read a blog entry.
And by the way, once I temporarily enabled scripts on the site, I noticed that I very much don't like the site aesthetics of green on black. It's not accessible. Also, the image at the top of the blog entry had artifacts or something, so it shimmered in an annoying way. I realize this is off-topic noise, but it really made me just want to leave the site.
I'm sure it's possible to use JS like they do and still have the site be accessible to those using screen readers... but if you made me put down money I'd definitely bet against this site being accessible on that front, too.
"Please don't scrape my site" is basically all the teeth you've ever really had unless you're behind something like cloudflare
There's so much more that you can do beyond saying please. Automated traffic and websites/apps have been in an arms race since before the dawn of the commercial internet. The balance has never really changed: sometimes automated traffic gets ahead a little; more often, identification and blocking is a little ahead. If you don't want to be scraped, and you have the time and expertise (or the resources to rent them), you can block the vast majority of it. If you just want to block the larger part of it, you don't need any of the above, just an out-of-the-box solution (like Cloudflare).
Cloudflare collected a lot of techniques that people were already using and made them easily accessible but they aren't the only, or even the best, way to deal with bots.
At the end of the day, botnet operators aren't usually particularly bright or technically proficient. They're after volume, not quality. People who are skilled can make more money elsewhere (with the possible exception of state-sponsored operations).
The volume of bots though, that is definitely going up, and quickly.
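On the "identification and blocking" point: much of the out-of-the-box tooling boils down to per-client rate limiting. A token-bucket sketch (illustrative only; real deployments key on smarter fingerprints than a bare client ID):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: a human's browsing pace refills faster
    than it drains; a scraper's flood empties the bucket quickly."""

    def __init__(self, rate: float = 2.0, burst: float = 10.0):
        self.rate = rate    # tokens added per second
        self.burst = burst  # maximum bucket size
        # client_id -> (tokens remaining, timestamp of last update)
        self.state = defaultdict(lambda: (burst, time.monotonic()))

    def allow(self, client_id: str) -> bool:
        tokens, last = self.state[client_id]
        now = time.monotonic()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.state[client_id] = (tokens, now)
            return False  # over the limit: serve a 429 or a challenge
        self.state[client_id] = (tokens - 1.0, now)
        return True
```

A human loading a page every few seconds never notices it; a scraper pulling dozens of pages per second drains its bucket almost immediately.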
I wonder if it's done by botnets or if people in Asia are being paid to run these things at home?
The answer is likely "yes".
There's TONS of bot farms all over the world because they're somewhat easy to run. You need the capital to get the devices and infrastructure you need/want, but then it just...does the thing, with some minimal maintenance by a local (who may not even be good at troubleshooting, and is just given basic instructions or has it handled remotely).
As for the botnets, well, naturally? The whole point of a botnet is to do something that requires a lot of devices, and a good way to get a lot of devices is to just run in the background of someone else's system (obviously, seldom legally). Scraping is probably a hell of a lot more lucrative and less risky than a DDoS or whatever, and as with any resource like this you're probably looking to prevent idle/downtime. You've got the "machine", so you want it running as much as possible to generate revenue to cover costs.
https://finance.yahoo.com/news/click-farms-internet-china-154440209.html
The answer is indeed "yes". They had these click farms over a decade ago, ready simply to drive traffic to an app or site. I can only imagine the dark arts involved now with AI scraping being so in-demand.
Does Tildes use similar tarpits? I'm a logged-in user so I don't see any proof-of-work requests; are there any such for regular browsing humans? How scraped is the data on Tildes?
On the scraping front, you can just put your username into a search engine + site:tildes.net. Your profile should almost certainly be found, along with some body text from a reasonably recent post.
Yours:
Comment on The feckless opposition in ~society chocobean 10 hours, 10 minutes ago Link Parent
Mine:
It was refreshing to search online tonight after reading this historical piece about the science behind man-made diamonds and see that it has genuinely changed finally.
Skybrian:
"You cannot make a return on investment if you don't have access to the U.S. market," Bancel told Bloomberg, noting that high-level headwinds have made the
I didn't look at how recent any of these three actually are. But they're certainly examples of pages that are public and being scraped, that's for sure.
for me, most were from some 4-5 months ago in the top results. But one was from 5 days ago, from the "Sora is cancelled" topic. This is just a gut feeling, but I'm guessing it doesn't scrape from Tildes until a story gets a certain amount of comments or votes on it.
The Internet started out as a distributed, democratized, cooperative endeavor. This lack of centralization provided a level playing field whose participants could all have access to exposure and connection for a negligible amount of investment. Conversely, any information or service could be at your fingertips if you could find it. The benefits to our species and to our global society were massive (or so I would argue).
By making it so that only huge participants can afford to be connected - most small players must hide behind, and therefore rely on, the larger ones, as recommended in this very article - the Internet's out-of-control bot problem directly erodes that original premise. It's an aggressive and harmful attack on global society, democracy, and prosperity. It's not about copyright or the unauthorized use of one's content, but about the network effect of it all, regardless of whether a few people's technological needs happen to let them coexist with this horrible dystopian situation.
Thus I would argue that logic dictates that abusive bot use should be prosecuted and heavily sanctioned by the law. I'm talking serious jail time.