They had a massive site of niche communities full of knowledgeable people, making reddit the de facto default for finding organic knowledge for non-professional searches. Then they spent more than a decade stamping it out in favor of low-effort regurgitated content, and they just kept forcing out community-conscious moderators and high-effort posters, leading to the site today, which will probably continue to deteriorate.
I don't actually know if this deal would be more valuable if they had tried to maintain the utility of their site, but it certainly seems ironic. Also, some LLM will be trained on the current content, which does not seem likely to lead to terribly useful outputs, and as a bonus there is an unknown but likely non-trivial amount of masked bot content there already.
Then there is the fact that the site is adding another form of monetization of its users without their informed consent.
Reddit doesn't seem to understand, or care, that it is already a zombie. I guess it will live on for years like the Simpsons does, a shadow of its former self. The owners may still make money, at least in the short term, but they killed most of the value of the site in the last few years and it will become more and more noticeable over time, and everyone will go somewhere else eventually. They could have been a lot more successful if they didn't chase short term monetization so hard.
This is a bit like selling an old growth forest that's already been clear cut to nothing: you can still sell it on the internet to buyers who don't know better, until the satellite images catch up.
“Think about how stupid the average person is. Then remember half of them are dumber than that!”
Even a terrible lowest-common-denominator version of the site will be useful indefinitely.
The point about the lowest common denominator is that for a large enough set (and anything targeted at the mainstream qualifies handily), the expected value will always be one. That means it ignores any kind of personal preference or any other characteristic of the viewer, so it will always be inferior to a tighter experience for basically anyone.
My theory is that the widespread acceptance is mainly due to conformity and the insane pace of modern life.
As for reddit itself, I just don't know; it may remain in use, but the days of quality subreddits are probably over.
60m a year does sound like it's worth it. Apparently the revenue for 2023 was 800m, so 7.5% of revenue in exchange for no extra work, beyond making sure people can properly harvest data that reddit doesn't need to produce itself, sounds like easy money.
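For anyone who wants to check the 7.5% figure, here is a minimal sketch of the arithmetic. It uses the two numbers quoted in the comment above (60m per year for the deal, 800m revenue for 2023), which are taken as given and not independently verified; the function name is just an illustration.

```python
def share_of_revenue(deal_per_year: float, annual_revenue: float) -> float:
    """Return the deal value as a percentage of annual revenue."""
    if annual_revenue <= 0:
        raise ValueError("annual_revenue must be positive")
    return deal_per_year / annual_revenue * 100


# Figures as quoted in the thread (assumptions, not verified here).
DEAL_PER_YEAR = 60_000_000
REVENUE_2023 = 800_000_000

if __name__ == "__main__":
    pct = share_of_revenue(DEAL_PER_YEAR, REVENUE_2023)
    print(f"{pct:.1f}% of revenue")
```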
Now is that worth it for AI companies? I don't know, sounds like a pain given all the bots they'd need to filter through.
Reddit is hopping on the AI train. Quelle surprise! I'm wondering if I should bother deleting my posts/comments or my account as a whole if I don't want my contributions to be a part of this. Maybe it's already too late.
The second best time to do it is right now.
Far, far too late. The whole thing that started off the big reddit API change was supposedly that reddit was upset that AI companies were harvesting data from the API feed.
https://www.reddit.com/r/MachineLearning/comments/12r7qi7/d_new_reddit_api_terms_effectively_bans_all_use/
It's too late for that, but it's not too late for other data harvests.
https://rentry.co/unreddit
Let me know if you use this and there are any issues (it's from the Exodus times).
The golden rule of prod data is to never delete anything (unless it’s required by law). Always tombstone. So yeah it’s not going to do anything.
Not quite true iirc: Reddit can undelete data but it can’t revert changes, so the method is to edit your comment to new text (e.g. “Deleted”) and then delete it.
Although they may have changed that in the wake of the API protest.
I can all but guarantee that, if this was ever true, it is not now.
What do you mean by tombstone?
Flag the data as deleted in a database to stop exposing it publicly. This ensures the data is retained and creates the appearance of being deleted to end users.
Hmm. How is that done? I deleted my reddit stuff last summer when the API shit went down and a ton of it is still publicly available.
Technically this is usually done with a “deleted_at” timestamp column for each row in the comments table in the database. Then when displaying comments you filter for where deleted_at is NULL.
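As a rough illustration of the pattern described above, here is a minimal sketch using Python's built-in `sqlite3`. The table layout, function names, and the "[deleted]" placeholder are all hypothetical; nothing here is known to match Reddit's actual schema. It also models the "edit first, then delete" idea from earlier in the thread, under the loud assumption that the store keeps only the latest body and no edit history (which others in the thread doubt).

```python
import sqlite3
from datetime import datetime, timezone


def make_db() -> sqlite3.Connection:
    """Create an in-memory comments table with a nullable deleted_at column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        "id INTEGER PRIMARY KEY, author TEXT, body TEXT, deleted_at TEXT)"
    )
    return conn


def add_comment(conn: sqlite3.Connection, author: str, body: str) -> int:
    cur = conn.execute(
        "INSERT INTO comments (author, body) VALUES (?, ?)", (author, body)
    )
    return cur.lastrowid


def soft_delete(conn: sqlite3.Connection, comment_id: int) -> None:
    """Tombstone a row: set deleted_at instead of removing the row."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE comments SET deleted_at = ? WHERE id = ?", (now, comment_id)
    )


def overwrite_then_delete(conn: sqlite3.Connection, comment_id: int) -> None:
    """Replace the body first, then tombstone (assumes no edit history is kept)."""
    conn.execute(
        "UPDATE comments SET body = ? WHERE id = ?", ("[deleted]", comment_id)
    )
    soft_delete(conn, comment_id)


def visible_comments(conn: sqlite3.Connection) -> list:
    """What end users see: only rows where deleted_at IS NULL."""
    rows = conn.execute(
        "SELECT body FROM comments WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


def stored_bodies(conn: sqlite3.Connection) -> list:
    """What the database still holds, tombstoned rows included."""
    rows = conn.execute("SELECT body FROM comments ORDER BY id")
    return [row[0] for row in rows]
```

In this sketch, a plain `soft_delete` hides the comment from `visible_comments` while `stored_bodies` still returns the original text. `overwrite_then_delete` leaves only the placeholder behind, but only because this toy table stores a single body per row.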
As for your situation, if you delete your account I believe Reddit just stops showing your username next to the comments. They don’t actually delete anything.
I didn't delete my account, I used a plugin to delete all of my comments and posts individually.
The idea of tombstoning is the same as you putting files in the trashcan on your PC. It's still there if you need to restore it, but it's gone from its saved location.
So while you deleted your posts on your end, in Reddit's database it's merely tombstoned: in the trashcan, but not gone.
Whether or not Reddit does that I don't actually know. It is standard to do so however.
This post was the kick in the pants I needed to finally delete my reddit account. Thanks y'all!
I ended up storing/copying the lengthier quality comments I've made (not too many, I'm a bit of a lurker) in a Notion doc. However, I'm realizing that this may be ultimately useless as Notion now has AI built into it. Sigh.
If you have the patience to wait up to 30 days, you could ask Reddit for a copy of your data first (looks like they legally have up to 30 days to comply, and you can bet they’ll use every minute they can get away with to delay your request).
They send a CSV with a comprehensive list of every comment and post you’ve ever made with your account, so while it’s not super easy to search in future, you have all the data in a surprisingly compact file. Not sure if it also contains the images or videos you posted; almost certainly not. But at least you’ll have the text.
They almost certainly have already archived everything for a dataset
The most useful dataset for Reddit to sell would be pre-ChatGPT posts and comments, so I imagine they archived stuff off ages ago.
EU has Right to be Forgotten, and a few US states have similar data privacy laws. Otherwise: it may be too late.
But as another said: 2nd best time is now
With the incessant amount of AI posted garbage in the comments and posts these days, this is certainly ripe for horrific results. GIGO errors are certainly the norm these days it seems.
Of course training AI on text written by actual Redditors might not be much better!
I guess it depends. It's real-life internet talk that can make a bot sound like a human, but it's also fucking repetitive. I don't know if there's any real informational value in there. People sharing their expertise is probably a thing of Reddit's past; there's a lot of it already there, but that has to have been scraped by bots already.
I was referring more to all the typos and auto-correct errors. It's ugly out there!
Hah, never considered my atrocious grammar hurting LLMs. Now I can say I type like a preschooler for a purpose.
It doesn't hurt them, it improves them. It's a form of data augmentation. Makes the LLM more resistant to mistakes in the user input.
Just type better :p
My guess is that this is the real reason behind the API price increase. They must have seen a dramatic increase in queries when the big models started training and found out just what those companies were willing to pay for the data.
And I'm fairly sure any user generated data pre-GPT is going at a premium. That well is completely poisoned by now and we're seeing AI generated content designed to influence AI generated content.
But you'd be dumb not to plug your data into a training set, even an in-house one. All that streamed traffic and weather data. Decades of SCADA readings. Tax and medical files. If you have clean data, someone will pay for it right now.
That's correct. That's the main reason Reddit raised API prices: https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html
Low background radiation training data.
Yeah, it's like pre-nuclear testing steel.
I love this analogy, it feels so apt. Do you think the AI revolution will be as world changing as the nuclear revolution was? Will it be as scary? More scary?
'Meh'...
I have every confidence that unscrupulous companies will scrape whatever they want, willy-nilly. Stop engaging on reddit? Sure. Stop engaging on reddit in favour of Tildes? A somewhat meaningless gesture. The social contracts of the internet, which have long been under attack, are now effectively completely dead; this is the final nail in the coffin. And there is hence an increasing tension, among the technically- and socially-conscious, with respect to the public sharing of information. I know one very talented computer programmer who has committed to not publicly sharing or releasing any of his future projects, for this reason. (Coincidentally, this happened around the same time as he shared a tip about a particular robot that was joining and surreptitiously logging Discord channels on behalf of a Chinese company. In case you doubted 'by hook or by crook...')
The legal angle is interesting, but academic. Some have hoped the courts would rule that copyright can be laundered through machine learning models, and that the possibility of being able to launder all copyrights would be tantamount to repealing copyright altogether. Others have hoped they would rule that training ML models on copyrighted data is an infringing use. In point of fact, they have, very predictably (well, hindsight is 20/20, but I did call this one, and I think it's fairly obvious unless you sequester yourself in a cloister of formalism and uncoloured bits), opted to minimise disruption and ruled that training ML models is legal, and infringing output is infringing output.
I see it as less about screwing over companies (which is nigh impossible with our current culture and regulation) and more about trying to find and engage with the kinds of community you want to see. The mass and mainstream will never be such a community for someone who wants to see quality, nuanced discussion.
Unfortunate, but inevitable among the tech-conscious. I do wonder how big an impact this will have on open-source communities, knowing all that hard teamwork can be extracted by trillionaires. I imagine there will at the very least be a lot more shifts from MIT to GPL or other kinds of license.
You can certainly do that, but it is not what people in this thread are talking about. And reddit is not a monolith.
Debatable, in the same way you can argue that Wal-Mart isn't a monolith in the strictest sense of the word. I think, in a colloquial sense, enough people saying "reddit is the only place for [niche community]" provides enough of an argument to suggest it has monopolies in dozens, even hundreds, of small hobbyist communities. So many fear leaving reddit because there is literally no alternative for those niches.
I'd give them the same advice. I think we've seen at least 3 moderately-scaled attempts to boycott reddit and we know how well that went: it doesn't work. AI boundaries are being challenged in courts as we speak; there's nothing to do there but wait. I wouldn't advise someone to delete their reddit account on the basis of stopping the AI overlords from farming their content.
Most of us can't save the world, but many of us collectively can seek our own preferences. 10,000 people leaving reddit won't even register as a blip to Reddit, but 10,000 migrating to their own community can easily make or break a new forum. If people could shift their mindset away from "killing big site" and towards "cultivating new site", we'd solve the lack of alternatives for many of those niches overnight.
Perhaps they can explain why one should pay them for this data when the entire history of reddit is archived right here for free. I've already got every comment and submission ever made to every music subreddit, thanks. Sometime when the AI tools manage to become performant and reliable I might see what they can make of it all.
Everything posted post-exodus is trash compared to the content of those archives, since it's all downhill from here. They have nothing worth buying.
I was just about to go on Reddit and edit my old posts/comments, but then I had a thought -- can the LLMs scrape tildes.net for training data as well?
Sadly yes. Every publicly facing website has its data up for grabs, and the scraping companies believe they can get away with it for the most part regardless of legality.
I wonder if this trend will trigger a resurgence of private communities. It's not a 100% deterrent but would raise the barrier significantly.
Otherwise I imagine some opinionated people may decide to opt out of the Internet as a communications platform entirely. This is very sad to me.
The most popular alternative people moved to during the API protests was indeed Discord. So that may not be as far-fetched as we think. It ruins a lot of the point of public forums, but a lot of modern audiences just want to get their quick word in or question asked and take off. For them, Discord makes a lot of sense.
Discord is using their user data as training data too.
@Deimos - perhaps we could have the OpenAI scraper added to the robots.txt? Of course, other bots will still scrape the content.
I've added it, following their instructions here: https://platform.openai.com/docs/gptbot
Of course, it doesn't guarantee anything, but if there are any other AI companies to add that you know of, let me know.
Speaking of, hungariantoast recently created a related GitLab issue you might also want to take a look at:
https://gitlab.com/tildes/tildes/-/issues/818
Given how robots.txt is often ignored, I don't know how effective it would actually be though.
The playbook is to collect data without any way to opt out when you’re a little scrappy startup. Then once you’re a big name, make a big announcement that you respect robots.txt now and that people can opt out of your data harvesting process.
So I do think these big guys are respecting the opt out. Although they might acquire datasets from 3rd parties who don’t.
It's kinda disturbing that all this is opt-out instead of opt-in. Is there no default opt-out of everything? I guess either way everything gets scraped without permission anyway (and probably packaged and sold as its own product to those who want that type of data).
There should be. robots.txt could block all bots with a couple of lines. But of course almost no one respects it.
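For reference, the "couple of lines" would be the standard wildcard rule, and a per-bot rule like the GPTBot one mentioned earlier looks much the same. Here is a minimal sketch that uses Python's standard `urllib.robotparser` to check what a rule-following crawler would conclude from each file. The example URL and the `SomeOtherBot` name are made up for illustration, and this only models compliant behaviour; it does nothing against bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Blocks every crawler that honours robots.txt.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Blocks only the crawler identifying itself as GPTBot.
BLOCK_GPTBOT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/thread"  # hypothetical URL
    for name, rules in (("block all", BLOCK_ALL), ("block GPTBot", BLOCK_GPTBOT)):
        for agent in ("GPTBot", "SomeOtherBot"):
            print(name, agent, is_allowed(rules, agent, url))
```

The per-bot version has to be kept up to date with every crawler name you learn about, which is why the wildcard is the closest thing to a default opt-out.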
Funny you should mention that... I was just about to post this article, which mentions that it's becoming impossible to use robots.txt because it's being actively ignored by AI scraping bots.
If it’s on the internet, of course they can.
So the whole reason for Reddit hiking the API price is that the API was letting the AI scrape reddit more efficiently than just reading the public website?