58 votes

Reddit will block the Internet Archive

Posted August 11, 2025 by macleod

Tags: internet, social media, wayback machine, artificial intelligence, language models.large, archive org, reddit, scraping.web, data.training, author.jay peters, source.the verge, paywall

https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit

Link information

This data is scraped automatically and may be incorrect.

Authors: Jay Peters
Published: Aug 11 2025
Word count: 682 words

24 comments

[3]
largepanda
August 11, 2025

This is just depressing.

67 votes
1. [2]
  plutonic
  August 11, 2025
  
  All because Reddit is in the business of selling user comments and information to AI companies. Can't have rogue companies stealing your data that Reddit never paid for.
  
  56 votes
  1. Apocalypto
    August 12, 2025
    
    Really makes you appreciate the fact that Tildes is a nonprofit.
    
    4 votes
[12]
balooga
August 11, 2025

I love the Internet Archive but I’ve been saying for years that we shouldn’t be putting all our eggs in one basket. We’re one censorious regime (or hostile takeover, or destructive wildfire, or malicious hacker, or funding depletion, or leadership retirement, etc.) away from the collapse of the internet’s most valuable resource maybe next to Wikipedia.

We have BitTorrent and blockchains, but I’m honestly surprised that decentralized, distributed computing never really became “a thing” on any bigger scale. We should all be running nodes in a giant swarm of web scrapers, making our own collectively shared Wayback Machine. One that’s resistant to server blocks, censorship, and takedowns. Frustrating that it’s 2025 and we still don’t have anything like this.

44 votes
1. ShroudedScribe
  August 11, 2025
  
  > One that’s resistant to server blocks, censorship, and takedowns.
  
  A lot of sites will take action against residential IP addresses that are scraping data. That could be anything from throttling to an outright ban.
  
  There are other ways to decentralize, such as using server resources (VPS, bare metal, or even cloud compute) from various providers. But this starts to become a bit more centralized and dependent on these hosting companies, as they can close accounts however they see fit.
  
  17 votes
2. CrypticCuriosity629
  August 11, 2025 (edited August 11, 2025)
  
  > I love the Internet Archive but I’ve been saying for years that we shouldn’t be putting all our eggs in one basket.
  
  Long ago I had the idea of basically creating a decentralized crowdsourced internet archive that shares caches of sites with each other via P2P.
  
  Basically people install the plugin/server and configure how often (every hour, day, etc.) it downloads the webpages they frequent, or partials, HTML, etc., all configurable. It keeps that copy for X days and looks for incremental updates to that page while filtering out ads and stuff. Of course it removes cookies and personally identifiable information too.
  
  Completely configurable to give people the control of how often to take a backup, whether or not to include pictures and images, have a way to compress images, whitelists for sites you want to backup, blacklist for sites you don't, how many incremental backups to keep, how much memory before deleting older backups, etc. Was even thinking of a plugin that used the same processes as JDownloader2 to download videos off of sites too, so doing the same for YouTube.
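For illustration only, here is a rough sketch of how a few of those settings might be modeled. Nothing like this was ever built for the project described, so the class name, the field names, and the defaults are all hypothetical; it only shows the whitelist/blacklist check, the backup interval, and the cap on incremental backups.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Set
from urllib.parse import urlparse


@dataclass
class ArchivePolicy:
    """Hypothetical per-user settings; every name here is illustrative."""
    whitelist: Set[str] = field(default_factory=set)  # hosts to back up (empty = any host)
    blacklist: Set[str] = field(default_factory=set)  # hosts never to back up
    interval: timedelta = timedelta(days=1)           # minimum time between backups
    max_snapshots: int = 5                            # incremental backups kept per page

    def should_snapshot(self, url: str, last: Optional[datetime], now: datetime) -> bool:
        """Decide whether a page is due for a new backup."""
        host = urlparse(url).hostname or ""
        if host in self.blacklist:
            return False
        if self.whitelist and host not in self.whitelist:
            return False
        # Never backed up before, or the configured interval has elapsed.
        return last is None or now - last >= self.interval

    def prune(self, snapshots: List[datetime]) -> List[datetime]:
        """Keep only the newest max_snapshots timestamps, oldest first."""
        if self.max_snapshots <= 0:
            return []
        return sorted(snapshots)[-self.max_snapshots:]
```

A browser plugin along these lines would call `should_snapshot` whenever a page finishes loading and `prune` after storing a new copy.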
  
  It would work in the background via a browser plugin that connects to a hosted server that manages the backups, and basically just uses what you have in your cache and already basically downloaded, instead of like remotely downloading a website or scraping data.
  
  Those downloaded incremental backups of websites are then shared via a P2P network, so people can browse backups and download backups of said website.
  
  The idea was to have people be able to host a server where these backups can be stored and sorted, and you can configure it to download incremental backups of a website from the P2P network, and to store yours. Could be self hosted on Docker.
  
  Then of course you'd have places like Universities and Internet Archive with immense amounts of storage that would piggy back off the network too and give them the ability to make daily backups in some, and the ability to track changes over time.
  
  Hell, I even came up with a fun button that will send you to a random website that hadn't been archived in a while so people can click it when they're bored and it'll help out the network.
  
  I was talking to a few people about starting that project and the various ways we could get it to work, but like most project talk, people lose interest and move on. Never came to fruition.
  
  It would have been great for data hoarders like myself, and others that potentially need to have access to websites offline for various reasons. Also just generally good for the health of the internet too, it would have made this article a moot point.
  
  15 votes
3. [4]
  JXM
  August 11, 2025
  
  I mean, most of the stuff on the Internet Archive is available via torrents already. Almost every item I've come across has a torrent download option.
  
  That doesn't mean you can't lose an item because no one is seeding it, though. So you're up against the same problem. But obviously with a torrent, you'd need everyone seeding to disappear, not just the Internet Archive.
  
  8 votes
  1. [3]
    balooga
    August 11, 2025
    
    You’re talking about content that was scraped by IA and then shared in torrent form, right? So that would still be stymied by blocks like what Reddit’s doing.
    
    3 votes
    
    [2]
    JXM
    August 11, 2025
    
    > You’re talking about content that was scraped by IA and then shared in torrent form, right? So that would still be stymied by blocks like what Reddit’s doing.
    
    Yes, but presumably Reddit isn't just going to block the Internet Archive, but any scraper doing something similar.
    
    My original comment was responding to your comment that we need to diversify away from Internet Archive being the primary source for all things "old internet" and have multiple options. If they got taken down and poofed out of existence, it would be devastating, but it wouldn't mean the immediate loss of all their knowledge since it's all also shared via torrents.
    
    7 votes
    
    balooga
    August 11, 2025
    
    Oh I see what you mean. I think the “old internet” archives are in decent hands at this point because of that; there should be enough copies scattered around the web at this point to provide some resilience.
    
    I’m just concerned about future archives of today’s web, which will be “old internet” soon enough too. If we lose the primary way these archives are created, the project is dead. The reason I’m advocating for decentralized scraping is that if you spread it around enough, archival requests will be indistinguishable from regular browsing. It’s one thing to block traffic from a known data center IP, it’s another matter to play whack-a-mole with ten thousand residential addresses from around the world, crawling in a manner that is by all appearances uncoordinated.
    
    3 votes
4. [2]
  davek804
  August 12, 2025
  
  I agree we should all be nodes in what you described. The trouble is, no one gives a shit.
  
  The solution is to make them give a shit. How do you incentivize all the people to participate?
  
  Maybe we pay them in a free subscription to a paid service. You want Netflix? You need to keep this 1x tarball online for 95% of days in a month and you get a month free. Oh, you want a Comcast ISP subscription? That's a 3x tarball for 95% uptime.
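To make the arithmetic concrete, here is a small sketch of the "95% of days in a month" rule. The function names and the idea of counting whole days are assumptions for illustration; the comment doesn't say how uptime would actually be measured.

```python
import calendar


def min_days_required(year: int, month: int, threshold_pct: int = 95) -> int:
    """Smallest number of online days that meets the threshold for that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-days_in_month * threshold_pct // 100)


def earns_free_month(days_online: int, year: int, month: int, threshold_pct: int = 95) -> bool:
    """True if the node kept its tarball online on enough days (hypothetical rule)."""
    return days_online >= min_days_required(year, month, threshold_pct)
```

For a 31-day month such as August 2025 that works out to 30 days online, so a node could be offline for one day and still qualify.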
  
  Find a way to monetize the tool sufficiently to be able to pay the users, while also being a registered not for profit?
  
  Baby, you got a stew.
  
  8 votes
  1. cuteFox
    August 12, 2025
    
    that's essentially what private torrent trackers do, and it works pretty well
    
    1 vote
5. skybrian
  August 11, 2025 (edited August 12, 2025)
  
  Bluesky is the most archive-friendly user forum infrastructure I know about. Downloading the posts from each user's PDS (which is like a home directory that stores all their posts) is necessary to run the app, and anyone could do that, the same way Bluesky does it.
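As a hedged illustration of what "anyone could do that" might look like, the sketch below builds the request URL for the AT Protocol `com.atproto.repo.listRecords` endpoint, which lists records in a user's repository. The host `pds.example` and the DID are placeholders, and the helper name is made up; it only constructs the URL and makes no network request.

```python
from typing import Optional
from urllib.parse import urlencode


def list_posts_url(pds_host: str, repo: str, limit: int = 50,
                   cursor: Optional[str] = None) -> str:
    """Build a listRecords URL for one user's posts (helper name is hypothetical)."""
    params = {
        "repo": repo,                        # the user's DID or handle
        "collection": "app.bsky.feed.post",  # the record type Bluesky posts use
        "limit": limit,
    }
    if cursor:
        params["cursor"] = cursor            # pagination token from a previous page
    return f"https://{pds_host}/xrpc/com.atproto.repo.listRecords?{urlencode(params)}"


# Placeholder host and DID, for illustration only.
url = list_posts_url("pds.example", "did:plc:abc123")
```

Fetching that URL with any HTTP client and following the returned cursor would page through the user's posts.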
  
  The PDS can store arbitrary data. It's technically possible to build a link-sharing app on Bluesky (here's one), but they're not very popular yet.
  
  It's also inevitable that some AI companies will use Bluesky posts for training (if they aren't already), so being archive-friendly has downsides, too.
  
  4 votes
6. Wulfsta
  August 11, 2025
  
  IPFS seems worth mentioning here.
  
  3 votes
7. cuteFox
  August 12, 2025 (edited August 12, 2025)
  
  A search engine called "YaCy" is basically what you described: it's self-hosted and connects to a swarm. It has a crawler that runs locally, crawls and indexes websites, and makes them searchable by others. I think you can even view the cached pages. It does work, but the problem is that it's so obscure and niche that there aren't many people hosting it. It's also very heavy in terms of system resource usage. I installed it once but didn't find it very useful for myself.
  
  Edit: I just remembered one more search engine called mwmbl, which is a bit different. You can install an extension that crawls websites and then sends the data to mwmbl servers, so the crawling is done by users. Here's the link: https://github.com/mwmbl/mwmbl?tab=readme-ov-file
  
  2 votes
[9]
IsildursBane
August 11, 2025

I have mixed thoughts about this one. It feels relatively obvious that Reddit is doing this to stop scrapers from accessing the data via IA, so that companies will instead have to pay Reddit for access to the data. And while it seems to just be kind of a greedy move by Reddit, it does feel within their rights to do so. I think IA's mission is a noble goal, but it does end up just being used to get around paywalls, and the original creators/hosts do not get paid fairly for their work. If a small website reached out to IA to get them to stop archiving their website due to AI training concerns, I feel like the public perception would be in support of that website. However, when Reddit does it, the public perception is more negative. Maybe it is due to people not agreeing with how Reddit is handling AI at large, but I feel like it is valid for Reddit to push back and prevent archiving when it does not meet their standards (there were also comments by Reddit about IA archiving deleted posts, which Reddit did not appreciate).

14 votes
1. [2]
  papasquat
  August 11, 2025
  
  I'd feel a lot more sympathy for most other entities than reddit.
  
  > I think IA's mission is a noble goal, but it does end up just being used to get around paywalls, and that the original creators/hosts do not get paid fairly for their work.
  
  In this case "their work" is really the work of reddit's users. The only valuable thing reddit has is its users. The work that reddit.com actually does is near worthless. The platform isn't unique or interesting at all. There are no huge technical hurdles to making a new reddit, and it's actually pretty bad at most of the things it's supposed to do.
  The value is in the content created by the users, so for reddit to say "no you're not allowed to have any of this content that we didn't make, we just allowed people to make for us without compensation" is, while within their rights, pretty shitty.
  
  34 votes
  1. IsildursBane
    August 11, 2025
    
    > In this case "their work" is really the work of reddit's users. The only valuable thing reddit has is its users. The work that reddit.com actually does is actually near worthless.
    
    I think that is part of the reason that reddit does not have public support in their AI approach. It feels to most people that all reddit does is collect the cheque, but does no work for it, and does not pass on part of the profits to the people who are actually providing the value. I don't think reddit will ever be able to share profits with users though, as then it gets messy on how to revenue share. Do you split profits based on karma? - increase in bots and karma farming (which is already a major issue). Do you split profits based on word count? - Increase in fluff and using AI to pad out the responses. So I think that as long as reddit tries to monetize user content, they will not get public support
    
    8 votes
2. lou
  August 12, 2025 (edited August 12, 2025)
  
  Archive.org recently allowed me to recover a good chunk of my personal history in the form of my first blog which was deleted by blogspot for reasons I will never know. It was wonderful.
  
  I don't care if archive.org removes the TV shows. It would still be a great thing if it was only for things like that.
  
  13 votes
3. [5]
  stu2b50
  August 11, 2025
  
  Yep. I’m not a big fan of archive.is for the same reason. Good journalism costs money. Post-adpocalypse, subscriptions are the way. Is what it is.
  
  Don’t people hate ads, anyway? Pay for the news. And if you don’t want to, that’s fine, don’t read it.
  
  10 votes
  1. Tiraon
    August 11, 2025
    
    For me a big problem is that I usually go to a random news site from a link. Right now options usually range from subscription paywall to accessing the article somehow with difficulty to not accessing the article. What I choose varies, but a recurring 5-10 dollar subscription for a site I am likely to see a handful of times that month at most is a lot, not to mention the possibility of forgetting to cancel. If they simply had the option to prepay that amount for, let's say, fifty articles (entirely random number), personally I would be significantly more willing to pay.
    
    It is that way with most subscriptions. They are an absolutely terrible monetization method for the end user for a large number of content services. Right up there with ads and selling data and attention. Though better than the currently favored method of having all of them.
    
    17 votes
  2. [3]
    gpl
    August 11, 2025
    
    > Don’t people hate ads, anyway? Pay for the news. And if you don’t want to, that’s fine, don’t read it.
    
    This is more or less what I believe, but I think a lot of people feel entitled to viewing media for free and chafe at the notion of paying. Same dynamic as why a lot of people pirate, even if it is often justified in other ways. There is something about the net that makes people really abhor paying money to access things!
    
    8 votes
    
    [2]
    redwall_hp
    August 11, 2025 (edited August 11, 2025)
    
    Information is plentiful and you have finite minutes in your life. That means, by simple market economics, your attention is a valuable commodity and content is worthless. Nobody wants to pay for media, and they all must compete for viewership. No amount of trying to shove the genie back in the bottle is going to magically reverse the law of supply and demand...people will just find something else to look at.
    
    This is the same reason raw ticket numbers for movies have been declining for over twenty years, despite rising prices keeping the box office figures up. There's more competition for attention, and the cost/value is slipping as ticket prices go up.
    
    The inevitable conclusion of paywalls is more people just reading headlines, and getting their news from random vertical videos on Instagram.
    
    And, personally, my goal for the past few years has been to reduce my consumption of news media. I'll wait until I hear about big news and read things then. But otherwise, it's just deleterious to mental health. It's definitely a bit much to ask for someone to drop Netflix money on morbid entertainment when they can pay for actual entertainment.
    
    10 votes
    
    skybrian
    August 12, 2025
    
    I'll apologize in advance because this is mostly not responding to you, but I'm going to nit-pick "attention is a valuable commodity" because it's a cliché that bugs me. It's a very zoomed-out, high-level way to put it that ignores important distinctions:
    
    Although it's true that there is only so much attention someone can give others in a day, how much it's worth varies extremely depending on circumstances. There are many kinds of attention that have negative value. (Consider that often, the attention men give women is unwanted, and privacy is about avoiding attention.)
    
    Even when someone seeks attention from strangers, it's usually not from just anyone. This is true even of marketing. That's why there are targeted ads. A purchase funnel is about quickly getting rid of unwanted attention (often most of it) and narrowing it down to just prospective customers.
    
    To take the "commodity" metaphor seriously, copper is a valuable commodity and copper ore is just a lot of rock that contains copper.
    
    When people hope to be paid for their attention, they often greatly overestimate how much it's worth. If you're not the droid they're looking for, if you get paid anything, it will be a pittance. Marketers are often paying for advertising to find someone else.
    
    8 votes