67 votes

Reddit has a new AI training deal to sell user content

53 comments

  1. [6]
    Tiraon

    They had a massive site of niche communities full of knowledgeable people, making reddit the de facto default for finding organic knowledge for non-professional searches. Then they spent more than a decade just stamping it out to force low-effort regurgitated content, and then they just continued forcing out community-conscious moderators and high-effort posters, leading to the site today, which will probably continue to deteriorate.

    I don't actually know if this deal would be more valuable if they tried to maintain the utility of their site, but it certainly seems ironic. Also, some LLM will be trained on the current content, which does not seem likely to lead to terribly useful outputs, and as a bonus there is an unknown but likely non-trivial amount of masked bot content there already.

    Then there is the fact that the site is adding another monetization to the users without their informed consent.

    57 votes
    1. [2]
      hobbes64

      Reddit doesn't seem to understand, or care, that it is already a zombie. I guess it will live on for years like the Simpsons does, a shadow of its former self. The owners may still make money, at least in the short term, but they killed most of the value of the site in the last few years and it will become more and more noticeable over time, and everyone will go somewhere else eventually. They could have been a lot more successful if they didn't chase short term monetization so hard.

      39 votes
      1. chocobean

        This is a bit like selling an old growth forest that's already been clear cut to nothing: you can still sell it on the internet to buyers who don't know better, until the satellite images catch up.

        18 votes
    2. [2]
      teaearlgraycold

      “Think about how stupid the average person is. Then remember half of them are dumber than that!”

      Even a terrible lowest-common-denominator version of the site will be useful indefinitely.

      12 votes
      1. Tiraon

        The point about the lowest common denominator is that for a large enough set (and anything targeted at the mainstream qualifies handily), the expected value will always be one. That means it ignores any kind of personal preference or any other characteristic of the viewer, so it will always be inferior to a tighter experience for basically anyone.

        My theory is that the widespread acceptance is mainly due to conformity and the insane pace of modern life.

        As for reddit itself, I just don't know. It may remain in use, but the days of quality subreddits are probably over.

        6 votes
    3. raze2012

      I don't actually know if this deal would be more valuable if they tried to maintain the utility of their site but it certainly seems ironic

      60m a year does sound like it's worth it. Apparently the revenue for 2023 was 800m, so 7.5% revenue in exchange for no extra work outside of making sure people can properly harvest the data they don't need to produce sounds like easy money.

      Now is that worth it for AI companies? I don't know, sounds like a pain given all the bots they'd need to filter through.

      11 votes
  2. [18]
    boredop

    Reddit is hopping on the AI train. Quelle surprise! I'm wondering if I should bother deleting my posts/comments or my account as a whole if I don't want my contributions to be a part of this. Maybe it's already too late.

    29 votes
    1. Drewbahr

      The second best time to do it, is right now.

      26 votes
    2. Minty

      It's too late for that, but it's not too late for other data harvests.
      https://rentry.co/unreddit
      Let me know if you use this and run into any issues (it's from the Exodus times).

      16 votes
    3. [9]
      stu2b50

      The golden rule of prod data is to never delete anything (unless it’s required by law). Always tombstone. So yeah it’s not going to do anything.

      16 votes
      1. [2]
        unkz

        Not quite true iirc, Reddit can undelete data but it can’t revert changes so the method is to edit your comment to new text (eg. “Deleted”) and then delete it.

        Although they may have changed that in the wake of the API protest.

        9 votes
        1. Moonchild

          I can all but guarantee that, if this was ever true, it is not now.

          18 votes
      2. [6]
        langis_on

        What do you mean Tombstone?

        1. [5]
          rubix

          Flag the data as deleted in a database to stop exposing it publicly. This ensures the data is retained and creates the appearance of being deleted to end users.

          17 votes
          1. [4]
            langis_on

            Hmm. How is that done? I deleted my reddit stuff last summer when the API shit went down and a ton of it is still publicly available.

            4 votes
            1. [3]
              teaearlgraycold

              Technically this is usually done with a “deleted_at” timestamp column for each row in the comments table in the database. Then when displaying comments you filter for where deleted_at is NULL.

              As for your situation, if you delete your account I believe Reddit just stops showing your username next to the comments. They don’t actually delete anything.
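
A minimal sketch of the "deleted_at" pattern described above, using Python's built-in sqlite3 with an in-memory database. The table layout and the helper names (`soft_delete`, `visible_comments`, `stored_comments`) are made up for illustration; nothing here is known to match Reddit's actual schema.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per comment, with a nullable deleted_at column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " deleted_at TEXT"  # NULL means the comment is still live
    ")"
)
conn.executemany(
    "INSERT INTO comments (body) VALUES (?)",
    [("first",), ("second",)],
)


def soft_delete(conn, comment_id):
    """Tombstone a comment: stamp deleted_at instead of removing the row."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE comments SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, comment_id),
    )


def visible_comments(conn):
    """What the public site would show: only rows that are not tombstoned."""
    rows = conn.execute(
        "SELECT body FROM comments WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]


def stored_comments(conn):
    """What the database still holds, tombstoned or not."""
    rows = conn.execute("SELECT body FROM comments ORDER BY id")
    return [row[0] for row in rows]


soft_delete(conn, 1)
```

After `soft_delete(conn, 1)`, `visible_comments(conn)` no longer includes the first comment, while `stored_comments(conn)` still returns both rows. That gap is the whole point of tombstoning.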

              15 votes
              1. [2]
                langis_on

                I didn't delete my account, I used a plug in to delete all of my comments and posts individually

                1. CptBluebear

                  The idea of tombstoning is the same as you putting files in the trashcan on your PC. It's still there if you need to restore it, but it's gone from its saved location.

                  So while you deleted your posts on your end, in Reddit's database it's merely tombstoned: in the trashcan, but not gone.

                  Whether or not Reddit does that I don't actually know. It is standard to do so however.

                  5 votes
    4. [2]
      winnietherpooh

      This post was the kick in the pants I needed to finally delete my reddit account. Thanks y'all!

      I ended up storing/copying the lengthier quality comments I've made (not too many, I'm a bit of a lurker) in a Notion doc. However, I'm realizing that this may be ultimately useless as Notion now has AI built into it. Sigh.

      10 votes
      1. ThrowdoBaggins

        If you have the patience to wait up to 30 days, you could ask Reddit for a copy of your data first (looks like they legally have up to 30 days to comply, and you can bet they’ll use every minute they can get away with to delay your request)

        They send a CSV with a comprehensive list of every comment and post you’ve ever made with your account, so while it’s not super easy to search in future, you have all the data in a surprisingly compact file. Not sure if they also contain the images or videos you post, almost certainly not. But at least you’ll have the text.
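
For the "not super easy to search" part, here is a rough sketch of a keyword search over such a CSV using Python's standard csv module. The column names (`id`, `subreddit`, `body`) and the `search_export` helper are assumptions for the example; check the header row of the real export and adjust `text_column` to match.

```python
import csv
import io


def search_export(csv_file, term, text_column="body"):
    """Return the rows whose text column contains term (case-insensitive).

    text_column is an assumed column name, not a documented one.
    """
    term = term.lower()
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if term in (row.get(text_column) or "").lower()
    ]


# Stand-in for the exported file; with a real export you would use
# open("comments.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "id,subreddit,body\n"
    "a1,music,Great album and a great tour\n"
    "a2,python,Use csv.DictReader for exports\n"
)
matches = search_export(sample, "dictreader")
```

Each match is a dict keyed by the header row, so `[m["id"] for m in matches]` gives the ids of the matching comments.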

        3 votes
    5. [2]
      OBLIVIATER

      They almost certainly have already archived everything for a dataset

      8 votes
      1. blueshiftlabs

        The most useful dataset for Reddit to sell would be pre-ChatGPT posts and comments, so I imagine they archived stuff off ages ago.

        6 votes
    6. raze2012

      EU has Right to be Forgotten, and a few US states have similar data privacy laws. Otherwise: it may be too late.

      But as another said: 2nd best time is now

      8 votes
  3. [6]
    drannex

    With the incessant amount of AI posted garbage in the comments and posts these days, this is certainly ripe for horrific results. GIGO errors are certainly the norm these days it seems.

    20 votes
    1. [5]
      boredop

      Of course training AI on text written by actual Redditors might not be much better!

      14 votes
      1. [4]
        GunnarRunnar

        I guess it depends. It's real-life internet talk that can make a bot sound like a human, but it's also fucking repetitive. I don't know if there's any real informational value in there. People sharing their expertise is probably a thing of Reddit's past; there's a lot of it already there, but that has to have been scraped by bots already.

        7 votes
        1. [3]
          boredop

          I was referring more to all the typos and auto-correct errors. It's ugly out there!

          4 votes
          1. [2]
            GunnarRunnar

            Hah, never considered my atrocious grammar hurting LLMs. Now I can say I type like a preschooler for a purpose.

            7 votes
            1. Minty

              It doesn't hurt them, it improves them. It's a form of data augmentation. Makes the LLM more resistant to mistakes in the user input.

              Just type better :p
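
To make "data augmentation" concrete, here is a toy sketch of the idea: deliberately injecting typos into clean text so a model sees noisy variants of it. The `add_typos` function and its `rate` and `seed` parameters are invented for this example, and this is not a claim about how any particular LLM is actually trained.

```python
import random


def add_typos(text, rate=0.1, seed=None):
    """Swap adjacent letters with probability `rate` to imitate typing errors."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            # Transpose the pair, then skip past it so it is not swapped back.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


clean = "the quick brown fox"
noisy = add_typos(clean, rate=0.3, seed=42)
```

The noisy string always has the same characters as the clean one, just with some neighbours transposed. Pairing `noisy` inputs with the original targets is the usual way this kind of noise gets used.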

              10 votes
  4. [5]
    SloMoMonday

    My guess is that this is the real reason behind the API price increase. They must have seen a dramatic increase in queries when the big models started training and found out just what they were willing to pay for the data.

    And I'm fairly sure any user generated data pre-GPT is going at a premium. That well is completely poisoned by now and we're seeing AI generated content designed to influence AI generated content.

    But you'd be dumb not to plug your data into a training set, even an in-house one. All that streamed traffic and weather data. Decades of SCADA readings. Tax and medical files. If you have clean data, someone will pay for it right now.

    17 votes
    1. [3]
      teaearlgraycold

      Low background radiation training data.

      7 votes
      1. [2]
        CptBluebear

        Yeah, it's like pre-nuclear testing steel.

        5 votes
        1. OBLIVIATER

          I love this analogy, it feels so apt. Do you think the AI revolution will be as world changing as the nuclear revolution was? Will it be as scary? More scary?

  5. [4]
    Moonchild

    'Meh'...

    I have every confidence that unscrupulous companies will scrape whatever they want, willy-nilly. Stop engaging on reddit? Sure. Stop engaging on reddit in favour of tildes? A somewhat meaningless gesture. The social contracts of the internet, which have long been under attack, are now effectively completely dead—this is the final nail in the coffin. And there is hence an increasing tension, among the technically- and socially-conscious, with respect to the public sharing of information. I know one very talented computer programmer who has committed to not publicly sharing or releasing any of his future projects, for this reason. (Coincidentally, this happened around the same time as he shared a tip about a particular robot that was joining and surreptitiously logging discord channels on behalf of a chinese company. In case you doubted 'by hook or by crook...')

    The legal angle is interesting, but academic. Some have hoped the courts would rule that copyright can be laundered through machine learning models, and that the possibility of being able to launder all copyrights would be tantamount to repealing copyright altogether. Others have hoped they would rule that training ml models on copyrighted data is an infringing use. In point of fact, they have—very predictably (well, hindsight is 20/20, but I did call this one, and I think it's fairly obvious unless you sequester yourself in a cloister of formalism and uncoloured bits)—opted to minimise disruption and ruled that training ml models is legal, and infringing output is infringing output.

    11 votes
    1. [3]
      raze2012

      Stop engaging on reddit? Sure. Stop engaging on reddit in favour of tildes? A somewhat meaningless gesture.

      I see it less about screwing over companies (which is nigh impossible with our current culture and regulation) and more about trying to see and engage with the kinds of community you want to see. The mass and mainstream will never be such a community for someone who wants to see quality, nuanced, discussion on matters.

      there is hence an increasing tension, among the technically- and socially-conscious, with respect to the public sharing of information.

      Unfortunate, but inevitable among the tech conscious. I do wonder how big an impact this will have on open-source communities, knowing all that hard teamwork can be extracted by trillionaires. I imagine there will at the very least be a lot more shifts from MIT to GPL or other kinds of license.

      7 votes
      1. [2]
        Moonchild

        less about screwing over companies and more about trying to see and engage with the kinds of community you want to see

        You can certainly do that, but it is not what people in this thread are talking about. And reddit is not a monolith.

        2 votes
        1. raze2012

          And reddit is not a monolith.

          Debatable. In the same way, you can argue that Wal-Mart isn't a monolith in the strictest sense of the word. I think in a colloquial sense there are enough people saying "reddit is the only place for [niche community]" to suggest it has monopolies in dozens, if not hundreds, of small hobbyist communities. So many fear leaving reddit because there is literally no alternative for those niches.

          but it is not what people in this thread are talking about.

          I'd give them the same advice. I think we've seen at least 3 moderately-scaled attempts to boycott reddit and we know how well that went; it doesn't work. AI boundaries are being challenged in courts as we speak, so there's nothing to do there but wait. I wouldn't advise someone to delete their reddit on the basis of stopping the AI overlords from farming your content.

          Most of us can't save the world, but many of us collectively can seek our own preferences. 10,000 people leaving reddit won't even register as a blip to Reddit, but 10,000 migrating to their own community can easily make or break a new forum. If people could shift their mindset away from "killing big site" and towards "cultivating new site", we'd solve many of those niche alternatives overnight.

          11 votes
  6. Amarok

    Perhaps they can explain why one should pay them for this data when the entire history of reddit is archived right here for free. I've already got every comment and submission ever made to every music subreddit, thanks. Sometime when the AI tools manage to become performant and reliable I might see what they can make of it all.

    Everything posted post-exodus is trash compared to the content of those archives, since it's all downhill from here. They have nothing worth buying.

    5 votes
  7. [13]
    semitones

    I was just about to go on Reddit and edit my old posts/comments, but then I had a thought -- can the LLMs scrape tildes.net for training data as well?

    4 votes
    1. [3]
      talklittle

      Sadly yes. Every publicly facing website has its data up for grabs, and the scraping companies believe they can get away with it for the most part regardless of legality.

      I wonder if this trend will trigger a resurgence of private communities. It's not a 100% deterrent but would raise the barrier significantly.

      Otherwise I imagine some opinionated people may decide to opt out of the Internet as a communications platform entirely. This is very sad to me.

      15 votes
      1. [2]
        raze2012

        The most popular alternative people moved to during the API protests was indeed Discord. So that may not be as far fetched as we think. It ruins a lot of the point of public forums, but for a lot of modern audiences, they just want to get their quick word in or question asked and take off. For them, Discord makes a lot of sense.

        7 votes
        1. Thallassa

          Discord is using their user data as training data too.

          3 votes
    2. [7]
      teaearlgraycold

      @Deimos - perhaps we could have the OpenAI scraper added to the robots.txt? Of course, other bots will still scrape the content.

      10 votes
      1. [5]
        Deimos

        I've added it, following their instructions here: https://platform.openai.com/docs/gptbot

        Of course, it doesn't guarantee anything, but if there are any other AI companies to add that you know of, let me know.

        16 votes
        1. [2]
          cfabbro
          (edited )

          Speaking of, hungariantoast recently created a related Gitlab issue you might also want to take a look at:
          https://gitlab.com/tildes/tildes/-/issues/818

          Given how robots.txt is often ignored, I don't know how effective it would actually be though.

          5 votes
          1. teaearlgraycold

            The playbook is to collect data without any way to opt out when you're a little scrappy startup. Then, once you're a big name, you make a big announcement that you respect robots.txt now and that people can opt out of your data harvesting process.

            So I do think these big guys are respecting the opt out. Although they might acquire datasets from 3rd parties who don’t.

            5 votes
        2. [2]
          GunnarRunnar

          It's kinda disturbing that all this is opt-out instead of opt-in. Is there no default opt-out of everything? I guess either way everything gets scraped without permission anyway (and probably packaged and sold as its own package to those who want that type of data).

          4 votes
          1. teaearlgraycold

            There should be. robots.txt could block all bots with a couple of lines. But of course almost no one respects it.
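
As a sketch of what those "couple of lines" look like, the snippet below builds two robots.txt bodies, one that only blocks OpenAI's `GPTBot` user agent and one that blocks every user agent, and checks them with Python's standard `urllib.robotparser`. The `build_parser` helper and the example.com URL are placeholders invented for the example. Note that this only shows what a well-behaved crawler would conclude; nothing in robots.txt stops a bot that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Blocks only the GPTBot user agent.
BLOCK_GPTBOT = "User-agent: GPTBot\nDisallow: /\n"

# Blocks every crawler that honours robots.txt.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def build_parser(robots_txt):
    """Parse a robots.txt body from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


gptbot_only = build_parser(BLOCK_GPTBOT)
everyone = build_parser(BLOCK_ALL)

page = "https://example.com/some-topic"  # placeholder URL
gptbot_blocked = not gptbot_only.can_fetch("GPTBot", page)
other_bot_allowed = gptbot_only.can_fetch("SomeOtherBot", page)
other_bot_blocked = not everyone.can_fetch("SomeOtherBot", page)
```

With the first file, only `GPTBot` is turned away and any other agent is still allowed, which is why a per-bot entry does nothing about the other scrapers. The wildcard version turns away every compliant crawler, search engines included.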

            4 votes
      2. patience_limited

        Funny you should mention that... I was just about to post this article, which mentions that it's becoming impossible to use robots.txt because it's being actively ignored by AI scraping bots.

        14 votes
    3. [2]
      balooga

      If it’s on the internet, of course they can.

      3 votes
      1. semitones

        So the whole reason for Reddit hiking the API price is that the API was letting the AI scrape reddit more efficiently than just reading the public website?

        4 votes