57 votes

Sarah Silverman is suing OpenAI and Meta for copyright infringement

31 comments

  1. [15]
    spit-evil-olive-tips
    Link

    kudos to the Verge for linking to the primary sources:

    https://www.documentcloud.org/documents/23869693-silverman-openai-complaint

    https://www.documentcloud.org/documents/23869675-kadrey-meta-complaint

    as well as an appendix, which shows ChatGPT easily producing summaries of one of Silverman's books and of the books of her two co-plaintiffs:

    https://www.documentcloud.org/documents/23869694-silverman-openai-complaint-exhibits

    which seems to make it pretty clear that the training data included the full text of those books.

    27 votes
    1. [5]
      teaearlgraycold
      Link Parent

      I suppose this could also indicate it has read detailed summaries of the books.

      18 votes
      1. [4]
        spit-evil-olive-tips
        Link Parent
        • Exemplary

        as the article mentions:

        in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”

        it's pretty easy to find on GitHub

        for the "books3" dataset, they link to a tweet from 2020 from a guy talking about how he downloaded all of Bibliotik:

        Nonetheless, books3, released above, is "all of bibliotik", which I imagine will be of interest to anyone doing NLP work. Or anyone who wants to read 196,640 books. :)

        the Pile dataset, in addition to those pirated ebooks, includes content from Stack Exchange. their license is a Creative Commons Attribution / ShareAlike license, requiring attribution of the original author and requiring that derivative works also be released under a ShareAlike license.

        they also include a bunch of GitHub repos in the dataset. the download tool they wrote doesn't appear to take the license of each repo into account at all, so you'd have a mix of GPL / AGPL / BSD / etc. code all being fed into the same model.
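
        just to illustrate the gap: a minimal sketch of the license check that download tool could have done but apparently didn't, using GitHub's public license endpoint (the allowlist and repo list here are hypothetical):

        import requests

        # GitHub reports a repo's detected license via
        # GET /repos/{owner}/{repo}/license (404 if none is detected).
        ALLOWED = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "unlicense"}

        def license_ok(owner, repo):
            resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}/license")
            if resp.status_code != 200:
                return False  # no detectable license: safest to exclude
            spdx = (resp.json().get("license") or {}).get("spdx_id") or ""
            return spdx.lower() in ALLOWED

        repos = [("octocat", "Hello-World")]  # hypothetical repo list
        corpus_repos = [r for r in repos if license_ok(*r)]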

        so all in all I don't have any trouble believing that the people building these training models played fast and loose with copyright.

        41 votes
        1. [2]
          nacho
          Link Parent

          Thanks for this comment! I wish there were more focus on specifics for these cases.

          However, I don't think it comes down to what license each type of content was under. It's going to come down to how enforceable those licenses are, and under what conditions. I can try to enforce unreasonable, unequal terms of use for different content all I want, but in most places, "if you're reading this then you agree to these terms"-conditions aren't worth the paper they're written on.

          I'd love a bright legal mind's take on robots.txt and how enforceable those sorts of internet standards are. I'd guess not very.
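
          (To be clear about how toothless robots.txt is technically: compliance is purely voluntary. Here's a sketch of the check a polite crawler performs, using Python's standard library; the URLs are placeholders.)

          from urllib.robotparser import RobotFileParser

          # a well-behaved crawler checks robots.txt before fetching a page;
          # nothing but convention stops a crawler that skips this check.
          rp = RobotFileParser("https://example.com/robots.txt")
          rp.read()
          print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))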

          The crux is what constitutes fair use. This is largely untested in the courts of many, many countries. It will probably also be pretty decisive for things like internet archives. I mean, could I just scrape these archives instead of sites directly to circumvent many of the copyright issues?


          As a thought experiment:

          Could I borrow a digital ebook at the library, then churn it through my AI-model?

          Could I do that with a physical book, OCR and automatic flipping of pages?

          Are these situations equivalent? Think Google books-case.
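
          Mechanically, at least, the physical-book route is trivial. A sketch assuming page photos from an automatic flipper and the pytesseract OCR bindings (all file names hypothetical):

          from PIL import Image
          import pytesseract

          # OCR each photographed page and concatenate the text into a file
          # that could then be fed into a training pipeline.
          pages = ["page_001.png", "page_002.png"]  # output of the page flipper
          text = "\n".join(pytesseract.image_to_string(Image.open(p)) for p in pages)
          with open("book_for_training.txt", "w") as f:
              f.write(text)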

          AI training might be the first time huge corporations send armies of lawyers to deal with Big Copyright, which has largely stood unchallenged by serious commercial interests since Disney's tour de force over the last decades.

          9 votes
          1. vektor
            Link Parent

            I mean, could I just scrape these archives instead of sites directly to circumvent many of the copyright issues?

            My understanding is no, you can't. There's no way to wash the copyright off. If someone has an illegitimate copy of my works and they make it available to you, that does not make your copy legitimate. The fact that you omitted the original author's name doesn't end your obligations to them. Which is IMO a giant problem with copyright of all kinds of scraped datasets. Crawled comments off reddit? Well, considering that I think their terms of use don't hold too much water, I think there's a case to be made that all reddit crawls, even with the consent of reddit, are in violation of copyright. Big, if found to be true by a court of law. There isn't that much stuff out there if you eliminate everything with questionable copyright.

            Well, unless we carve out legal exceptions for ML models, but that would have to dance around the very real prospect of copyright whitewashing. Considering the way we do LLMs these days and the overfitting they seem to be doing (see elsewhere in this thread for the excerpts of books produced by GPT...), I don't think there currently exists a way to grant a usefully broad fair-use clause while also avoiding whitewashing.

            4 votes
        2. teaearlgraycold
          Link Parent

          It should be easy enough to prove that OpenAI illegally downloaded a ton of different books and sue over that - like with the old Napster cases. This time going after mega corps instead of single moms. But I would hope that the models themselves aren't C&D'd as I really like using them.

          2 votes
    2. [8]
      AshWilliamss
      Link Parent

      Any English teacher post-Sparknotes will tell you that being able to summarize something and actually having read and understood something are two totally different things. I'd be willing to bet that there are summaries of this book provided by the author and the publishing company, same as most books. I feel the onus is on Silverman and company to prove that ChatGPT was fed their entire work.

      This doesn't take into account the fact that people discuss books online in forums much like this one - sometimes in great detail. I would wager that I could produce a very accurate summary of a book if I could read all available synopses and then read a ton of forum posts about said book, without ever actually having read the book.

      It's a tall claim to prove.

      15 votes
      1. sparksbet
        Link Parent

        I feel the onus is on Silverman and company to prove that ChatGPT was fed their entire work.

        The good thing about the lawsuit is that, assuming it makes it far enough, there'll be discovery, where Silverman's lawyers can get access to documents and info relevant to the case from OpenAI. It shouldn't be too difficult to verify whether or not Silverman's book is in the massive book training data set that OpenAI used.
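
        That verification could be as crude as scanning a dump of the dataset for distinctive passages. A toy sketch, assuming discovery produced such a dump (the file name and passage are placeholders):

        # scan a (hypothetical) plain-text dump of the training data
        # for a distinctive sentence from the book in question.
        needle = "a distinctive sentence from the book"
        with open("books2_dump.txt", errors="ignore") as f:
            for lineno, line in enumerate(f, start=1):
                if needle in line:
                    print(f"found at line {lineno}")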

        I agree that the summary isn't fantastic evidence, but they need something to base their claims on in order to file suit and actually get to the point where they can make OpenAI show them whether they used the book(s) in question.

        21 votes
      2. [6]
        Athing
        Link Parent

        They are using "books1" and "books2", and very little is known about what was in them. Discovery should uncover what was in these data sources. Books1 is likely "BookCorpus"; apparently very little is known about books2, which is much larger than books1.

        So here's something... I asked ChatGPT to "give me the first sentence of Max Porter's novel 'Lanny'" (an interesting novel I just happened to have close at hand) and it returned (after a follow-up prompt to get it on track when it initially gave the first sentence of the jacket summary):

        Apologies for the confusion. Here is the actual first sentence of Max Porter's novel "Lanny":

        "Dead Papa Toothwort wakes from his standing nap an acre wide and scrapes off dream dregs of bitumen glistening thick with liquid globs of litter."

        Which is a word-for-word exact quote of the text. There is some copyright infringement happening here.
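
        For anyone who wants to reproduce the probe, a minimal sketch using the openai Python client (the model choice and prompt wording are placeholders, and results will vary across model versions):

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        # ask for the opening sentence of the novel, steering it away
        # from the jacket copy as described above
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model
            messages=[{
                "role": "user",
                "content": 'Give me the first sentence of Max Porter\'s novel '
                           '"Lanny", quoting the text itself, not the jacket copy.',
            }],
        )
        print(resp.choices[0].message.content)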

        15 votes
        1. [2]
          Comment deleted by author
          Link Parent
          1. sparksbet
            Link Parent

            I tried something similar to @Athing, and while I likewise had to redirect it to quote the book itself rather than marketing about the book, it was able to quote longer passages of the book I chose. It did choose fairly popular, quotable passages, however (though it was wrong about the chapter numbers it gave -- the first quote is from Chapter 10, and the second is from Chapter 20). When I asked it to continue, it did not provide the actual continuation of the text -- the text it added to the first quote is nonsense, so I'm pretty sure it just made it up, but the text it added to the second one does sound like something from the book, though I don't know for sure whether it's an actual quote or just similar to one.

            These passages are mostly pretty easy to find on Google, but I used the free version of ChatGPT so it doesn't have access to Google. It is possible that rather than getting these passages from the book itself, though, it instead got them from public web-scraping of places like reddit or tumblr. I don't know if this particular book is in BookCorpus or books2 (though given its relative popularity and the size of those datasets, I'd be surprised if it wasn't).

            Whether this constitutes copyright infringement is a completely different matter, since it's the input to ChatGPT rather than its output that's in question here. Even if ChatGPT couldn't generate passages from novels (I suspect OpenAI may have some vested interest in it not doing that for the obvious legal reasons), it's an open legal question whether OpenAI's use of entire copyrighted novels without a license to train the model in the first place constitutes copyright infringement or whether it's fair use.

        2. smithsonian
          Link Parent

          It would probably be more telling to use a middle chapter, as it's not that uncommon for the first chapter of books to be published online as a sample.

          If I search Google for "Lanny max porter dead papa toothwort wakes", I can see at least 6 results on the first page which have that exact opening sentence.

          8 votes
        3. [3]
          updawg
          Link Parent

          ChatATHING, please give me the first sentence of A Tale of Two Cities. Or at least the first two clauses.

          1. [2]
            arch
            Link Parent

            A Tale of Two Cities is in the public domain. I'd argue that any AI that isn't trained on it is a very poorly trained one. Or has a narrow scope.

            1. updawg
              Link Parent

              Yes, but my point was just that knowing an opening sentence doesn't indicate that you have read the book.

              1 vote
    3. MaoZedongers
      Link Parent

      Yeah, text AI tends to plagiarize a lot from what I've seen. It happens with Copilot too, which can reproduce something 1:1 -- pretty bad for accidentally violating a license or plain stealing code from restricted projects.

      It's also why a lot of companies forbid using it since it can accidentally leak secrets from the company's code and also cause legal issues.

      4 votes
  2. [11]
    drannex
    (edited )
    Link

    The more that people sue, the higher the chance that someone wins, and then the floodgates open. I really don't see this ending without OpenAI and others locking up their info in such a way that it's nearly impossible to tell what the source was, or going so far as to commit the deadly AI sin of using their models to train other models and using that as a scapegoat.

    22 votes
    1. [3]
      Comment deleted by author
      Link Parent
      1. [2]
        Comment deleted by author
        Link Parent
        1. sparksbet
          Link Parent

          However, none of this might be a problem, as to violate copyright, you have to reproduce an actual creative work. You can't copyright a vague idea or style. AI reuses higher-level patterns in the works and remixes them, not the actual works themselves. It's extremely transformative and falls under fair use. It's quite telling that in all the months of "AI is stealing our stuff", actual examples of AI stealing stuff have been basically non-existent.

          The argument here would be that OpenAI reproduced the work in the process of training the model on it. ChatGPT's output is indeed unlikely to reproduce a complete copy of any of its inputs (if it did that would be really bad) but it's not a given that using someone's work as training data is fair use. That question hasn't been legally decided yet and it's honestly a pretty close call imo. My personal stance is that it's pretty unfair to allow models to be trained on copyrighted work without permission from or compensation to their creators, particularly with image generation imo, since they're more likely to usurp the market of the original.

          There is also the issue that you could easily overfit a generative model on a particular dataset to the extent that it regurgitates sentences or paragraphs wholesale. That's unlikely to happen with something like ChatGPT, since it uses such a large volume of data and its creators want it to be general-purpose, but this isn't necessarily the case for all models. It's very easy to overfit a model and it's only bad if you care about versatility, so you could easily make a generative AI that just generates passages from Sarah Silverman novels based on the input, and I think that's a clearer-cut case. Where do we draw the line?
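
          As a toy illustration of that regurgitation failure mode (deliberately not an LLM, just the simplest possible generative text model): a word-level Markov chain built from a single text can do little but replay spans of its source. The file name is a placeholder:

          import random
          from collections import defaultdict

          # "train" on exactly one book: record which word follows which
          words = open("single_novel.txt").read().split()
          chain = defaultdict(list)
          for a, b in zip(words, words[1:]):
              chain[a].append(b)

          # generate: with only one source text, long runs of the output
          # reproduce that text verbatim -- the overfitting effect above
          word = random.choice(words)
          out = [word]
          for _ in range(50):
              followers = chain.get(word)
              word = random.choice(followers) if followers else random.choice(words)
              out.append(word)
          print(" ".join(out))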

          7 votes
      2. shinigami
        Link Parent

        I think the primary concern with "the deadly AI sin of using them to train other models" is that you end up with an AI that is on a complete runaway, nowhere close to the original intent of the model. But from my understanding, that's not how AI really works.

    2. [8]
      Gopher
      Link Parent

      Hearing about these cases really bums me out. At least for the ones that let the public use it for free, this AI is more valuable to me than Sarah Silverman's book. Oh well.

      1 vote
      1. [7]
        sparksbet
        Link Parent

        It's not really a matter of how useful the AI is, but rather that they shouldn't be using people's copyrighted work without their permission (and ideally some form of compensation). I work in AI and we go through a lot of trouble making sure we're licensed to use the data we train on. Authors deserve to be compensated for their work (and in OpenAI's case they've used the work of hundreds or thousands of authors for certain).

        11 votes
        1. [4]
          Caliwyrm
          Link Parent

          I've lurked in threads about AI learning for some time and I'm drawn to the morality aspect of it. Since you work in AI, could I please ask a few questions of someone directly involved:

          It is my understanding that if you train an AI on 10,000 books and ask it to write something, it is not practical to ask it for a weighted attribution (7% from Book A, 5% from Book B, etc.), any more than if I wrote a story myself. Am I correct?

          As an honest question, how would you differentiate why Sarah Silverman doesn't have to compensate her writing influences but an AI dev team does/should?

          Is it the number of books used to train an AI that is an issue? If it is related to this, would that imply that there is a "magic number" that is ok/not ok to use?
          Is it because, since it is a computer, the thought is that everything must be measurable in some fashion?
          Is it a "shakedown" from various parties due to the "big tech=big money" belief? (Similar to how news sites used to go after Google instead of using robots.txt to opt out or how ASCAP went after the Girl Scouts for singing around campfires?)

          What is an ethical way you would resolve this? (preferably without creating a middle-man money sucking entity akin to ASCAP that still doesn't pay the artists)

          One of my concerns is that any "law" they put on AI generated anything will be applied to low-tier or new artists as another barrier of entry. "Pay us our 'dues' before you can publish your book/sell your art because we know you were influenced by someone."

          (I feel justified in my concern as a result of having to deal with ASCAP and BMI in a restaurant setting and hiring musicians who only played 100% original music, no covers.)

          1. [3]
            sparksbet
            Link Parent
            • Exemplary

            It is my understanding that if you train an AI on 10,000 books and ask it to write something, it is not practical to ask it for a weighted attribution (7% from Book A, 5% from Book B, etc.), any more than if I wrote a story myself. Am I correct?

            This would fall into AI explainability and yes it's generally not something we can do. It's something people are very much actively researching, because we'd really like more insight into the black box, but yeah you're correct. If anything it's easier to get a weighted attribution if you wrote a story yourself, because you have enough metacognition to guess reasonable numbers lol

            As an honest question, how would you differentiate why Sarah Silverman doesn't have to compensate her writing influences but an AI dev team does/should?

            It's highly likely that whatever copies of the books Sarah Silverman read that influenced her writing were paid for, either by her or by a library or something. Pirating books is already illegal, and what OpenAI did is at minimum equivalent to that (and iirc might just... literally be that).

            There's also the issue that while this AI is just influenced in some small way based on Sarah Silverman's book, since it's one in a huge number of texts, that's not something you can guarantee about any given AI model. I could totally train a generative model on ONLY Sarah Silverman's books, and I could overfit it so that it basically just generates passages of her books. I'd be using her works the exact same way OpenAI uses them.

            Is it the number of books used to train an AI that is an issue? If it is related to this, would that imply that there is a "magic number" that is ok/not ok to use?

            If anything I think the sheer number of books might theoretically help them here, since an AI trained on just one book would much more obviously just reproduce that book's text. Huge amounts of data are how you get these big models that produce, on the whole, very generalized outputs rather than just imitating a given input. But I do think the scale of the issue might open them up to more damages if this lawsuit goes poorly for them, a la Napster.

            Is it because, since it is a computer, the thought is that everything must be measurable in some fashion?

            I don't really think anyone needs to have things be measurable here -- the issue isn't the degree of influence that any given work has on the output but that the works were used in training the model. I don't think you really need a way to measure how much works contribute to the AI's output to assess that.

            Is it a "shakedown" from various parties due to the "big tech=big money" belief? (Similar to how news sites used to go after Google instead of using robots.txt to opt out or how ASCAP went after the Girl Scouts for singing around campfires?)
            One of my concerns is that any "law" they put on AI generated anything will be applied to low-tier or new artists as another barrier of entry. "Pay us our 'dues' before you can publish your book/sell your art because we know you were influenced by someone."

            These two basically have the same answer, so I'm combining them. The big tech firms that are training LLMs do have big money. You HAVE to have big money to afford the resources necessary to train a model of this size. When smaller players use LLMs they use one of the pre-trained models from one of these massive tech firms and then fine-tune it on further data of their own depending on what they specifically want to use the model for.
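
            A sketch of that fine-tuning workflow, assuming the Hugging Face transformers and datasets libraries (the base model, file name, and hyperparameters are all placeholders):

            from datasets import load_dataset
            from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                      DataCollatorForLanguageModeling, Trainer,
                                      TrainingArguments)

            # start from a big pre-trained model...
            tok = AutoTokenizer.from_pretrained("gpt2")
            tok.pad_token = tok.eos_token  # gpt2 ships without a pad token
            model = AutoModelForCausalLM.from_pretrained("gpt2")

            # ...and fine-tune it on your own (licensed) text
            ds = load_dataset("text", data_files="my_licensed_corpus.txt")["train"]
            ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=128),
                        batched=True, remove_columns=["text"])

            trainer = Trainer(
                model=model,
                args=TrainingArguments(output_dir="out", num_train_epochs=1),
                train_dataset=ds,
                data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
            )
            trainer.train()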

            It seems natural to me that these smaller players should only be responsible for the data they themselves acquired and used to fine-tune while the larger players have a duty to compensate for the massive amounts of data they're using. If you can afford to train such a big model from scratch, you can afford to pay for the input data. And honestly, it's not hard to use open source datasets and other data that you're licensed to use as a smaller player. We do that where I work, and we're many orders of magnitude smaller than these AI giants. It's arguably unfair currently that huge players like OpenAI don't have to do that -- in part because everyone knows they'll be expensive AF to sue.

            What is an ethical way you would resolve this? (preferably without creating a middle-man money sucking entity akin to ASCAP that still doesn't pay the artists)

            I don't really know how ASCAP works under the hood, so I can't really do comparisons there. But generally I do think you should be required to have a license to use a copyrighted work to train your model. Will that suck and cost money for those training these absolutely massive models? Yeah. But tbh, I think it's worth it because it's fundamentally unethical to take someone's work and use it without permission for something like this. And these big tech giants will absolutely use every ounce of leeway you give them when it comes to acquiring training data.

            Licensing every individual book you want to train on one-by-one is going to be prohibitively difficult, so I wouldn't be surprised if some sort of middleman cropped up, even if it were just the publishers that currently exist (they're already who you contact to license other rights, after all). I could see individual publishers selling access to their entire catalogs for AI training for some sort of fee, and authors presumably opting in or out of that based on their contracts with the publisher. But the specifics of how licensing would be handled are much more outside my area of expertise. I don't know enough about why ASCAP sucks to have ideas on how to avoid the same pitfalls. Someone with more legal knowledge would have to weigh in on how to best arrange something like that so that individual creators would be compensated well enough for it.

            It is worth noting, though, that this wouldn't be a burden on anyone in the same position as, say, a restaurant hiring live musicians. Publicly available datasets for stuff like this generally have licenses that you can read to see whether you're licensed to use that data. It's already possible to buy datasets (though I don't know much about that, since we don't do that). Smaller players are already making use of those tools, and it's not hard to play "on the safe side" with licenses. The only people who would absolutely have to deal with a licensing requirement like this would be the massive tech giants, and it's not something they could possibly end up "surprised" by. They definitely have the resources to ensure they're in compliance with licensing requirements for data they train on if that ends up being required of them.
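
            As one concrete example, public dataset cards usually declare their license, and you can check it programmatically before training. A sketch assuming the huggingface_hub library (the dataset id is a placeholder):

            from huggingface_hub import dataset_info

            # look up the dataset card and read its declared license
            info = dataset_info("wikitext")  # placeholder dataset id
            card = info.cardData.to_dict() if info.cardData else {}
            print(card.get("license", "no license declared"))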

            5 votes
            1. [2]
              Caliwyrm
              Link Parent

              I genuinely thank you for your responses. It has been incredibly insightful for me to learn a few things, reinforce some things I thought and be influenced by other lines of thinking that I hadn't considered.

              1. sparksbet
                Link Parent

                Oh damn thanks! I'm super glad you got something out of what I wrote, it felt a bit like rambling in the moment.

        2. [2]
          Gopher
          Link Parent

          I'm still not excited, but I'm one of those who see copyright as an unjust law. If I could rewrite copyright I'd have it last 5, maybe 10 years before the works become public domain. Disney is my arch nemesis.

          1. sparksbet
            Link Parent

            We'll have to agree to disagree there, I think. Copyright law definitely has its flaws but it doesn't only affect big companies like Disney, and it's the small creators who are really going to be screwed over if training machine learning models is held to be fair use -- especially visual artists, whose work can be more easily replaced and replicated by the resulting models when big companies use their work to train them.

            2 votes
  3. [3]
    blindmikey
    Link

    We need to be very careful penalizing AI learning. I'm all for making certain produced works illegal, just as you would for a human, but if you make learning/training itself illegal, you'll deal a death blow to open public AI, leaving closed private corporate AI to flourish to our detriment. The balance of power will be tipped, possibly permanently.

    Keep this simple - if it's illegal for a human it should be illegal for AI.

    3 votes
    1. [2]
      ix-ix
      Link Parent

      Is downloading 200,000 books from an archive illegal for humans? This is not a case about "should an LLM be allowed to read", it's a case about "should they be allowed to illegally download books for it to read?"

      3 votes
      1. blindmikey
        Link Parent

        So Sarah Silverman's case would be dropped if Open AI had paid for a copy of these books? That's certainly not my understanding.

        3 votes
  4. pete_the_paper_boat
    Link

    Interesting in the case of Meta, to go after the users of a dataset, rather than the creators of said dataset.

    I suppose the zuck does have more money.

    2 votes
  5. loz
    Link

    Come on, don't bring Bibliotik into this. I struggle to believe that any major AI company is harvesting data from a private tracker, especially one that's such a pain in the behind to get onto and maintain ratio on.