36 votes

Two authors file a lawsuit against OpenAI, alleging that ChatGPT unlawfully ‘ingested’ their books

39 comments

  1. [15]
    Wulfsta
    (edited )
    Link

    I don’t understand how if I read and publish a summary of their books it would violate their copyright, which seems to be effectively what they’re suing for?

    Additionally, using their works to create something like ChatGPT would be a transformative use of the material anyways, wouldn’t it?

    I’m just baffled about what their standing is.

    Edit: See @Algernon_Asimov’s comment here.

    14 votes
    1. [6]
      pienix
      Link Parent

      If you would have stolen or illegally downloaded the book, then write and post a summary, and they would be able to prove that where you got the information from was illegally obtained, I believe they would have a point. The problem is not the summary itself, but where ChatGPT got the information.

      However that might (and probably is) not the case here. There is no reason to assume that the information didn't come from a plethora of texts, articles, discussions,... freely and legally available. And even if it wasn't, there's no way to prove that.

      It's true that we don't truly know what data exactly was used to train the model, but I for one, am giving them the benefit of the doubt for now.

      10 votes
      1. psi
        Link Parent

        There is no reason to assume that the information didn't come from a plethora of texts, articles, discussions,... freely and legally available. And even if it wasn't, there's no way to prove that.

        Presumably OpenAI still has access to their own training set, so it should be possible (in principle) to search OpenAI's corpus for the author's works.

        Obviously the authors can't do that right now, but they don't need to prove anything with certainty at this point in the lawsuit -- they only need to provide enough evidence so that the lawsuit can move forward to discovery, at which point I assume OpenAI would be compelled to make such a search.
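
        As a sketch of what such a corpus search could look like (my own illustration, not anything from the filing - the n-gram approach and function names are assumptions): checking what fraction of a book excerpt's word n-grams appear verbatim in a corpus document is a common way to detect near-verbatim inclusion.

        ```python
        def ngrams(text, n=8):
            """Return the set of n-word shingles in a text."""
            words = text.lower().split()
            return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

        def overlap_score(corpus_doc, book_excerpt, n=8):
            """Fraction of the excerpt's n-grams found verbatim in a corpus
            document. A high score suggests near-verbatim inclusion."""
            book = ngrams(book_excerpt, n)
            if not book:
                return 0.0
            return len(book & ngrams(corpus_doc, n)) / len(book)
        ```

        In discovery, something along these lines run over the training corpus would answer the question directly, rather than relying on circumstantial inference.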

        8 votes
      2. [3]
        Algernon_Asimov
        Link Parent

        You might want to read this comment I just posted, with an extract from the legal complaint.

        3 votes
        1. [2]
          pienix
          Link Parent

          Thanks, if that is indeed the case, then a) they definitely might have a case, and b) very stupid from the part of Openai.

          PS: Flowers for Algernon is a great novel ;⁠-⁠)

          1 vote
          1. Algernon_Asimov
            Link Parent

            PS: Flowers for Algernon is a great novel ;⁠-⁠)

            I prefer the original short story. :)

            2 votes
      3. Wulfsta
        Link Parent

        Aha, this does seem to be the point. Thanks for clarifying!

        1 vote
    2. Algernon_Asimov
      Link Parent

      I don’t understand how if I read and publish a summary of their books it would violate their copyright,

      They're saying that ChatGPT is able to produce a summary of their books because it has "read" their full texts. And, as per my other comment, they allege that OpenAI obtained those books illegally.

      4 votes
    3. [7]
      guamisc
      Link Parent

      I would have to assume that they would have to make a copy of the training set of data and then feed it into the model. Does the model run directly on data currently on the internet, or on a local copy? It's my understanding that creating a local copy would run afoul of copyright law.

      ... Not that I generally agree with the US's abomination of copyright laws in the first place.

      1. [4]
        Wulfsta
        Link Parent

        The model does not store the training data, and while it can make calls to APIs (tool use) to retrieve data or do calculations, it likely is not looking at this data when providing a summary (though as @psi mentions elsewhere in this thread, they could be supplementing the results with a search).

        1. [3]
          guamisc
          Link Parent

          Right, but what matters is the training set itself here.

          I assume they have to make a copy of the data before training the model. They don't just let the model train on "https://Lookabook.com/hereisabook" for instance.

          1 vote
          1. [2]
            Wulfsta
            Link Parent

            Ah, yes - I misunderstood your point.

            1 vote
            1. guamisc
              Link Parent

              No worries, it wasn't clear enough and could easily be understood the way you took it.

              1 vote
      2. [2]
        zazowoo
        Link Parent

        ... Not that I generally agree with the US's abomination of copyright laws in the first place.

        I'm curious to hear more about this. What is the US doing wrong with copyright, and how could it be better?

        1. guamisc
          (edited )
          Link Parent

          TL;DR - copyright protections only exist to better humanity, copyright is theft from humanity, and copyright law currently just enables abusive monopolies.

          I will go back to the copyright clause in the Constitution:

          To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

          We'll start from the parts I think are most egregious and work backwards.

          1. "Limited times" should mean there is some reasonable end date, which I would peg at well before multiple generations are born and die. Current copyright terms are the lifetime of the author plus 70 years, or 95/120 years for anonymous/work-for-hire works. You cannot reasonably call a term "limited" when multiple generations are born and die before it expires. The original copyright term was 14 years, plus a 14-year extension. We should go back to that.

          2. Copyright exists "to promote the progress of science and the useful arts", NOT to generate profit or protect intellectual property rights. Such "rights" only exist to better humanity by giving creators an incentive. Copyright is a theft against humanity and, historically speaking, a recent creation. We got along just fine when knowledge was shared, published, handed down, etc. for literally thousands and thousands of years.

          3. By expanding copyright beyond "authors and inventors" to include immortal, amoral, soulless profit-seeking engines, we've moved even further from what copyright (and patent) law is supposed to be doing: bettering humanity.

          4. Lots of patents have been granted for things that fail the novelty and non-obviousness tests, which perverts the meaning of "writings and discoveries". There are entire classes of patents that are obvious to experts in the relevant field, or that simply transmute ordinary business or mathematical processes into digital form.

          If you want to keep your "IP", don't publish it.

          1 vote
  2. [4]
    Handshape
    Link

    This is the elephant in the room for generative AI, and LLMs in particular.

    Proponents tend to assert that "if it's on the internet, it's fair game", or "it's the same as if a person did it", or "each work makes up only a tiny part of the model so it's diluted".

    All of these ignore the training sets. These sets likely contain all manner of copyrighted, licensed, and personal information in forms that are not substantially different from their originals. They're almost certainly going to be derivative works in the eyes of most systems of law.

    Are LLMs powerful and cool? Absolutely... but they're also only half of what the big LLM-based products are doing as a service - they're also "laundering" training data.

    At the very least, the undisclosed training sets are going to prevent privacy-conscious governments from adopting the big LLM products unless some kind of regulatory change happens.

    7 votes
    1. [2]
      pienix
      (edited )
      Link Parent

      These sets likely contain all manner of copyrighted, licensed, and personal information in forms that are not substantially different from their originals

      Why do you assume that? I would think a company is not going to burn itself by illegally downloading a bunch of books, especially when there is more than enough data freely available. And if they bought the books, or a license to read them, I don't really see an issue, as the model isn't actually copying anything.

      Edit: @Algernon_Asimov posted this. It might very well be that a lot of illegal copies were used for training.

      4 votes
      1. Trauma
        Link Parent

        I would think a company is not going to burn itself by illegally downloading a bunch of books, especially when there is more than enough data freely available.

        There actually isn't "more than enough" data freely available, but even if there were, you're forgetting that these companies are all racing each other. Just by coming in a package that's more easily incorporated into your data set, an illegal source may be more attractive than a legal one. Particularly because, until now, the actual consequences of doing so were nil.

        If your chatbot has become synonymous with LLMs worldwide like ChatGPT has, having bent a few laws to get there was very likely worth it. You can deal with the fallout once you're established - in this case OpenAI now has enough backing to just buy the rights to every single book in the Z-Library catalogue and pay lawyers and damages until the problem goes away.

        This wild-west mentality is common in startups (remember Facebook?) and shouldn't really surprise you; given that they usually get away with it, it's not stupid, either.

        2 votes
    2. ourari
      Link Parent

      they're also "laundering" training data.

      Yup. Dutch journalists managed to shed some light on what Dutch-language sources might have been used. It's not good.

      Docplayer.nl was for a long time one of the Internet's main pirate nests and a gold mine for hackers. They could go there to retrieve private data from data leaks or traces of [Dutch National Intelligence] reports lying around. It contains fully completed resumes and tax returns with real people's names and [Dutch social security numbers]. With that data, criminals can commit identity fraud or break into people's homes.

      The website is the brainchild of Russian Internet entrepreneur Vladimir Nesterenko. He built a system that fully automatically scours the Internet in search of all kinds of files, including leaked information. In 2017, it turned out that he had collected 4.3 million documents from 20 countries. The Personal Data Authority, the police and the National Cybersecurity Center agreed: what docplayer.com is doing is not allowed.

      Meanwhile, large tech companies are also abusing this unlimited collecting frenzy to make a profit. Docplayer.nl is the main Dutch-language source for chatbots, according to research by De Groene Amsterdammer and Data School.

      and

      Tech companies have recently become very closed about the sources they use, but most of the Dutch texts on which AI models such as ChatGPT are trained come from the Common Crawl database. That is a sort of blueprint of the entire Internet. That list is used by different companies in different ways. We looked at the texts that Google extracted from it: the MC4 dataset. In addition, we compared the way Google sorts texts with the filter of GPT-3, the technology behind the mega-popular ChatGPT, and we saw no significant differences. So they probably use almost the same sources.

      That list includes more than forty billion words - enough to fill more than half a million novels - and appears to be rife with copyright violations, private information and fake news. In the top two hundred most-quoted websites we found Wikipedia and just about every major Dutch newspaper, as well as the neo-Nazi conspiracy website Stormfront. The latter is only one place lower in the source list than RTL News, so AI learns about as much from both websites.

      The once-obscure docplayer.nl turns out to be the most important source. This means that private information - such as documents containing evaluations of job applicants - can no longer be found only by hackers in a relatively unknown location, but can now, with the right questions, be dug up by commonly used chatbots. And once a chatbot has seen data, it won't forget it easily.

      Translated with Deepl, cleaned up by me.

      Here's the article which you can translate with a service of your own choice: https://www.groene.nl/artikel/dat-zijn-toch-gewoon-al-onze-artikelen

      De Groene Amsterdammer is kind of like our The Atlantic, but with an investigative journalism wing.

      On a related note, some news from about a week ago: OpenAI Sued for Using 'Stolen' Data, Violating Your Privacy With ChatGPT.

      3 votes
  3. [3]
    bloup
    Link

    Some thoughts:

    1. I don’t like how the lawsuit specifically is about the act of training the model. I think as far as copyright is concerned, you should be free to do whatever you wish with your licensed materials as long as you aren’t reproducing it at industrial scale or trying to profit off of it. For instance, in the free software community, there are many examples of businesses developing and extending copyleft-licensed software for their own internal purposes without sharing their extensions, and it’s never really been considered controversial until they start trying to market it as a proprietary solution. Obviously, OpenAI is selling a product based on the copyrighted training data, so this situation is different; however, I’d hate to see some kind of legal precedent set that prohibits a hobbyist from building a personal LLM from their personally owned book collection for fun.

    2. While there is definitely tons of unlicensed copyrighted material in the dataset, and this is something that needs to be addressed, I ran some tests. While it’s certainly not conclusive, for at least one of the authors, Mona Awad, I do not believe that ChatGPT was ever trained on actual verbatim copies of her books. Doing some initial research, it seems her most successful book, and the one most likely to have been incorporated into the dataset, is “13 Ways of Looking at a Fat Girl”. It’s a collection of short stories, so I chose a random story in the middle and asked ChatGPT to generate the first paragraph of that story, and it refused, saying it would violate copyright law. I tried to be a little more clever and instead asked it to “analyze the first 20 sentences line by line”, and it happily complied and seemed to do what I asked - until I referred back to the actual book and found ChatGPT had made up every line beyond the first one that I gave for free. I regenerated the response multiple times, never getting anything that even came close to the actual content of the book. I don’t think this is the result of some kind of stealth avoid-copyright-by-generating-nonsense routine; I have successfully gotten ChatGPT to generate verbatim excerpts of other copyrighted books in the past, but only for very well-known ones.

    3. The financial losses of authors due to ChatGPT will be exceedingly difficult to assess. That isn’t to say they should not be compensated when another business profits off of their unlicensed work. But nobody is pirating books via ChatGPT. I think determining an author’s losses would have to be based on some prior existing statute that makes clear what an author is owed when their work is incorporated into a commercial LLM, but unfortunately we don’t have that.

    4. There absolutely should be a legal requirement for businesses offering any kind of commercial machine learning service trained on “publicly available data” to publish all their data sources completely in full, with routine audits by a regulatory agency certifying compliance. The fact that OpenAI gets even more secretive about their data sources under growing scrutiny leads me to believe that OpenAI executives believe there would be tremendous public outrage if it ever became clear where OpenAI gets its data from, which I find pretty troubling!
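
    The kind of check described in point 2 can be made quantitative. A minimal sketch (my own illustration, not how anyone involved actually measured it): score the model's "line by line" output against the real passage with a string-similarity ratio, where values near 1.0 would suggest memorization and low values suggest the model is confabulating.

    ```python
    import difflib

    def verbatim_ratio(generated, actual):
        """Similarity ratio between a model's continuation and the real
        passage: near 1.0 suggests memorized text, low values suggest the
        model is making the lines up."""
        return difflib.SequenceMatcher(None, generated.lower(), actual.lower()).ratio()
    ```

    Regenerating several times and averaging, as described above, guards against a single lucky or unlucky sample.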

    4 votes
    1. Wulfsta
      Link Parent

      I think your second point probably shows that they are not supplementing the response with a search for the source material.

      1 vote
    2. Trauma
      Link Parent

      In response to your second point - I'm afraid that proves very little. Very popular, old, well-known works are likely fed to LLMs more than once, by way of being part of multiple data sets, and are much more often quoted and referenced in other texts. That ChatGPT is able to reproduce any work is surprising; that it is unable to do so for a newer, less impactful book is not, and it is in no way proof, or even an indication, that the book wasn't part of the training set.

      1 vote
  4. [14]
    Nergal
    Link

    Is there any way to confirm that ChatGPT didn't train on data which summarized the novels? I feel like the evidence wasn't as damning as they were hoping.

    3 votes
    1. [13]
      Algernon_Asimov
      Link Parent

      In the legal complaint that's linked to the article, there are a few sections which focus on this question of whether ChatGPT had access to the plaintiffs' novels. Paragraphs 28 through 35 lay out the case for alleging that ChatGPT did have access to the novels in question.

      Specifically, Paragraph 34 says this:

      34. As noted in Paragraph 31, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community [...] On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries”, because those are the sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

      Admittedly, the evidence is circumstantial - and that's probably by design; OpenAI has been deliberately vague about the corpora of texts it has used to train this chatbot.

      However, the allegation seems to be that the books were obtained illegally.

      19 votes
      1. [7]
        Pioneer
        Link Parent

        OpenAI has been deliberately vague about the corpora of texts it has used to train this chatbot.

        And here lies the true problem with so many of these damn LLMs.

        From absolute negligence through to obtuse refusal to explain what they're modelled on.

        You can't put the genie back in the lamp, but regulation should focus on making these models as understandable as possible to users and consumers. Where did the data come from, what was done to the data, how does it work? If businesses have to explain where their resources/products/cash come from, then why do we not enforce the same thing on AI / data products? (In fact, under GDPR you have to explain how a user's data will be used and get agreement from end users.)

        8 votes
        1. [6]
          Wulfsta
          Link Parent

          The answer here is that OpenAI only has a slight edge over other products right now, and they’re probably being this protective because they stand to gain a lot more than they’ll lose if they can cement themselves as the leader of the industry in AI.

          2 votes
          1. [5]
            Pioneer
            Link Parent

            It doesn't really matter about their protectiveness to be perfectly honest.

            Technology is seen by so many as "just another useful tool", rather than in terms of the actual ways these systems are going to be used in the future. So many are going "ha, I'll be able to have a conversation with my PA soon!"... without realising how any of this works.

            I just commented elsewhere about data literacy and the lack of understanding people generally have around maths, statistics and data - let alone an understanding of how neural nets are trained and how they function. If we're to actually understand these algorithms and make them useful, they need to be publicly understood, or at least decodable to experts who can explain and verify that they are doing exactly what they claim to do.

            It does not help that the CEO of OpenAI has been going around doom-and-glooming it, calling this new generation of machine learning an avenue to the end times and all sorts. No one wants to open Pandora's box, but in this instance we should be doing just that. Either prove you've got god-tier AI (or are on the road to it), or lay bare the facts and stop with the hyperbole. (Not you - the CEOs of these firms.)

            1 vote
            1. [4]
              Wulfsta
              Link Parent

              I disagree about making an algorithm useful without fully understanding it. There are plenty of examples of traditional mathematical models that do not produce good results when given noisy inputs - Kalman filters (and similar) exist to solve exactly this sort of issue. Due to the nature of neural nets, you can't really decode them; they are black boxes. Plenty of companies are going to use this tech, and they're going to develop methods to perform analogous result filtering. My point is that they won't be open about it unless forced to, because they stand to gain much more by keeping their technology opaque.
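
              For reference, the noise-smoothing role Kalman filters play here can be shown in miniature. This is only a toy 1-D sketch with made-up variance parameters, not anything any LLM vendor is known to use:

              ```python
              def kalman_1d(measurements, process_var=1e-4, meas_var=0.1):
                  """Minimal 1-D Kalman filter: fuse noisy scalar measurements
                  into a running smoothed estimate."""
                  x, p = measurements[0], 1.0  # state estimate and its variance
                  out = []
                  for z in measurements:
                      p += process_var          # predict: uncertainty grows over time
                      k = p / (p + meas_var)    # Kalman gain: trust in the new reading
                      x += k * (z - x)          # update estimate toward the measurement
                      p *= (1 - k)              # updated estimate is more certain
                      out.append(x)
                  return out
              ```

              The point is that a Kalman filter is fully inspectable - every intermediate quantity has a meaning - which is exactly what a neural net's weights lack.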

              2 votes
              1. [3]
                Pioneer
                Link Parent

                You do not have to fully understand it, but you absolutely should know the inputs into any decision. That wouldn't be difficult to do, as you'd just have it harvest metadata at the point of access, per query. Then you can trace that back to a source.

                If you can corroborate that the source data was used, there's the instance. It's a powerful tool against misinformation systems, as you could say:

                10,000 possible data points from Reddit: 9,000 were from {questionable subreddit name}, 500 were from {meme subreddit}, and 500 were from {news subreddit} - and from that generate an understanding of where the query got its data and the bias it may show.
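
                That kind of breakdown could be computed from per-query source metadata roughly like this (a sketch of the idea only; the source names and counts are placeholders from the comment, not real telemetry any model exposes):

                ```python
                from collections import Counter

                def source_breakdown(sources):
                    """Summarize which sources contributed to a query's supporting
                    data points, as (source, share) pairs sorted by share, to
                    surface where the answer's bias might come from."""
                    counts = Counter(sources)
                    total = sum(counts.values())
                    return [(src, n / total) for src, n in counts.most_common()]
                ```

                Fed the comment's hypothetical numbers, it would report the questionable subreddit supplying 90% of the data points.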

                My point is that they won’t be open about it unless forced to, because they stand to gain much more by keeping their technology opaque.

                Just like anyone else to be honest. But these tools go beyond "lol, profit" and into the realms of "Fuck me, social engineering to the max" and we really need to get our heads collectively around their usage.

                1 vote
                1. [2]
                  Trauma
                  Link Parent

                  Either I completely misunderstand what you are proposing, or you completely misunderstand how an LLM works. First of all, they're not algorithms; they're heuristics.

                  The beauty of a neural net is that you don't have to understand the problem domain to train it to give you a solution. If you show it enough pictures of apples and pears, it will learn to distinguish between them, even if you couldn't describe how you do it in the form of an algorithm.

                  And since we don't understand what human language actually is and how we form it, despite centuries of study, we don't have good algorithmic solutions to generating natural language.

                  The bad thing about a neural net is that it can't give you reasons for its decisions. If you show it a picture of a banana and it says "apple" because all it knows are apples and pears, all you can measure is the strength of its confidence in the decision; you can't ask which picture of an apple gave it the idea, because all the training inputs got mushed into a network of weighted connections. That's precisely the function of neural networks: mush data into weights and then use them as a heavily compressed, lossy aggregate of the input data.
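
                  The point can be seen in miniature: a trained classifier's only output is a confidence distribution over the classes it knows, and nothing in that output points back to any particular training example. A toy sketch with invented logits for a hypothetical ["apple", "pear"] net:

                  ```python
                  import math

                  def softmax(logits):
                      """Convert raw network outputs into confidences summing to 1."""
                      m = max(logits)  # subtract the max for numerical stability
                      exps = [math.exp(x - m) for x in logits]
                      s = sum(exps)
                      return [e / s for e in exps]

                  # Shown a banana, an apples-vs-pears net still must pick one of its
                  # two classes; these confidences are all it can report - there is
                  # no handle back to the training images that produced the weights.
                  confidences = softmax([2.0, 1.5])
                  ```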

                  3 votes
                  1. unkz
                    Link Parent

                    Not really disputing what you're saying, but just as a point of interest (and you may already know this): it is often possible, albeit time-consuming, to get a good sense of which items of the training corpus contributed most significantly to an image identification by measuring embedding distances. I'm not sure how analogous LLMs would be in this sense.
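
                    A rough sketch of the embedding-distance idea (the corpus names and vectors here are invented placeholders): rank stored training-item embeddings by cosine distance to the query embedding and report the nearest ones as the likeliest contributors.

                    ```python
                    import math

                    def cosine_distance(a, b):
                        """1 - cosine similarity between two equal-length vectors."""
                        dot = sum(x * y for x, y in zip(a, b))
                        na = math.sqrt(sum(x * x for x in a))
                        nb = math.sqrt(sum(x * x for x in b))
                        return 1.0 - dot / (na * nb)

                    def nearest_training_items(query_emb, corpus_embs, k=3):
                        """Rank hypothetical training-item embeddings by distance to
                        the query; the nearest items are the candidate 'contributors'."""
                        ranked = sorted(corpus_embs.items(),
                                        key=lambda kv: cosine_distance(query_emb, kv[1]))
                        return [name for name, _ in ranked[:k]]
                    ```

                    For LLMs the same nearest-neighbor search can be run over passage embeddings, though whether "nearest" still means "contributed most" is exactly the open question above.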

      2. [4]
        bloup
        Link Parent

        I just want to point out that the Internet Archive hosts over 750,000 books that were published in 1924 or earlier, which would have all been considered public domain under US copyright law at the time of ChatGPT’s reported cutoff date.

        Do you have any more context describing the “nature” of books2?

        4 votes
        1. Algernon_Asimov
          Link Parent

          Do you have any more context describing the “nature” of books2?

          I only know what's in the legal complaint which I read for the first time shortly after I posted this article.

          3 votes
        2. schmonie
          Link Parent

          The books2 / books3 datasets have been around for a long time, and have appeared in countless published papers on new language models. Up until very recently, most of these models have themselves been open source and freely available. IMO OpenAI is getting heat for this because (1) they are being deliberately secretive about the sources of their training data (even if the rest of the industry is using the same data in question), (2) they have a direct profit incentive from their model, and money has now changed hands for something that may have been created with copyrighted work, and (3) it put this sort of situation on the map for book publishers, who probably had little (financial) reason to care when these datasets were just being used for pure research.

        3. Trauma
          Link Parent
          That's fine if you want your chat bot to talk like someone from 1900, but that wouldn't create the impact ChatGPT has.

          That's fine if you want your chat bot to talk like someone from 1900, but that wouldn't create the impact ChatGPT has.

      3. ourari
        Link Parent
        The Washington Post had an article back in April about where chatbots get their data. It also pointed to pirated books: https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

        The Washington Post had an article back in April about where chatbots get their data. It also pointed to pirated books:

        The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com No. 1, which contains text from patents issued around the world; wikipedia.org No. 2, the free online encyclopedia; and scribd.com No. 3, a subscription-only digital library. Also high on the list: b-ok.org No. 190, a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.

        https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

        3 votes
  5. [3]
    schmonie
    Link
    There is a bit of a stink going on in the language technology space right now regarding the legality of a very popular dataset called books3, which is a dataset comprised of pirated books from the...

    There is a bit of a stink going on in the language technology space right now regarding the legality of a very popular dataset called books3, a dataset composed of pirated books from the (I think) now-defunct site "bibliotik."

    This dataset is used in open-source language models, and was probably also used by OpenAI at some point. While the dataset has been around for years now, publishers are calling into question whether it's legal to obtain or even use this data.

    It also calls into question the license that model creators can put on their work. Most off-the-shelf language models can end up reciting portions of their training data verbatim, so if you create a model that does this with copyrighted work, do you have the authority to license your model under something as permissive as Apache 2.0?
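    The verbatim-recitation worry above can at least be checked mechanically. Here's a minimal sketch (toy strings and a made-up helper name, not any real training pipeline or OpenAI API) that flags n-gram overlap between a model's output and a reference text, which is roughly how memorization audits measure verbatim copying:

    ```python
    # Toy memorization check: what fraction of the generated text's
    # n-grams also appear verbatim in a reference corpus?

    def ngrams(tokens, n):
        # Set of all consecutive n-word windows in the token list.
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def verbatim_overlap(corpus_text, generated_text, n=8):
        """Fraction of n-grams in generated_text found verbatim in corpus_text.

        1.0 means every n-gram of the output occurs word-for-word in the
        corpus; 0.0 means no overlap at that window size.
        """
        corpus = ngrams(corpus_text.split(), n)
        generated = ngrams(generated_text.split(), n)
        if not generated:
            return 0.0
        return len(generated & corpus) / len(generated)
    ```

    Real audits work on model tokenizations rather than whitespace words, but the idea is the same: a high overlap score at a large window size is evidence of verbatim memorization rather than transformation.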

    A popular dataset that uses books3 is called "The Pile," a varied collection of training data for causal language models. EleutherAI, the creators of "The Pile," actually addressed this copyright issue all the way back in 2020 in the original paper. Their reasoning is that the books are not kept in their original form anyway, so fair use should apply since the work is transformative (even for the dataset itself, let alone the downstream models).

    The difficulty here is that none of this has ever been tested in a court before, so nobody knows for certain how legal any of this is.

    2 votes
    1. [2]
      Algernon_Asimov
      Link Parent
      I believe that this class-action lawsuit is an attempt to test that legality. I think this is a test case, to force a legal ruling that would require machine learning developers to pay content...

      The difficulty here is that none of this has even been tried in a court before, so nobody knows for certain what the legality is of any of this.

      I believe that this class-action lawsuit is an attempt to test that legality. I think this is a test case, to force a legal ruling that would require machine learning developers to pay content creators of all varieties for the content they feed into their large language models.

      1. schmonie
        Link Parent
        My thoughts too. From what I've heard, the publishers do have a reasonable shot at this. If something like this goes to court, and a ruling is made, it's going to have outsized impact on the AI...

        My thoughts too. From what I've heard, the publishers do have a reasonable shot at this. If something like this goes to court and a ruling is made, it's going to have an outsized impact on the AI industry. IANAL, but I feel this could similarly call into question much more of the crawled data that modern language modeling datasets are based upon.

        1 vote