It would need to have proof of residency, at a minimum. For me that was an ID or utility bill.
And the AI has to follow the rules of its local library: only a handful of digital books and maybe 20 physical books at one time.
a company can be a person,
Isn't that specifically for one thing like taxes or something?
AI already can't hold a copyright... there are a whole bunch of legal precedents about to be set in the next few years
Isn't that specifically for one thing like taxes or something?
The original case was about the speech rights of corporations (specifically, political speech). Off topic, but for context that decision (Citizens United v FEC) led directly to the emergence of Super PACs and a subsequent tsunami of "dark money" flowing into politics.
The concept of corporate personhood in American jurisprudence is much older than Citizens United, going back at least to Santa Clara County v. Southern Pacific Railroad Co. (1886) or arguably to Trustees of Dartmouth College v. Woodward (1819).
It certainly isn't for everything. I don't think corporations have mandatory schooling in their formative years, for one. Learning addition and all the countries, etc.
Fair use argues they don't have to.
I want to expand a bit on my above point. Fair use makes exceptions for research, along with transformative works
IANAL but as someone who has done a lot of amateur studying of this area of law, I don't think your presentation of AI training as definitely fair use is sound. AI training may or may not be fair use; it depends both on the specifics of an individual case as well as on legal precedent that has not been set yet. At the moment we can speculate and argue about whether AI training is/should be fair use, but anyone who claims it is or isn't with 100% certainty is betraying a poor understanding of the situation, because it absolutely is not 100% clear that AI training is or isn't fair use legally.
The portion of the law you quote directly precedes the list of four factors that judges actually use in analyzing whether something is fair use. The purposes listed there are indeed purposes that are intended to be protected by fair use, and that's accounted for in the first factor:
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
The law is actually explicit here in saying that the commercial nature of a use is part of the first factor in a fair use analysis. It's definitely not the only part, and it's perfectly possible for something to be commercial and still be fair use. But it's certainly not completely irrelevant. Furthermore, even ignoring the commercial aspect, even noncommercial uses won't necessarily always be fair use, because purpose/character of the use is only one of the four factors. Even if the commercial factor is considered irrelevant because AI training is research (and it certainly is research, though as someone in the field with a linguistics background I think "research into the underlying nature of language" is... overselling it), it's possible for a use to still not be fair even if the first factor weighs in favor of fair use. It is one of four factors for a reason.
For example, the unpublished “nature” of a work, such as private correspondence or a manuscript, can weigh against a finding of fair use. The courts reason that copyright owners should have the right to determine the circumstances of “first publication.” Use of a work that is commercially available specifically for the educational market is generally disfavored and is unlikely to be considered a fair use. Additionally, courts tend to give greater protection to creative works; consequently, fair use applies more broadly to nonfiction, rather than fiction. Courts are usually more protective of art, music, poetry, feature films, and other creative works than they might be of nonfiction works.
I wanted to emphasize that particular sentence because it shows how the purpose and character of a use can be wholly educational and yet still be unlikely to qualify as fair use. Of course, the actual weight of this factor for or against fair use depends wholly on the work being used, not on how it's used, so it's impossible to make broad-strokes statements about this factor for AI training. It'll depend on what's being used and how it was acquired.
The third factor is:
the amount and substantiality of the portion used in relation to the copyrighted work as a whole;
Of course, AI training almost always uses the entirety of a given work. I suppose you could argue that use of the whole work is necessary for their purposes. A general rule of thumb is that you can use as much as is needed for your purposes, even if it's the whole work. But I'm still not sure that would fly here -- one could argue that unlike with criticism and satire, which is usually targeted at a specific work and thus must use portions of it, AI training like this doesn't require the use of any one specific text, just large amounts of texts in aggregate, and therefore for each individual work it doesn't strictly need to use any of it. Not sure whether that argument would fly in court either. But in general I think this factor weighs pretty clearly against fair use for AI training. Which is fine -- it can still be fair use if the third factor weighs against fair use! There are lots of uses we don't want to forbid that use the entire work, and this is the reason we have four factors in this analysis.
The fourth factor is:
the effect of the use upon the potential market for or value of the copyrighted work
This is the factor that's probably going to actually get litigated the most (both in the court of public opinion as well as in actual court once it gets there), and it's also almost definitely where the commercial element is going to play a bigger role. I think it varies a ton based on the particular work in question. I think fiction writers have a pretty weak claim here (no one is using AI models like these to replace reading/buying their books) EXCEPT for in the case that their books were pirated rather than purchased for training. To quote Columbia University Libraries' page again:
To evaluate this factor, you may need to make a simple investigation of the market to determine if the work is reasonably available for purchase or licensing. A work may be reasonably available if you are using a large portion of a book that is for sale at a typical market price.
That's not to say this definitely invalidates that factor for those books, but I think the piracy does definitely weaken OpenAI's case with regard to this factor (and also maybe the second factor as well; IANAL so I'm not an expert on where that part of things would fall in a judge's analysis). I think a nonfiction or reference work, such as an encyclopedia, is far more likely to have a case that the resulting LLMs are being used commercially in a way that will negatively affect the market for their original works. But I think visual artists have an even stronger case when it comes to this factor -- it is VERY obvious even to a layperson that AI art generators will have huge effects on the market for visual artists.
However, actual analysis of the other three factors isn't even the biggest problem with confidently declaring that AI training is fair use. The biggest issue is that fair use analysis is something that is assessed on a case-by-case basis. The specific facts of a case have a HUGE effect here, and hopefully seeing the factors laid out makes it clearer that even for a specific AI model it's possible that some of the use was fair and some wasn't. Moreover, we can't even try to generalize from other AI training cases that judges generally consider it to be fair use, because we don't have any court decisions in which a fair use analysis was done on AI training yet. As someone who works in AI and has a strong amateur interest in copyright law, I'm super excited for when we do get one! But we don't have one yet, and that means it's not really possible to make even generalizations about how AI training is seen when it comes to copyright and fair use. By contrast, we do have precedent on some other AI-related copyright issues (such as the inability of the output of generative AI models to be copyrighted, which is very well-established already).
Maybe judges will find that OpenAI's use of copyrighted works for training data was 100% fair use all the way! Even if that happens, it will only potentially inform future lawsuits against AI companies for doing similar things; it will still absolutely be possible for another instance of AI training on copyrighted work to be found to not be fair use, because the individual facts matter so much in fair use cases. Without any clear precedent on this issue to point to, it is inaccurate to confidently claim that AI is considered fair use. Maybe you think it is! And maybe you'll later turn out to be right! But even a lawyer or judge shouldn't frame it as anything more than their professional opinion until it's actually been decided in court.
I think we've had about as much productive discussion on the copyright side of things as two non-lawyers with different opinions on a fairly novel legal question can tbqh! Aside from agreeing to disagree and seeing what happens, of course.
As for whether LLMs offer much insight into the fundamental nature of language, my background prior to getting into NLP was in theoretical linguistics. Language models are super useful for research about language, but coming from that scientific background, I think they don't teach us anything fundamental that we don't already know. They're useful from the perspective of gathering and analyzing data, but when it comes to the Big Questions, those have either already been answered or aren't really addressed by this type of AI. They can help us study human language use, but they don't really illuminate anything truly fundamental about How Language Works that we don't already know.
If AI used some sort of explicit underlying structure in how it analyzed language, that could be used to argue for a theory of syntax or psycholinguistics (or both) that maps closely onto it. But since these models are built to rely purely on statistics (which has described pretty much every model for a long time) and the models themselves are black boxes, we can't really do this. Even if we could, it's not necessarily the case that the way these language models underlyingly model syntactic structure is the same way human beings do, so even that would certainly not stop the squabbling over how syntax in human language is truly underlyingly structured.
The biggest thing these models have done in this regard is hugely validate the idea of distributional semantics, since its principle is foundational to pretty much every remotely modern language model. But even that isn't really teaching us anything fundamental -- that linguistic theory started in the 1950s and inspired those approaches to language modeling rather than the other way around. So while it's super cool to see how thoroughly the concept has been validated by LLMs, it's not something that imo teaches us anything new and fundamental.
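If you want to see what "distributional" means in practice, here's a tiny toy sketch (entirely my own illustration, with a made-up four-sentence corpus, not anything from a real model): represent each word by the words it co-occurs with, and words used in similar contexts come out similar.

```python
import math
from collections import Counter, defaultdict

# Distributional semantics in miniature: represent each word by the
# words it co-occurs with. Words used in similar contexts ("beer" and
# "juice") end up with similar vectors; unrelated words don't.
sentences = [
    "i drink cold beer",
    "i drink cold juice",
    "i drive a fast car",
    "i drive a fast truck",
]

vectors = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                vectors[w][c] += 1  # tally every co-occurring word

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

print(cosine(vectors["beer"], vectors["juice"]))  # 1.0 (identical contexts)
print(cosine(vectors["beer"], vectors["car"]))    # ~0.29 (little overlap)
```

Real models use vastly bigger corpora and learned embeddings rather than raw counts, but the underlying bet is the same one distributional semantics made decades ago.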
When I think of questions that advance our fundamental understanding of language, my mind goes to questions about the existence and nature of the human language faculty and whether certain features/structures are so fundamental to human language that no natural language will occur without them. These are the sorts of things that are big, fundamental questions in linguistics that are largely unsolved and very controversial. But language models are more or less useless in studying these.
I don't want to downplay the validity of research into language models, fwiw. It's great stuff and these models can be practically super useful in conducting theoretical linguistics research, in addition to their myriad other practical applications. But most of their utility when it comes to answering fundamental questions about language comes from using them as a tool in studying human language, as the models by themselves mostly reflect what we already know or are black boxes.
I hope this comment was somewhat coherent and interesting... it's a bit late at night my time so let me know if anything is confusingly written and I'll try to clarify when it's daytime for me 😅
This just doesn't seem morally right to me. For research purposes, feeding an LLM a bunch of research material is of course fine. But when the product is "a thing that can algorithmically mimic anything because it ate the internet", your product isn't the algorithm, it's the material it has consumed, and it seems just wrong not to recognize that.
I guess the courts will decide this fair use argument for us, but man, it just seems bad for society as a whole to disincentivize the creation of original creative works, original reporting, etc. But then again, you can argue that this also creates some new works and can streamline the process when used as a tool.
We probably have to agree to disagree, but there's a little misunderstanding:
If you were to make this illegal, or require additional permission, you would kill research in more avenues of the human experience than I could even begin to count, along with a bunch of technical advancements too. No researcher outside of a massive company would be able to do this. It would also almost certainly kill any hope of open source versions of these tools.
I'm not saying these AI models should be made illegal or denied fair use; like I said in my previous comment, using them as tools for research purposes seems totally fine to me. What I mean is that there's a real difference between selling a tool that's able to analyze, interpret, and reorganize data, and the currently available services that are actually selling the metadata (or just data, I think it's kinda semantics) without you having to do anything but prompt it.
I might've explained this poorly, but hopefully you understand the difference (I think there is one)?
Just wanted to fix that misunderstanding, other than that I think I understand your argument and just don't agree.
You may be interested in info law scholar Ben Sobel's commentary and paper regarding the inadequacy of the fair use doctrine in handling AI art issues. He wrote about the difference between "non-expressive uses of AI which do not infringe copyright" (for example, Google Books), vs. "expressive, market-encroaching uses of AI which may infringe copyright" (for example, generative AI models). While the former "promulgates information about a work", the latter "usurps that work's expressive value" and could "diminish the demand for the human-authored works on which it trained". He explains that this distinction is relevant to "the two most important statutory fair use factors: factor one, the purpose and character of the use, and factor four, the effect of the use upon the potential market for or value of the copyrighted work.":
Thus, to the extent that this AI derives value from its input data, it engages not with mere facts about copyrighted materials, but instead with the protected, expressive aspects of those materials. Such an expressive purpose does not make a use per se non-transformative. But it does make the rationale of non-expressive fair use unavailable.
I think we need to look at the purpose of copyright. The basic idea is to give a limited "right" to reproducing one's work, so that there is an economic incentive to create stuff. But to keep this "right" from being used to limit others' creations, we have "Fair Use" or "Fair Dealing", which grants exceptions. The basic idea in all of this is a system which stimulates creation.
So, if a graphic artist who has spent a lifetime mastering his craft can have his style analysed so that anyone can create pretty much the same in a matter of seconds, this does stimulate creation. A person making a tabletop game can give it superb graphics far beyond what his budget would normally allow.
The flipside of this is that there is no longer any actual value in learning the craft. So fewer would bother, because what's the point. So this does hinder creation.
So the final verdict ... who the heck knows. But it doesn't really matter, because the tech is out there, and I don't think regulation can do much about that.
Seriously. OpenAI has a market cap of $80 billion.
Traditional established modestly successful authors may receive an advance of $1k-10k for their works.
So let's say OpenAI puts $10 billion into content farming. They could offer authors $5k advances. The authors still get to sell their books, but in return for the advance, OpenAI can use the novel in their model training. $10 billion would get OpenAI access to 2 million full-length novels. And they can commission works in whatever genres or forms they need most.
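A quick sanity check on that arithmetic, using the hypothetical figures above (nothing here is a real OpenAI number):

```python
# Back-of-envelope check of the hypothetical content-farming budget.
budget = 10_000_000_000  # $10 billion, the assumed spend
advance = 5_000          # $5k advance per commissioned novel
print(f"{budget // advance:,} novels")  # 2,000,000 novels
```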
Or, likely even cheaper, they could just directly pay authors for the rights to their work. I imagine many small authors would be willing to license their works for a few hundred bucks. After all, you're just trying to build a general language engine; you don't need to hire A-list authors for this. There is no shortage of small-time authors out there who would let you throw their novel into the grist mill for a few hundred bucks. There is no shortage of aspiring authors who would happily take a $5k advance, write a novel, and let you use it in your AI training as long as they could retain the other rights to the work.
Seriously. OpenAI has a market cap of $80 billion.
So that’s it. We’re locked in with only OpenAI being able to train things to the level they did. A de facto monopoly, broken only by China, which doesn’t care about the ownership of bits on the internet.
Yeah, this was my exact thought when this story broke last week. Claim “the market needs regulation!” knowing that you (or friendly parties) will write the regulations such that you build your moat after you have extracted enormous value from using other people’s content freely, locking the current players in as what we’ll have “forever”.
I mean, the logical thing to do would be to forcibly tear down everything they built and force them to do it publicly audited, from scratch. Bet that shrinks that $80 billion a good bit. Plus if they are found guilty of violations, and discovery reveals the extent of it, every copyright holder could potentially seek punitive damages of $100k per work, per violation. And if they use RIAA math, every time they copied that work to another computer is a potential violation.
Especially now that they'd need to keep track of a massive compliance database.
There is an alternate path. You could get authors to donate their works to an open source nonprofit project that is prohibited by its charter from engaging in any commercial activities, or owning any subsidiaries that do so. (Trying to avoid OpenAI's weird structure here.) These AI models could be created as public projects by people donating their time and their computing resources to the project. Authors could in turn freely sign over the use of their work for the purpose of AI training.
Or, amusingly, we could train our LLMs from public domain works. Think text from before the 1920s or so. Novels, newspaper archives, scientific journals, databases of private diaries, etc. Then the resulting AIs can be not only racist, but antiquated racist! They'll have to constantly filter them from defaming the Irish and Italians.
You don’t get to GPT4 quality LLMs by having random works being donated. Words aren’t just fuel you put in the vehicle.
Quality matters. Extensive, varied, and well curated datasets are what's needed. A bunch of low quality fanfiction will not cut it.
Furthermore it’s like looking at what’s been achieved by the internet in the past few decades and saying “Yep, let’s not make something amazing with all this — instead let’s ask people to redo it all for free”.
OpenAI broke copyright. Yep absolutely. So did Spotify and Netflix. So did a LOT of other companies throughout the years. Companies providing useful services for the world, from entertainment to productivity.
Is it ethical to profit from it the way they do? Absolutely not. Is it reasonable to ask them to build it under the current framework? No, it’s not reasonable either.
All the framework is doing is preventing ethical people from building these amazing tools and leaving it only for those willing to break the rules. We need a different, sustainable system, not more people giving their work away into the public domain.
There's https://huggingface.co/Mitsua/mitsua-diffusion-one which is a model being trained from scratch using only public domain and voluntarily donated material.
I literally had this conversation with a friend of a friend who works on Copilot.
Me: Even if they rule against OpenAI and they need to start over from scratch no one else can get the institutional knowledge you get from building those models in the first place.
Her, beaming: Exactly!
(paraphrasing)
What’s funny is she doesn’t even work for OpenAI. She works for Microsoft. To her it’s more like rooting for her home team and less a way to get rich. FAANG companies are the Bay Area’s sports teams.
Note that they aren't just talking about full books, here's the actual quote cited in the article that the headline is from:
“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”
For proper stories I can at least see the argument—and if nothing else they should have to buy the book—but I dunno if extending that to "pay every Twitter user per Tweet" is as reasonable. (No idea about the legality though, I'm talking in the plain ethics sense.)
Pay for it.
Presumably they do. If someone could prove that OpenAI or anyone else was using pirated material in their training data, then that would set them up to argue that the number of infringements somehow scales with the amount of training. Even with conservative penalties for copyright infringement, that would obliterate a defendant.
This is almost exactly what is currently being litigated in regards to this whole mess. They paid what the rest of us pay for most of this stuff: nothing. The problem is that they're using it to make a profit. I'm not reading a New York Times article and then re-posting it as my own writing for my newspaper that I sell. I explicitly do NOT have the right to do that, and if I want the right to do that, I have to pay. And we know they do not have those kinds of deals in place, because it'd be insane.
They're arguing the AI is transformative enough that it counts as a new work or whatever, and those who feel ripped off are calling bullshit since it can, with proper prompts, literally just recreate the original work in some cases, or has the signatures of the artists used in the resulting output.
And that's just focusing on art, not video or text where it looks even worse. Can I just republish someone else's book with all the names changed and not be infringing? Probably not, but you can clearly use the AI to do exactly that, and so on.
Where I see it getting really thorny is when you start asking, "how similar/different are the processes of LLM training vs how humans learn?"
I was not born learning how to speak any language. Yet, now I can speak and write with some reasonable coherency. How did I gain that ability? While some of it was formally trained, much of it wasn't. I learned most of my speech abilities simply by interacting with people, picking up vocabulary, syntax, grammar, etc.
But how does such learning work? A young child will hear a word and try using it. They initially have no idea what it means. However, by using it, they receive feedback from other people around them. Eventually, through trial and error, they learn the correct way to use the word. Even in formal educational settings, much of learning occurs through a practice/feedback cycle. The processes we use to learn don't really seem all that different to me from how an LLM does it. Our minds are just more purpose-built for the task of learning language, so we do it far more time- and energy-efficiently than an LLM does.
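To make "learning purely from exposure" concrete, here's a deliberately tiny sketch of my own: a bigram counter, which is nowhere near how a real LLM works internally, but captures the same learn-from-examples spirit. It picks up which words tend to follow which purely by observing text, then produces new utterances from those statistics.

```python
import random
from collections import Counter, defaultdict

# A toy "language learner": it is told nothing about grammar, it only
# observes text and counts which word tends to follow which.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1  # exposure: tally each observed continuation

def babble(word, max_len=8):
    out = [word]
    for _ in range(max_len):
        options = follows[out[-1]]
        if not options:
            break
        # continue in proportion to how often each continuation was seen
        out.append(random.choices(list(options), list(options.values()))[0])
    return " ".join(out)

print(babble("the"))  # e.g. "the dog sat on the mat . the"
```

Real LLMs replace the counts with a neural network predicting the next token, but the training signal is still just exposure plus a correction when the prediction is wrong.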
But in terms of copyright, there's no obvious easy answer here. If you say, "no copyrighted works for LLMs," then you are now applying a more stringent standard to AI-created works than applies to human-created works. If that standard applied to humans, no person who has ever read a copyrighted work would be eligible to write a book.
Rather, I think the best path forward to address copyright and LLMs is to apply the same standards that apply to human-created works. The issues raised by LLMs are not new; the courts have been dissecting the question of "how close does it need to be to be copyright infringement" for generations. Does using a common saying, even if you first saw it in a copyrighted work, count as infringement? Probably not. How about a single whole sentence? What about a few sentences? What if it's quoted and properly cited? What if I copy your whole book, but I make it one big quote and properly cite it? These are the issues that the courts have spent centuries hammering out the jurisprudence on. I think we're probably best just applying the same copyright standards to AI writing as we do to human writing. If something is too close for humans to legally do, it's too close for AI to do.
And ultimately, this is probably the only viable path going forward. It's already often difficult to tell if something is AI-generated or not. In time, it will become even more difficult and eventually impossible. Even if you pass a draconian law that says, "no AI generated work can be published for profit," you then have to define what "AI generated" means. Search engines are all implementing AI into their engines. If I use a search engine while doing research for a novel, does that mean that novel counts as AI generated? If AI writing is indistinguishable from human writing, how do you even begin to prove that works are AI generated? What if I write a novel completely myself, but I use an AI-translation software to translate it into another language? What if I write the novel, but I use an LLM-powered spelling and grammar checker to aid in my proofreading? What if I do all the research, writing, and editing myself, but the initial idea for the work was inspired by an AI-generated blog post I read? Etc. As LLMs become more and more integrated into everyday life, trying to create any work without at least a tiny amount of AI assistance will be near impossible. Unless you do all the research with physical books at a library, and write your novel on a typewriter, some AI will likely be involved in some small way. With how ubiquitous this stuff will likely become in every aspect of our lives, trying to ban any use of AI in written works seems folly.
So I think the best path forward may just be to treat LLM-generated works just like human-generated works. Does your LLM tool regularly output paragraphs that have snippets taken from authors that, if done by a human, would count as copyright infringement? Then your company is now liable for that infringement. Does your model output a space opera that is formed from a distillation and synthesis of the thousand top sci fi books ever written, but isn't an obvious and direct copy of any of them, and it doesn't use long direct quotes from any of them? Well, a human wouldn't get in trouble for doing that, so neither should your AI.
By applying this standard, you completely avoid the impossible problem of trying to determine whether a given work is AI-generated or not. And you also avoid having to wrangle with precisely how much AI contribution is needed before a work counts as "AI generated."
I was not born learning how to speak any language. Yet, now I can speak and write with some reasonable coherency. How did I gain that ability? While some of it was formally trained, much of it wasn't. I learned most of my speech abilities simply by interacting with people, picking up vocabulary, syntax, grammar, etc.
But how does such learning work? A young child will hear a word and try using it. They initially have no idea what it means. However, by using it, they receive feedback from other people around them. Eventually, through trial and error, they learn the correct way to use the word. Even in formal educational settings, much of learning occurs through a practice/feedback cycle. The processes we use to learn don't really seem all that different to me from how an LLM does it. Our minds are just more purpose-built for the task of learning language, so we do it far more time- and energy-efficiently than an LLM does.
As someone with a background in linguistics, I just want you to know that you have unintentionally opened almost every possible can of worms when it comes to this part of the field in these two paragraphs lol. I actually agree with most of what you say here and elsewhere in your comment (though I think the way LLMs learn is further from what we know of how humans learn the more you get into their underlying structure and points of failure) but I just wanted you to know that I could pick out a couple sentences from these paragraphs and start a huge fight at any meetup of linguists lmao.
As someone with a background in linguistics, I just want you to know that you have unintentionally opened almost every possible can of worms when it comes to this part of the field in these two paragraphs lol.
Ok, now I'm really curious. Give me the juicy details! Just how many academic traditions did I completely step all over? :D
The big thing is about our minds being purpose-built for language. There's a lot of disagreement within linguistics about exactly to what degree this is true -- the idea that we have something in our cognition purpose-built for language. This on its own is only a little controversial, but it dovetails nicely into the holy grail of starting linguistics arguments: Universal Grammar.
Universal Grammar is a very historically important linguistics theory originally introduced by Noam Chomsky. It's also by far the best way to start arguments, as people tend to be passionately for or against it. To try and simplify it to the fundamentals in as non-technical a way as I can:
Kids learn language fast and well. Too fast and well to be learning solely from exposure to language around them plus trial and error. They don't have enough exposure to learn so much complex grammar from nothing. This is known as the Poverty of Stimulus and isn't super controversial on its own afaik.
To account for this, UG theorizes there are some very low-level fundamental language structure rules that we're all born with. This is the titular Universal Grammar.
The Universal Grammar is pretty low-level and allows for languages to vary in a bunch of different ways (called Parameters). They're like switches you can switch on or off for different languages that result in differences in their structure. This way when kids are learning language as babies, they aren't having to learn the super fundamental underlying structure, but instead they're learning which "switches to flip".
There's more to it and it's varied a ton as a theory over time and between linguists. But that's the core idea. It is... controversial. You can start arguments by asking about it, assuming the group of linguists you're in is diverse enough in their theoretical framework. Search for Universal Grammar on r/linguistics (which isn't even principally populated by people with formal education in linguistics) and you'll see what I mean almost immediately.
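If it helps to picture the "switches", here's a crude toy of my own (not anything from the linguistics literature) using one classically proposed parameter, head-directionality:

```python
# Toy "parameter as switch": the same underlying verb-object structure
# surfaces in different orders depending on one setting the child learns.
def verb_phrase(verb, obj, head_initial):
    # head-initial (e.g. English): verb before its object
    # head-final (e.g. Japanese): verb after its object
    return f"{verb} {obj}" if head_initial else f"{obj} {verb}"

print(verb_phrase("eat", "sushi", head_initial=True))   # eat sushi
print(verb_phrase("eat", "sushi", head_initial=False))  # sushi eat
```

Under the UG picture, the child never has to learn the function itself, only which way the boolean is set for the language around them.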
Related to UG, there's a linguist named Dan Everett who claims to have found a single language in the wild that violates one of the most fundamental parts of UG. He will inevitably come up when an argument about UG gets suitably heated. Just bring up his name around linguists and you'll get a ton of strong opinions. I'm holding back on sharing my own right now for the sake of not making this comment even longer and more rambly than it already is.
That's really interesting. Thanks for sharing! I think I've probably heard of Universal Grammar, Chomsky, and such somewhere along the line. But I couldn't recall the terminology. But I've definitely heard before of the idea that the structure of the brain is hard wired for language. I've definitely heard this concept as a possible explanation for why LLMs take so long to learn even basic language.
And by "so long," I mean the amount of text they have to digest. Some searching suggests GPT3 was trained on 8 million web pages, and GPT4 an order of magnitude beyond that. Let's say each web page has 500 words on it, and 50 million web pages are required to train an LLM, for a total of 25 billion words. The average adult, let alone a young child, can read maybe 200 wpm. To read through all the text used to train an LLM would take an adult 125 million minutes, or 237 years. It would take far more than a human lifetime to read through all of the text necessary to train a modern LLM. Even if this very rough estimate exaggerates by an order of magnitude, it would still take you decades to read through all the text used to train these models.
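For what it's worth, the arithmetic checks out. Here's the same rough estimate as a script, where every input is the same guess as above rather than a measured value:

```python
# Rough estimate: how long would a human need to read an LLM's training text?
pages = 50_000_000     # assumed web pages in the training set
words_per_page = 500   # assumed average words per page
wpm = 200              # assumed adult reading speed, words per minute

total_words = pages * words_per_page   # 25,000,000,000 words
minutes = total_words / wpm            # 125,000,000 minutes
years = minutes / (60 * 24 * 365)      # ~238 years of nonstop reading
print(f"{total_words:,} words = {years:.0f} years of continuous reading")
```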
But it's actually much, much worse. This is writing, which is something you could theoretically do continuously. I can sit down in front of my computer and spend an entire day doing nothing but reading if I really want to. But we don't learn our initial language understanding from writing, we learn it from speaking. It's very important to talk to your children and let them overhear conversation, but no parent sits their 2 year old in a high chair and reads them books 16 hours a day, everyday. (At least I sure as hell hope not.) In reality, young kids probably only have the opportunity to actively hear people talking at most what, 20% of the day? Yet they still manage to learn.
In short, we are far, far more efficient at learning language than our best LLMs. And Universal Grammar, or just the vaguer concept of the brain being "hard wired" for language, is a common explanation of this.
Oh yeah, the sheer amount of data you need to train LLMs to reach their current abilities is a great example to use for poverty of stimulus -- and the poverty of stimulus is also a great argument against people who think LLMs are the same as how language works in the human brain.
This makes me miss the days when I was doing proper theoretical linguistics... but unfortunately academia isn't a great place for money OR good work-life balance 😔
I'm not reading a New York Times article and then re-posting it as my own writing for my newspaper that I sell.
I don't think that's the right metaphor here. The AI is (in the cases we care about) not re-creating the training data for distribution.
it can, with proper prompts, literally just recreate the original work in some cases, or has the signatures of the artists used in the resulting output.
That is a problem, and AI companies probably can/should be liable for that. But from my reading of the article (I haven't looked further into the lawsuits it talks about), it sounds like the complaint is about the copyrighted content's use for training, not about the model's ability/tendency to distribute the copyrighted content.
That is a problem, and AI companies probably can/should be liable for that. But from my reading of the article (I haven't looked further into the lawsuits it talks about), it sounds like the complaint is about the copyrighted content's use for training, not about the model's ability/tendency to distribute the copyrighted content.
I agree that this is a problem, but in the event that we end up with a court decision in favor of OpenAI, I'm really hoping to see someone start distributing small models trained specifically on certain texts (like textbooks) that can recreate whole pages based on a specific prompt. I would love it if the college textbook industry could be reshaped by AI.
I can imagine an ai model like "General Chemistry AI" trained on "General Chemistry: Principles, Patterns, and Applications" 3rd edition
Facts aren't copyrightable. Anyone can write a textbook for chemistry, physics, math, language, whatever. The "flavor" text is copyrightable; the wordy examples, the introductions, the discussion of the facts and so on, those are all unique elements that are copyrightable. But the actual core of the subject, that's factual and "open source."
Modern universities aren't about teaching. They're about forcing students to buy textbooks. Which come tied to websites, which require a unique one-time use code to authenticate your student account to. So you can't copy, borrow, or share someone else's textbook. You can't use last year's textbook. The companies aren't interested in that; they want each student to pay several hundred dollars, per class, for a new textbook.
They entice professors with online homework and quizzes and so forth, which go through the website, which needs the code. So the professor says "fuck it, I'll use Pearson's 2024 Chemistry 101" and three classes of thirty students each have to pay $295. And where a professor might not be down with that, their bosses in College Administration are (thanks bribes and payoffs!) and will lean on the prof to use the newest locked down textbooks.
It's a scam. But AI can't fix that. AI, anyone, can already write a textbook. They just have to sit down and do it. Textbook companies have corrupted universities though, and fixing that requires some fundamental changes in the system. Which no one's really interested in, since the textbook companies make sure to divert a portion of their guaranteed profits into maintaining the corruption in the universities.
Yeah, I don't know how much things have changed, but I don't think I bought a single new book for my entire degree. Indeed, my professors would strongly recommend that we buy used, borrow or go to the library.
As of a few years ago, the books were basically a farce at the university my cousin went to in the US.
They didn't bother to bind them. When you bought a book, you functionally bought a stack of papers that may or may not have been proofread, along with a very important sticker on the front with a one time use code that you used to assign yourself to a class on the vendor's website.
Sure, you could go rent the book, or buy used, but the one time use code was the key to the website that the university (or at least a good chunk of the professors) required homework to be turned in through. Having the book itself was inconsequential.
and those who feel ripped off are calling bullshit since it can, with proper prompts, literally just recreate the original work in some cases, or has the signatures of the artists used in the resulting output.
Are you talking about LLMs here or image generation? The article is about LLMs, where recreation of source materials is easier I think (didn't dig into it, feel free to correct me), but with current generative image models I don't think this has ever been proven.
I know of one study that managed to do exactly that; however, it was using an early unpublished research version of Stable Diffusion with a dataset 1 or 2 orders of magnitude smaller, which was not filtered for duplicates and suffered from overtraining because some images were present in 100+ identical copies. Despite that, it took iirc hundreds of thousands of attempts to generate one reasonably similar copy of the original image, and that was when they knew the perfect prompt to use in advance. The results did not apply even to the oldest public version of Stable Diffusion, which did filter for duplicates and trained on many times more distinct images.
Are you talking about LLMs here or image generation? The article is about LLMs, where recreation of source materials is easier I think (didn't dig into it, feel free to correct me), but with current generative image models I don't think this has ever been proven.
The following paragraph is (emphasis mine)
And that's just focusing on art, not video or text where it looks even worse. Can I just republish someone else's book with all the names changed and not be infringing? Probably not, but you can clearly use the AI to do exactly that, and so on.
Then you should back that up, because it's a pretty strong claim that as far as I know is not true.
Literally recreating original work was afaik only shown in the study I mention (an unpublished overtrained version, and even there reproduction was exceedingly rare), and the only additional study I know of, which used the first public version of Stable Diffusion, claimed "copies" but in fact demonstrated only images that were similar in composition.
You asked a question of @Eji1700. I was pointing out that your question was answered in the first sentence of the next paragraph in their post. It was not me to whom you were speaking, so I'll not...
Ah. You didn't add anything to the actually important part of the message, so my mind sort of assumed nobody apart from the author would see a reason to respond. My fault.
What if I started a blog where I analyze and critique NYT articles? Suppose for each article I excerpt and respond to the text paragraph-by-paragraph. And let’s say I break up my content so each article’s critique is in three parts, three blog posts, so no single post contains the full content of the article. Though arguably the full text could be reconstituted from my site.
I’m reading the NYT articles for free, legally. I’m monetizing my blog with ads. Is this legitimate? How is this scenario different from OpenAI’s, legally speaking?
How is this scenario different from OpenAI’s, legally speaking?
The biggest difference is probably in the first factor of fair use, which is that an AI model is more transformative than direct quotations would be.
It is plausible that somebody could read your blog instead of reading the original articles. However, it is not plausible that somebody could craft prompts to retrieve the latest NY Times articles. It is not a suitable replacement for the original.
See for example the judge's objection in the Sarah Silverman book case, who wrote: “There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.”
There are situations where a text is printed verbatim, typically when it's been overtrained due to duplication. Though that's a little more complicated, and generally requires prompt engineering to target those overtrained texts. In most cases, these models are wholly different products than their source material and are likely to meet the first (and generally most important) factor of fair use.
All that said, your blog may also be considered fair use under the first factor! Commentary and criticism are generally valid use cases as well. It's just for different reasons, in this case.
What if I started a blog where I analyze and critique NYT articles? Suppose for each article I excerpt and respond to the text paragraph-by-paragraph.
What if I started a blog where I read NYT articles and then paraphrase them. I'll change the text enough that it's not direct plagiarism word for word, but the sentiment and research is coming entirely from that one source that I'm not paying for.
I don't have an answer for this! It's, to me at least, a genuinely good question. If a human did what LLMs are doing, processing text in order to respond with it later, would they be able to get away with it? Is it fine if the words are simply different?
Fair use involves transformation: as long as the paraphrase fundamentally changes the product so as not to undercut demand for the original, then it's fine... in theory. You'd still probably have to argue that in court, though.
Now, if you added your own personal experience and minimally used direct quotes from the article then you'd have some ground, even if your 'experience' was 'long time NYT reader'.
Yeah I think if something like this went to court it would heavily depend on how transformative the paraphrase actually was -- so it would, like most fair use, be very case-by-case.
Is this legitimate?
IANAL but I would say it's on the line at best. If I were the copyright holder, I would make the argument that you didn't need to quote that much to make your point on any given article, and that, combined, you've sampled every bit of the work.
I’m surprised at the number of upvotes you got here, this is a very piracy-friendly site. I get absolutely roasted any time I even suggest that piracy is maybe less than ethical.
I made no such suggestions, at least I didn't intend to.
My point was "the establishment" goes after people tooth and nail for downloading entertainment without paying, BUT somehow Silicon Valley seems to be promoting a sense of entitlement that they can take whatever they want without paying. In other words, a double standard.
I'm not going to dive too deep into this piracy talk, but there's a distinct difference between a person pirating for personal use (for themselves and others), and a company/group (or potentially an individual) pirating for profit.
From the article:
Previously, the company said it respected “the rights of content creators and owners”. AI companies’ defence of using copyrighted material tends to lean on the legal doctrine of “fair use”, which allows use of content in certain circumstances without seeking the owner’s permission. In its submission, OpenAI said it believed that “legally, copyright law does not forbid training”.
This is all this topic ever boils down to.
The statement from OpenAI about needing copyrighted material is technically neither surprising nor controversial. IANAL but based on my limited experience with fair use for my own art, it seems like AI models are OK here as far as lawsuits from content creators (assuming they're paying for access to the copyrighted material where applicable).
The open issue is if fair use or copyright law should be amended, and that seems more likely to play out in legislation than in the court.
It still doesn't seem to be completely settled how to interpret this within fair use. But even so, I personally think it is fair to argue that this doesn't seem to be the original intent or "spirit" of fair use. It was made for things like being able to quote a book in a review, or use small samples from movies in a song, or whatnot. Before these AI models, the use was limited in scope and impact. Now we have machines that process practically everything every human has ever created, enabling big companies to resell and centralize it all for fun and profit. It is pretty far from the original goal of copyright: to protect individual artists and their work.
I agree.
In (my understanding of) the intent of fair use, party B wants to use the work of party A in some minor and creative/derivative way. The law wants to protect the ability for party A to be paid for their work, without limiting the creativity or potential for party B to create value.
Medium/long term, AI obviously hurts artists, as you said. But short term, it's still early days and artists aren't seeing damage yet. So I think right now it does make sense for lawmakers to write out what we want to happen from here, rather than courts trying to make case law from laws whose intent, given this new reality, has radically deviated from how they are written.
The issue is that copyright is far too all-encompassing and is fundamentally at odds with the way human culture reproduces itself. Nothing that goes into training these models is different from how human beings learn about the same things. We just don't pretend that a copy of something in a human brain is infringing.
At the moment. I'm sure some lawyer at Disney just got a boner and doesn't know why.
I really wish this all ends with saner laws around intellectual property, but I have to assume in reality we'll be left with an even worse state than what we started with.
I'm not sure I agree with this analogy. If a human views some art, it's only consumption. If a human is inspired to make a unique work based on the art, that is certainly legal (thinking about all of the Van Gogh style objects for sale as an example, even though copyright doesn't apply there).
If a human saves some identical representation of the art (screenshot or picture), even if only for personal use, that would technically be pushing the boundaries of legality. If a human used pieces of that saved image to create their own work, I also don't believe that would be legal, especially if they're profiting from it.
So my personal view on this is inspiration vs copy-paste. I suppose it's up for debate on which of these executions AI uses, but I'm inclined to believe it's somewhere in the middle (or perhaps a bit of both).
Another example could be a video game. Let's say I (legally, even) have access to the source code and assets from a video game. Would I be allowed to change the colors of some of the assets and release it as my own game? Even if I was, is that even remotely ethical?
If a human used pieces of that saved image to create their own work, I also don't believe that would be legal, especially if they're profiting from it.
This isn't really true. By and large, it can be legal if the resulting work is transformative; this is part of what fair use is intended to cover. Whether a specific case is fair use (and thus legal) or not is decided on a case-by-case basis by the courts (and whether they're profiting from it or not is only a small piece of how this is assessed). But collage art and blackout poetry definitely aren't illegal per se, which this would imply.
I made a much longer comment elsewhere in this topic on whether AI training is fair use of the copyrighted training data. The short answer is we don't know yet because there are arguments either way and it hasn't been decided in any courts yet. But it is noteworthy that it exists when we draw comparisons to human creativity, even when we note differences between human and AI creation on both a legal and philosophical level.
In any case, I don't think that human inspiration and LLM training are analogous for other reasons. They're simply not that similar underlyingly. This is especially true when one understands LLMs on a technical level and the differences in how they acquire language compared to human children. But even without getting into any technical details, on the most abstract/philosophical level, a human mind exists without exposure to any particular inspiration. You could never expose a human to any creative work and they would probably still be capable of creating something inspired by their own experiences if nothing else. Whereas these models are comprised of what they've learned from this training data and do not exist without it. Even if using copyrighted works to train AI is totally fine, I think comparing it to human creativity betrays a poor understanding of both.
I was with you until the last paragraph, where I think you're falling into a common pitfall of indirectly underestimating AI by overestimating humans.
Humans, like AI, are the sum total of their experiences. The way these models work is fundamentally much more similar to how our brains work than even most people with a better-than-average understanding of the AI tend to think, because they don't understand as much as they think they do about the way the human brain works. The similarities are, frankly, concerning. We're dealing with forces that are more powerful than the people working most closely with them tend to realize.
Now, do we have a fully working general purpose AI brain yet? No, but we have a pretty good visual and language cortex for it.
A much better one than people who have a good understanding of AI but a weaker understanding of the brain, or vice versa, tend to think. We're making the building blocks of a true general AI and the people who should be most conscious of that are the most likely to dismiss it, because they know how simple the building blocks -- if not the connections they make -- are in the AI, but don't realize the same is true of the human brain. The simple fact that we don't really understand what these models do with the training data should be concerning enough, but there's this attitude that brushes aside that uncertainty about the emergent complexity of these systems just because the starting point is simple.
I honestly had to click through to see what I'd originally written because I largely agree with what you say here! The one thing I would push back on is the notion that humans are exclusively the sum total of their experiences -- I don't think that's something that's really well-established on a scientific or philosophical level. Certainly the sum of our experiences is an absolutely huge chunk of what makes us who we are, but the nature vs nurture debate has not been solidly resolved as 100% nurture.
My point in that last paragraph was not to undermine the emergent complexity of these large models -- they are crazy interesting and complex, and I think we should try to take an interest in what structures emerge within them (despite how difficult it is to actually observe those structures, unfortunately... I'm very pro-AI explainability research!) But just because the human mind and AI both contain extreme complexity emerging from simple building blocks, that doesn't mean either those simple building blocks OR the resulting complexity work the same way underlyingly. Despite the name, neurons in a neural net are not actually so similar to human neuron cells -- they were inspired by the concept of how human neurons work, but the comparison only works on a high level and breaks down the more you know about human neurobiology (or so I'm told -- I've had to rely on other people who know more about that telling me so!)
But in any case, I think you're extrapolating past the point I was trying to make in my last paragraph, which doesn't rely on any argument about AI's fundamental complexity. It's merely that humans and their creativity can demonstrably exist without any relevant input data -- unfortunately, we know this for certain. The unfortunate Genie, a highly abused and potentially intellectually disabled child who spent the first 13 years of her life without any exposure to language and was beaten whenever she made noise, could still tell sophisticated stories using pictures. This stands in contrast to any ability an LLM has to tell a story, or a visual model to create art, which is 100% based on their training data (and indeed requires FATHOMS more of it than a human does to learn the same skills), even if it results in fascinating emergent complexity.
I don't say this to diminish how capable and complex these models are, but just as an example of ways in which human creativity differs from the output of an AI underlyingly -- there may well be something deeper underlying human creativity that extends beyond our own internal equivalents of a language model. I'll leave studying that up to those in that field. Perhaps someday we'll make a more general AI that models cognition in a way that parallels this, allowing it to create even when it has been deprived of input. But I don't think it's disputable that current AI models are not like that.
IANAL, but I like to poke at these things as if I am self-defending. Would love to have a real lawyer evaluate my thoughts.
(assuming they're paying for access to the copyrighted material where applicable).
From what we've seen so far....they're not. They're banking on fair use to shield them from needing to pay for the copyrighted material, the way it was granted for search engines.
My understanding is that copyright law by and large is a creators-first law: Unless covered under exemptions like fair use and first sale, the copyright holder has all the rights. It doesn't forbid training, but it also doesn't allow training. Hence why they're banking hard on fair use...it's the only way they could get away with it without purchasing at least 1 copy of everything.
The real critical thing: Fair use in one context does not necessarily translate to fair use in another context. The four main principles of fair use (in the USA) are the primary determining factors, and while case law is often cited....everything is fungible. And it's entirely possible that we may find text-based generation fair use while we find image and audio generation not fair use. That's how messy copyright is. Below are the 4 principles, with my commentary on how I think they'll be applied. Bold is their emphasis, italics is mine.
Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair. This does not mean, however, that all nonprofit education and noncommercial uses are fair and all commercial uses are not fair; instead, courts will balance the purpose and character of the use against the other factors below. Additionally, “transformative” uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.
So, this is certainly a commercial use, making it much less likely to be fair use. However, I'd say it's definitely a transformative use, which makes it more likely to be seen as fair. I think it will come down to the part I have highlighted: whether or not it substitutes for the original use of the work. And I think there's a solid case to be made, especially by the New York Times, that an AI trained on its data could reduce demand for the NYT's copyrighted works.
Diverting to art models for a moment: if the training data consists only of copyrighted cartoon frogs (CCF for short), and someone says "give me a cartoon frog", that almost certainly reduces the demand somebody might have had to license a CCF. While traditionally the original artist would have had to show harm directly to their own work....the companies arguing that they need everything might give an opening: since they need all CCF, any generation of a CCF reduces demand for the collective input CCF. And expanding that to the likenesses of actors and musicians (and the works they create)...there is a messy unsolved space that OpenAI may well have left an opening to. That's how I would try to attack if I were self-defending...and I'm betting the NYT legal team is gonna try it as well.
Nature of the copyrighted work: This factor analyzes the degree to which the work that was used relates to copyright’s purpose of encouraging creative expression. Thus, using a more creative or imaginative work (such as a novel, movie, or song) is less likely to support a claim of a fair use than using a factual work (such as a technical article or news item). In addition, use of an unpublished work is less likely to be considered fair.
This one is almost a direct strike against OpenAI (and others), and I think they've basically just conceded it.
Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Under this factor, courts look at both the quantity and quality of the copyrighted material that was used. If the use includes a large portion of the copyrighted work, fair use is less likely to be found; if the use employs only a small amount of copyrighted material, fair use is more likely. That said, some courts have found use of an entire work to be fair under certain circumstances. And in other contexts, using even a small amount of a copyrighted work was determined not to be fair because the selection was an important part—or the “heart”—of the work.
So, we know they consume everything. That on its own is worth mentioning, but we know they generally won't spit out an entire copyrighted work. But the "heart of the work" aspect, as italicized, could well be a downfall if eventual (not necessarily current) models are able to spit out assorted "hearts of the work" at the will of the user.
Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner’s original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.
And this one, particularly the highlighted bit, is very interesting. I'd say it's reasonable to conclude that it's probably not harming the existing market (for now)....but I do think it potentially has massive impact on future work. It's easiest to think about this for musicians and actors. If someone can just spit out Matt Damon's voice saying whatever they want, that very much harms Matt Damon's future prospects for work. Since the models would not be able to imitate Matt Damon without consuming works Matt Damon certainly has some degree of copyright claim on, it follows that he may be entitled to royalties for those source works whenever someone prompts for a clip of "say this like Matt Damon".
And again, because this is now taken in the aggregate (because they can't do this without it all), the question I would posit to the court: does consuming all copyrighted works reduce the demand for any copyrighted works? Because if so, there is provable harm; it just becomes hard to prove who is being harmed and by how much.
The open issue is if fair use or copyright law should be amended, and that seems more likely to play out in legislation than in the court.
Fair use is almost entirely court proceedings. Law being amended might be legislation, but that's not gonna happen until a lot of court proceedings shake out.
Somehow, I suspect, all the people eager to flame AI and see it banned, would still want it banned even if it could be 100% documented the model was 100% open source.
It is not a violation of copyright to read something, consider and think about it, read other works and consider/compare/think about them in growing total, and then (sooner or later) use your conclusions to create something of your own. Replace "read" with "draw", "paint", "perform", "sing", "compose" or whatever else you want.
You can't ban technology from evaluating and analyzing work (be it written, drawn, painted, sung, whatever) without also banning that for humans.
I asked before, and no one responded. Write the law that says "technology can't analyze art." Go ahead, someone write that law. Let's see the text of the law. But ... the law can't make it illegal for a human to consume art.
And it's not as easy as saying "technology bad, human good."
What if I run a search algorithm through the text of a book? Did I just infringe, simply because I wanted a word or phrase distribution or some other data array generated? What about search engines, do they infringe when they analyze text, images, sound? What about an ereader or a web browser; do those infringe?
What about a professor, or a self-learning artist seeking to improve her craft and her understanding of it; do they infringe when they sit down to read (or view, listen, whatever) many, many, many works in the artistic field they're most interested in? Does it mean they can do that, but only longhand on a pad of paper? What if they want to use Excel? Or some other software?
Does the housewife who loves romances, and dreams of writing one of her own and becoming the next Danielle Steel, infringe when she obsessively reads every romance she can get her hands on? When she takes detailed, meticulous, indexed notes of everything she can figure out about the books? She tracks every detail she can about characters, settings, about pacing and conflicts. Monitors how long the books spend before the would-be couple meets, when they first fight, how long they stay broken up, how close to the end before they get back together?
What if she does all that on a computer? What if she has the computer help her search for keywords or phrases? What if she's a programmer by day, and codes up some analysis routines to help her learn just how these other romance writers did it?
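(To be concrete, such an "analysis routine" can be utterly mundane. A sketch that just tallies word frequencies across a text; the filename is a made-up stand-in for whatever novel she's studying:)

    from collections import Counter
    import re

    def word_distribution(text, top_n=20):
        # Split into lowercase words and count how often each one occurs.
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(words).most_common(top_n)

    # Hypothetical file, purely for illustration.
    with open("some_romance_novel.txt") as f:
        for word, count in word_distribution(f.read()):
            print(word, count)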
Are the stories she writes, striving with every bit of mental acuity she can bring to bear to emulate what she's learned and concluded from her research, infringing on those she draws upon?
And she can do all of that without paying the authors a single cent. Libraries. She could be one of the best customers of her local library, and those authors would never have gotten any payment from her while she reads and researches their books.
Hmmm, I wonder, can an AI researcher check books out? Can an AI researcher take a camera to the local museum that allows cameras? A rock concert? A spoken word performance? Can an AI researcher hook their computer up to the same internet feeds you and I use, daily? I can read the New York Times if I want, same as you. I can scroll through pictures and anything else available on the internet.
So can the computer.
What I can't do, what you can't do, what the computer can't do, is sell copies of what it reads (or views, listens to, whatever). That's a violation of copy right. The right to copy. But that right only applies to copies. Exact copies. Identical copies. Not "sounds like" copies, not "kind of similar" copies. That exact story, that exact image; that's what the copyright's on. A specific tangible work.
But ... I can do exactly what that housewife does. So can you. So can anyone, including the computer. I can read my favorite genre until I can quote the top twenty books by heart. And rattle off the specific details of the next eighty to round out the top hundred. Then I can sit down and write my own story in that genre. And, if they're words I came up with, words I generated and put together in the order I did on the page ... it's my story.
It has no legal weight or relevance that I sat there with those other authors' stories strong and fresh in my mind while I wrote. The story I wrote is one I wrote. The computer can do the same thing: generate words that form a story. Or put pixels together in a 2- or 3-D space to create an image, a sculpture. Notes on a staff to create music or a voice.
Name an artist who learned in isolation. Who never, ever consumed someone else's art. Who never looked at another painting, listened to someone else's song, who never read stories. An artist who just sat down in the void they inhabit, and busted out art.
Oh wait, all artists consume art. The art of others. That art influences them. Well gee, those assholes must be stealing from those other artists. They should be sued.
AI does nothing humans don't do already. It just doesn't.
infringe when she obsessively reads every romance she can get her hands on?
Unless she pays for it or otherwise legally obtains it via first-sale rights (a la libraries or used book sales)...yes. If she scans that physical book in, then sells or returns that book, that scan is a violation. If she rents a digital book from the library....the library is paying that license on her behalf.
The computer cannot check out books from the library. It does not have the ability, of its own volition, to apply for a library card. The AI researcher could, but transcribing the books into another format or breaking the DRM to add them to a data model would be a violation.
These are not intelligent beings. They are computer programs, doing exactly what their programmers told them to do. It was the programmers (OpenAI) who potentially violated copyright...nobody in their right mind is arguing that the LLM itself is doing it.
OpenAI (almost certainly) took a bunch of illegally obtained books and fed them to an LLM. OpenAI is claiming this is fair use. It's not fair use if I download books3 and read nearly every copyrighted book in existence.
It's fair use for Microsoft to crawl the NYT to index everything. It's questionable whether these perfect copies, used outside the original scope of providing search results, are fair use.
I suspect that OpenAI is saying this to set up their fair use arguments. In particular, arguing that the public interest in the existence of LLMs outweighs the harm to rights holders.
Here is a longish article going into some of the legal details of fair use with AI training data.
I also suspect that future legislation will deal with these issues eventually. I believe the EU is looking at laws that would allow training on copyrighted materials for research and education, but not commercially, though who knows how things turn out.
Personally, I'm skeptical of OpenAI's claims, and of how they have downplayed the extent to which these parameter sets can act as a sort of compressed version of the data. I don't see why, on its face, they should be able to keep a complete encoded store of other people's works against which to charge money for queries by an inferential algorithm. Even interpreting the technology as generously as possible, it is charging money for queries against a very detailed map of latent variables expressed in the works of others, without any prior consent.
Coupled with the fact that OpenAI almost certainly spoofed search engine indexer user agents to get the data in the first place to bypass paywalls, it's hard to argue that they have clean hands in this.
In particular, arguing that the public interest in the existence of LLMs outweighs the harm to rights holders.
Public interest has been thrown to the wayside by rights holders through special-interest copyright legislation. In the last century it's most commonly associated with Disney, though there were earlier copyright laws that favored people with special standing, wealth, privileges, etc. At one point a copyright was 14 years, with an option at the end to extend it for another 14 years, and that was it. The whole point of copyright should be public interest, not rights holders.
Copyright should last something like 20 years, give or take a few if you want to argue, not what it has become now. I'd find it interesting what an LLM could and couldn't do if the training data were all 20 years old. If our copyright legislation weren't so totally broken, it would certainly narrow the scope of what OpenAI could argue is fair use for this use case; as it stands, they happen to have a valid point behind their argument, even if their usage doesn't necessarily fit the spirit of it.
The logic behind patents lasting 20 years while copyrights last longer than most individuals' lifetimes is mind-boggling to me. The ability to copy something for free, at no cost to anyone, should be seen as a massive win for humanity. Imagine if we could replicate food at no cost, just whatever amount of food we needed. Obviously, if you consider externalities such as increasing overpopulation even more, there are possible downsides, but my point is that the actual ability to do it would be a miracle. You could solve the external factors in other ways. Simply denying people the ability to do it because it might prevent someone else from making gobs of money farming is just nonsense.
Now when it comes to something that isn't seen as necessary for survival, like various intellectual "properties", then sure, it makes sense to balance public interest now against public interest in the future, because that's what it is in the end; it's all public interest. Restricting someone from doing something that doesn't harm anyone should only be done in the public interest. We're using public resources (the government) to enforce such laws. Only in an extreme capitalistic worldview would people assume there's some inherent right to monopolize something to extract maximum profits, regardless of whether there is a public interest in it or not. Limited copyright terms are in the public interest, but lifelong or century-long copyrights are not.
I think it would be difficult for OpenAI or others to build the same level of product in a world where copyrights are 20-year terms. They'd be going back to pre-2004 mining of forums and whatnot to build a base dataset, which in many ways wouldn't be super useful today, but it would give more than enough data to make the system functional and logical, meaning it could still string together coherent sentences even without recent copyrighted data. It would also potentially make paying for specific datasets actually feasible, since they'd have a 20-year gap to fill to make the model more relevant.
This is a really interesting point. A lot of the dialogue is around creators and their rights: an artist is entitled to their work for life (plus additional years), and OpenAI is violating their rights by using their work without permission. But are those rights good for society as a whole?
(I suspect no. Ideally, we would return to short copyrights, but I don't have much faith that Congress will do that.)
But supposing we did have short copyrights again… Personally, an LLM trained only on 20-year-old material would still be tremendously useful to me: maybe it wouldn't know the latest slang, or what a "smartphone" is, but it would still know enough of the English language to function as a writing assistant.
Copyright creep and questionable lawsuits are absolute poison to music. Music has fundamental laws, and concepts like genres depend on intentional similarities. But we now have high-profile lawsuits over things like chord progressions and ostinatos so obvious that they already exist in classical music.
In the 80s and 90s, the advent of sampling hardware created an explosion of creativity (the dawn of house, techno, hip hop, etc.) that was rapidly clawed back by lawsuits over musicians doing what music is about: respinning ideas into new ones, the same as any form of communication. And we are that much culturally poorer for the blatant theft from us that copyright's chilling effect created.
Copyright gives a few a fiefdom and steals rights from everyone else. Your basic right to expression is being infringed upon, so the 1% of artists (and their grandchildren) can rent-seek forever.
But are those rights good for society as a whole?
On a similar train of thought: Has OpenAI (and the protectionism of its inevitable profitability) done more to damage the copyright and likeness rights than piracy ever has? And what is the real fallout of that damage?
I think there are a LOT of things that are bad about current copyright law in ways that benefit huge corporations more than small creators whether the huge corporations are acting within the law or not. I'd honestly be for a HUGE opening-up of copyright law if not abolishing it altogether. But I think when it comes to cases like this, people are mostly speculating on how they'll turn out and their impacts on the current landscape, which relies on current copyright law and its interpretations.
Right; my entire post was speculating about how things will play out with the courts with the laws as they are today. Not that I'm opposed to a philosophical conversation that other replies seem to want, but how I think things should be in an ideal world is not the same as how I think things will play out in case law.
People also seem to love being obtuse with my user agent reference rather than the statement of intent to bypass content protections. 😂 Internet's going to Internet.
I suspect this will be a contentious stance, but by no means would I consider user agent spoofing to be “dirty hands.” There’s nothing unethical about crafting your own HTTP headers.
Intentionally bypassing paywalls to harvest others' content for commercial use doesn't sound clean to me.
Spoofing your own for non-commercial purposes seems entirely different.
Eh. User Agents have been considered completely unreliable for a long time. There are enough crawlers reporting as users, users reporting as crawlers, and other misrepresentations at this point that it's a lost cause.
Chrome the browser, for example, reports itself as:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Then there’s all of the user-initiated changes with privacy extensions or manual modifications.
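(For the curious, there's no trick to it either; the User-Agent is just a string the client chooses to send. A sketch using Python's standard library, with a placeholder URL and the Chrome string quoted above:)

    import urllib.request

    # The User-Agent is an arbitrary, unverifiable string set by the client.
    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
          "AppleWebKit/537.36 (KHTML, like Gecko) "
          "Chrome/120.0.0.0 Safari/537.36")
    req = urllib.request.Request("https://example.com/page",  # placeholder URL
                                 headers={"User-Agent": ua})
    with urllib.request.urlopen(req) as resp:
        print(resp.status, len(resp.read()))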
Whatever the means used, the point is that intentionally bypassing paywalls to copy content for commercial purposes doesn't exactly paint them in the best light.
“Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” said OpenAI in its submission, first reported by the Telegraph.
They're wrong. The MIT license is the most popular software license in the world and the majority of projects on GitHub are licensed under it. Content on Wikipedia and the Stack Exchange network is CC-BY-SA, and images on Wikimedia Commons are public domain. These all permit usage by essentially anyone for any purpose, so long as modifications to the content are propagated under a compatible license. And these are massive datasets.
What OpenAI has done to get their hands on their datasets is explicitly in the wrong: Twitter, Instagram, the New York Times, Getty Images, etc. all have provisions in their terms of use stating explicitly: "you are not allowed to scrape content without prior consent [of the company]". I don't know the legal weight of a website's "Terms of Service", but OpenAI very much broke them.
Tangentially, I find it to be such a shame that much of the individual pushback against these soulless corporations sucking up people's life work for profit has focused on reinforcing our existing bullshit intellectual property laws, rather than accepting that those laws are harmful and cut us off from our own culture, and putting thought into shattering the copyright system and embracing the better, already successful ways to fund art and artists (Patreon, Bandcamp, Itch.io, and the like).
Well, that's a long time coming, since we transitioned from a morality-based society to a law-based society eons ago. Weasel words like "but is this a crime, really?" or "does it really say...." or "surely it doesn't include...." have been with us probably since the time we could reach for fruit that wasn't allowed.
What we COULD do, though, is to fix existing laws to make them less BS, and I believe that is being done all the time :)
(False dichotomy alert)
The alternative is a society that has super-vague "laws" like "you can't perform actions that may hurt the country" or "it's illegal to act on what may be foreign influences": they are so broad they can only be enforced by fear and bullying (the fear is part of it all the time), because everyone could be guilty at a whim. "You should know better", even if it's not written down, is a guilty verdict.
Imagine if any other defendant tried to justify their infringement of a plaintiff's property rights on the basis that the infringement was "necessary" so that the defendant could secure a commercial advantage!
Tech companies are really really good at rhetoric and dressing old ideas up as new ones, which helps to mitigate negative pre-conceived notions people hold about the old ideas.
Your honour, only the tiny hands of malnourished orphans working in dark caves can weave these rugs with such intricacies: it's a necessity literally woven into the fabric of these rugs. Feel how luxurious this one is - it blinded seven orphans, after all. Or do you prefer this one, which blinded ten? (Idea from Margaret Atwood's The Blind Assassin)
What rubbish. If it can't exist without violating the rights of others, then go back to square one and start paying for training datasets. That would solve all kinds of problems as well, like the naked-Asian-women and hate-speech biases, and you could even do hands properly if you had a large dataset trained specifically on paid models' hands, with tracking dots and tagged with specific gesture, angle, and lighting settings.
https://huggingface.co/Mitsua/mitsua-diffusion-one is a generative AI model being trained from scratch using only public domain and voluntarily opted-in material. It's still a work-in-progress and currently it's not yet very good, but I think it's great that they're at least trying to create an ethical AI art dataset.
I was in one of the closed beta tests of OpenAI DALLE2. During onboarding we were told that the AI art generator was strictly "for research and non-commercial use". A few short months later, it totally became a commercial product.
Corporations using the "fair use" defense to take creative work en masse, without consent, and then use it to create commercial products that then displace the very creators of said creative work, isn't very "fair use" at all. Fair use is meant to benefit the public. But what's happening with AI is the reverse-- it's corporations extracting value (data, creative work) from the public, then reselling it back to the public and privatizing the profits.
"To the extent that this AI derives value from its input data, it engages not with mere facts about copyrighted materials, but instead with the protected, expressive aspects of those materials. Such an expressive purpose does not make a use per se non-transformative. But it does make the rationale of non-expressive fair use unavailable."
In the same paper, he wrote about the distortion of fair use in today's digital economy:
"Today’s digital economy upends this narrative. … This pivot in market dynamics should prompt a corresponding shift in attitudes towards fair use. The doctrine no longer redistributes wealth from incumbents to the public; it shifts wealth in the other direction, from the public to powerful companies."
AI that is trained on the commons should give back to the commons as well. I would support laws requiring commons-trained AI tools to be free and open-source.
I would also support laws requiring commercial, corporate-owned AI to properly license material used for training. If corporations consider copyrighted work to be valuable enough to be necessary for training, then they're valuable enough to pay for. This would respect artists' consent, but also privilege corporations that already own large IP collections. So I would also like to see copyright law adjusted to prevent corporations from hoarding IP, especially for a long time after an artist's death, as they do now. The default copyright term can also just be shortened in general. I don't see why it needs to be so long after the artist's death, since copyright's main purpose should just be incentivizing and rewarding the artist.
Going beyond copyright, I would support laws strengthening our rights to data privacy, and reversing the current trend of corporations harvesting our data and using it in ways that maximize their profit, to the detriment of the public good. Here's an article about how data harvesting perpetuates systemic oppression. For example, RELX’s LexisNexis products have been used to surveil protesters and immigrants. Btw, Stable Diffusion's lawyer is Mark Lemley, co-founder of legal data analytics company Lex Machina which was acquired by LexisNexis in 2015.
IANAL, just an artist with personal experience pursuing infringement cases.
Often what's considered impossible today becomes possible in the future. This is a pretty childish and laughable defense. Hope NYT wins.
Worth mentioning that the whole conversation around LLMs seems to neglect the 50+ year history of technologies being labeled as "AI" in their early stages and later not being considered AI once they became well-trodden and better understood (e.g. speech recognition, natural language processing).
Sorry to necropost, but I feel like I'm alone in my opinion: I don't really care. I think the existing copyright systems for the most part don't actually benefit society, just the copyright holder, so to me all the discussion about whether it's legal just isn't interesting. Is it moral? That's certainly a question to answer, but in my mind, if it furthers development of a technology like this, I'd be inclined to give it a pass. For research, I think this should be allowed.
And an additional thought: even if it’s not legal or moral, people had to have known that it could happen. I mean, the internet is open to anyone.
Pay for it.
You know, it is impossible for me to be entertained by DRM'd entertainment, therefore pirating should be legal. /sarcasm
We should just expand the library system dramatically! Then the question would be whether or not an AI can get a library card
It would need to have a proof of residency, at a minimum. For me that was an ID or utility bill.
And the AI has to follow the rules of its local library: Only a handful of digital and maybe 20 physical books at one time.
If a company can be a person, surely an AI can be one too
Isn't that specifically for one thing like taxes or something?
AI already can't hold a copyright... there are a whole bunch of legal precedents about to be set in the next few years
The original case was about the speech rights of corporations (specifically, political speech). Off topic, but for context that decision (Citizens United v FEC) led directly to the emergence of Super PACs and a subsequent tsunami of "dark money" flowing into politics.
The concept of corporate personhood in American jurisprudence is much older than Citizens United, going back at least to Santa Clara County v. Southern Pacific Railroad Co. (1886) or arguably to Trustees of Dartmouth College v. Woodward (1819).
It certainly isn't for everything. I don't think corporations have mandatory schooling in their formative years, for one. Learning addition and all the countries, etc.
I know this is said in jest, but I hadn't considered it. Thanks for another new dystopian nightmare.
...is it legal to scan library books while you've borrowed them and save the results? I'd kinda just assumed it wasn't tbqh.
In the UK at least, you are allowed to copy a small amount (one chapter or up to 5% of a work) as long as it is non-commercial.
IANAL but as someone who has done a lot of amateur studying of this area of law, I don't think your presentation of AI training as definitely fair use is good. AI training may or may not be fair use; it depends both on the specifics of an individual case as well as on legal precedent that has not been set yet. At the moment we can speculate and argue about whether AI training is/should be fair use, but anyone who claims it is or isn't with 100% certainty is betraying a poor understanding of the situation, because it absolutely is not 100% clear that AI training is or isn't fair use legally.
The portion of the law you quote directly precedes the list of four factors that judges actually use in analyzing whether something is fair use. The purposes listed there are indeed purposes that are intended to be protected by fair use, and that's accounted for in the first factor:
The law is actually explicit here in saying that the commercial nature of a use is part of the first factor in a fair use analysis. It's definitely not the only part, and it's perfectly possible for something to be commercial and still be fair use. But it's certainly not completely irrelevant. Furthermore, even ignoring the commercial aspect, even noncommercial works won't necessarily always be fair use, because purpose/character of the use is only one of the four factors. Even if the commercial factor is considered irrelevant because AI training is research (and it certainly is research, though as someone in the field with a linguistics background I think "research into the underlying nature of language" is... overselling it), it's possible to still not be fair use even if first factor weighs in favor of fair use. It is one of four factors for a reason.
The second factor is:
the nature of the copyrighted work;
and I think the section on this factor on Columbia University Libraries' page on fair use can be illuminating when it comes to how this can conflict with the first factor (emphasis mine):
I wanted to emphasize that particular sentence because it shows how the purpose and character of the use can be wholly educational, but it would be unlikely to be fair use. Of course the actual weight of this factor for or against fair use depends wholly on the work being used, not on how it's used, so it's impossible to make broad strokes statements on this factor for AI training. It'll depend on what's being used and how it was acquired.
The third factor is:
the amount and substantiality of the portion used in relation to the copyrighted work as a whole;
Of course, AI training almost always uses the entirety of a given work. I suppose you could argue that use of the whole work is necessary for their purposes. A general rule of thumb is that you can use as much as is needed for your purposes, even if it's the whole work. But I'm still not sure that would fly here -- one could argue that unlike with criticism and satire, which is usually targeted at a specific work and thus must use portions of it, AI training like this doesn't require the use of any one specific text, just large amounts of texts in aggregate, and therefore for each individual work it doesn't strictly need to use any of it. Not sure whether that argument would fly in court either. But in general I think this factor weighs pretty clearly against fair use for AI training. Which is fine -- it can still be fair use if the third factor weighs against fair use! There are lots of uses we don't want to forbid that use the entire work, and this is the reason we have four factors in this analysis.
The fourth factor is:
the effect of the use upon the potential market for or value of the copyrighted work.
This is the factor that's probably going to actually get litigated the most (both in the court of public opinion as well as in actual court once it gets there), and it's also almost definitely where the commercial element is going to play a bigger role. I think it varies a ton based on the particular work in question. I think fiction writers have a pretty weak claim here (no one is using AI models like these to replace reading/buying their books) EXCEPT for in the case that their books were pirated rather than purchased for training. To quote Columbia University Libraries' page again:
That's not to say this definitely invalidates that factor for those books, but I think the piracy does definitely weaken OpenAI's case with regard to this factor (and maybe the second factor as well; IANAL, so I'm not an expert on where that part would fall in a judge's analysis). I think a nonfiction or reference work, such as an encyclopedia, is far more likely to have a case that the resulting LLMs are being used commercially in a way that will negatively affect the market for the original works. But I think visual artists have an even stronger case when it comes to this factor -- it is VERY obvious even to a layperson that AI art generators will have huge effects on the market for visual artists.
However, actual analysis of the other three factors isn't even the biggest problem with confidently declaring that AI training is fair use. The biggest issue is that fair use analysis is assessed on a case-by-case basis. The specific facts of a case have a HUGE effect here, and hopefully seeing the factors laid out makes it clearer that even for a specific AI model, it's possible that some of the use was fair and some wasn't. Moreover, we can't even try to generalize from other AI training cases that judges generally consider it fair use, because we don't have any court decisions in which a fair use analysis was done on AI training yet. As someone who works in AI and has a strong amateur interest in copyright law, I'm super excited for when we do get one! But we don't have one yet, and that means it's not really possible to make even generalizations about how AI training is seen when it comes to copyright and fair use. By contrast, we do have precedent on some other AI-related copyright issues (such as the inability of the output of generative AI models to be copyrighted, which is very well-established already).
Maybe judges will find that OpenAI's use of copyrighted works for training data was 100% fair use all the way! Even if that happens, it will only potentially inform future lawsuits against AI companies for doing similar things; it will still absolutely be possible for another instance of AI training on copyrighted work to be found to not be fair use, because the individual facts matter so much in fair use cases. Without any clear precedent on this issue to point to, it is inaccurate to confidently claim that AI is considered fair use. Maybe you think it is! And maybe you'll later turn out to be right! But even a lawyer or judge shouldn't frame it as anything more than their professional opinion until it's actually been decided in court.
I think we've had about as much productive discussion on the copyright side of things as two non-lawyers with different opinions on a fairly novel legal question can tbqh! Aside from agreeing to disagree and seeing what happens, of course.
As for whether LLMs offer much insight into the fundamental nature of language: my background prior to getting into NLP was in theoretical linguistics. Language models are super useful for research about language, but I think from that scientific background they don't teach us anything fundamental that we don't already know. They're useful from a perspective of gathering and analyzing data, but when it comes to the Big Questions, those have either already been answered or aren't really addressed by this type of AI. They can help us study human language use, but they don't really illuminate anything truly fundamental about How Language Works that we don't already know.
If AI used some sort of explicit underlying structure in how it analyzed language, that could be used to argue for a theory of syntax or psycholinguistics (or both) that maps closely onto it. But since these models are built to rely purely on statistics (which describes pretty much every model for a long time now) and the models themselves are black boxes, we can't really do this. Even if we could, it's not necessarily the case that the way these language models underlyingly model syntactic structure is the same way human beings do, so even that would certainly not stop the squabbling over how syntax in human language is truly structured underlyingly.
The biggest thing these models have done in this regard is hugely validate the idea of distributional semantics, since its principle is foundational to pretty much every remotely modern language model. But even that isn't really teaching us anything fundamental -- that linguistic theory started in the 1950s and inspired those approaches to language modeling, rather than the other way around. So while it's super cool to see how thoroughly the concept has been validated by LLMs, it's not something that imo teaches us anything new and fundamental.
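(If "distributional semantics" is unfamiliar: it's the idea that words occurring in similar contexts have similar meanings, and you can see the principle in miniature with nothing but co-occurrence counts. A toy sketch, standard library only; the sentence and numbers are contrived for illustration:)

    import math
    from collections import Counter

    def context_vector(word, tokens, window=2):
        # Count the words that appear within `window` positions of `word`.
        vec = Counter()
        for i, tok in enumerate(tokens):
            if tok == word:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        vec[tokens[j]] += 1
        return vec

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    tokens = "the cat sat on the mat the dog sat on the rug".split()
    # "cat" and "dog" keep near-identical company, so their context
    # vectors come out similar (cosine of about 0.87 here).
    print(cosine(context_vector("cat", tokens), context_vector("dog", tokens)))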
When I think of questions that advance our fundamental understanding of language, my mind goes to questions about the existence and nature of the human language faculty and whether certain features/structures are so fundamental to human language that no natural language will occur without them. These are the sorts of things that are big, fundamental questions in linguistics that are largely unsolved and very controversial. But language models are more or less useless in studying these.
I don't want to downplay the validity of research into language models, fwiw. It's great stuff and these models can be practically super useful in conducting theoretical linguistics research, in addition to their myriad other practical applications. But most of their utility when it comes to answering fundamental questions about language comes from using them as a tool in studying human language, as the models by themselves mostly reflect what we already know or are black boxes.
I hope this comment was somewhat coherent and interesting... it's a bit late at night my time so let me know if anything is confusingly written and I'll try to clarify when it's daytime for me 😅
This just doesn't seem morally right to me. For research purposes, feeding an LLM a bunch of research material is of course fine. But when the product is "a thing that can algorithmically mimic anything, because it ate the internet", your product isn't the algorithm, it's the material it has consumed, and it seems just wrong not to recognize that.
I guess the courts will decide this fair use argument for us, but man, it just seems bad for society as a whole to disincentivize the creation of original creative works, original reporting, etc. But then again, you can argue that this also creates some new works and can streamline the process when used as a tool.
We'll probably have to agree to disagree, but there's a little misunderstanding:
I'm not saying these AI models should be made illegal or denied fair use; like I said in my previous comment, using them as tools for research purposes seems totally fine to me. What I mean is that there's a real difference between selling a tool that's able to analyze, interpret, and reorganize data, and the currently available services that are actually selling the metadata (or just data, I think it's kinda semantics) without you having to do anything but prompt it.
I might've explained this poorly but hopefully you understand the difference (I think there is)?
Just wanted to fix that misunderstanding, other than that I think I understand your argument and just don't agree.
You may be interested in info law scholar Ben Sobel's commentary and paper regarding the inadequacy of the fair use doctrine in handling AI art issues. He wrote about the difference between "non-expressive uses of AI which do not infringe copyright" (for example, Google Books), vs. "expressive, market-encroaching uses of AI which may infringe copyright" (for example, generative AI models). While the former "promulgates information about a work", the latter "usurps that work's expressive value" and could "diminish the demand for the human-authored works on which it trained". He explains that this distinction is relevant to "the two most important statutory fair use factors: factor one, the purpose and character of the use, and factor four, the effect of the use upon the potential market for or value of the copyrighted work."
I think we need to look at the purpose of copyright. The basic idea is to give a limited "right" to reproduce one's work, so that there is an economic incentive to create stuff. But to avoid this "right" being used to limit others' creations, we have "fair use" or "fair dealing", which grants exceptions. The basic idea in all of this is a system which stimulates creation.
So, if a graphic artist who has spent a lifetime mastering his craft can have his style analysed so that anyone can create pretty much the same thing in a matter of seconds, this does stimulate creation. A person making a tabletop game can give it superb graphics far beyond what his budget would normally allow.
The flipside of this is that there is no longer any actual value in learning the craft. So fewer would bother, because what's the point? So this does hinder creation.
So the final verdict ... who the heck knows. But it doesn't really matter, because the tech is out there, and I don't think regulation can do much about that.
Seriously. OpenAI has a market cap of $80 billion.
Traditional established modestly successful authors may receive an advance of $1k-10k for their works.
So let's say OpenAI puts $10 billion into content farming. They could offer authors $5k advances. The authors still get to sell their books, but in return for the advance, OpenAI can use the novel in their model training. $10 billion would get OpenAI access to 2 million full-length novels. And they can commission works in whatever genres or forms they need most.
Or, likely even cheaper, they could just directly pay authors for the rights to their existing work. I imagine many small authors would be willing to license their works for a few hundred bucks. After all, you're just trying to build a general language engine; you don't need to hire A-list authors for this. There is no shortage of small-time authors out there who would let you throw their novel into the grist mill for a few hundred bucks, and no shortage of aspiring authors who would happily take a $5k advance, write a novel, and let you use it in your AI training as long as they could retain the other rights to the work.
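For what it's worth, the back-of-the-envelope math above checks out; here's a trivial sketch using those purely hypothetical figures:

```python
# Toy sanity check of the hypothetical content-farming budget above.
budget = 10_000_000_000   # assumed $10 billion set aside for content
advance = 5_000           # assumed $5k advance per commissioned novel

novels = budget // advance
print(f"{novels:,} full-length novels")  # -> 2,000,000 full-length novels
```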
So that's it. We're locked in with only OpenAI being able to train things to the level they did. A de facto monopoly, broken only by China, which doesn't care about the ownership of bits on the internet.
Yeah, this was my exact thought when this story broke last week. Claim "the market needs regulation!", knowing that you (or friendly parties) will write the regulations so that you build your moat after you have already extracted enormous value from using other people's content freely, locking in the current players as what we'll have "forever".
I mean, the logical thing to do would be to forcibly tear down everything they built and force them to redo it from scratch, publicly audited. Bet that shrinks that $80 billion a good bit. Plus, if they are found guilty of violations and discovery reveals the extent of them, every copyright holder could potentially seek punitive damages of $100k per work, per violation. And if they use RIAA math, every time they copied that work to another computer is a potential violation.
Especially now that they'd need to keep track of a massive compliance database.
There is an alternate path. You could get authors to donate their works to an open source nonprofit project that is prohibited by its charter from engaging in any commercial activities, or owning any subsidiaries that do so. (Trying to avoid OpenAI's weird structure here.) These AI models could be created as public projects by people donating their time and their computing resources to the project. Authors could in turn freely sign over the use of their work for the purpose of AI training.
Or, amusingly, we could train our LLMs on public domain works. Think text from before the 1920s or so: novels, newspaper archives, scientific journals, databases of private diaries, etc. Then the resulting AIs can be not only racist, but antiquated racist! They'll have to constantly filter them to keep them from defaming the Irish and Italians.
You don't get to GPT-4-quality LLMs by having random works donated. Words aren't just fuel you put in the vehicle.
Quality matters. Extensive, varied, and well-curated datasets are what's needed. A bunch of low-quality fanfiction will not cut it.
Furthermore it’s like looking at what’s been achieved by the internet in the past few decades and saying “Yep, let’s not make something amazing with all this — instead let’s ask people to redo it all for free”.
OpenAI broke copyright. Yep absolutely. So did Spotify and Netflix. So did a LOT of other companies throughout the years. Companies providing useful services for the world, from entertainment to productivity.
Is it ethical to profit from it the way they do? Absolutely not. Is it reasonable to ask them to build it under the current framework? No, it’s not reasonable either.
All the framework is doing is preventing ethical people from building these amazing tools and leaving them only to those willing to break the rules. We need a different, sustainable system, not more people giving their work away into the public domain for nothing.
There's https://huggingface.co/Mitsua/mitsua-diffusion-one which is a model being trained from scratch using only public domain and voluntarily donated material.
Interesting, thanks!
I literally had this conversation with a friend of a friend who works on Copilot.
(paraphrasing)
What’s funny is she doesn’t even work for OpenAI. She works for Microsoft. To her it’s more like rooting for her home team and less a way to get rich. FAANG companies are the Bay Area’s sports teams.
Note that they aren't just talking about full books, here's the actual quote cited in the article that the headline is from:

> Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today's leading AI models without using copyrighted materials.
For proper stories I can at least see the argument—and if nothing else they should have to buy the book—but I dunno if extending that to "pay every Twitter user per Tweet" is as reasonable. (No idea about the legality though, I'm talking in the plain ethics sense.)
Presumably they do. If someone could prove that OpenAI or anyone else was using pirated material in their training data, then that would set them up to argue that the number of infringements somehow scales with the amount of training. Even with conservative penalties for copyright infringement, that would obliterate a defendant.
This is almost exactly what is currently being litigated in regards to this whole mess. They paid what the rest of us pay for most of this stuff: nothing. The problem is that they're using it to make a profit. I'm not reading a New York Times article and then re-posting it as my own writing in my newspaper that I sell. I explicitly do NOT have the right to do that, and if I want the right to do that, I have to pay. And we know they do not have those kinds of deals in place, because it'd be insane.
They're arguing the AI is transformative enough that it counts as a new work or whatever, and those who feel ripped off are calling bullshit, since with proper prompts it can literally just recreate the original work in some cases, or the resulting output carries the signatures of the artists whose work was used.
And that's just focusing on art, not video or text where it looks even worse. Can I just republish someone else's book with all the names changed and not be infringing? Probably not, but you can clearly use the AI to do exactly that, and so on.
Where I see it getting really thorny is when you start asking, "how similar/different are the processes of LLM training vs how humans learn?"
I was not born learning how to speak any language. Yet, now I can speak and write with some reasonable coherency. How did I gain that ability? While some of it was formally trained, much of it wasn't. I learned most of my speech abilities simply by interacting with people, picking up vocabulary, syntax, grammar, etc.
But how does such learning work? A young child will hear a word and try using it. They initially have no idea what it means. However, by using it, they receive feedback from other people around them. Eventually, through trial and error, they learn the correct way to use the word. Even in formal educational settings, much of learning occurs through a practice/feedback cycle. The processes we use to learn don't really seem all that different to me from how an LLM does it. Our minds are just more purpose-built for the task of learning language, so we do it far more time- and energy-efficiently than an LLM does.
But in terms of copyright, there's no obvious easy answer here. If you say, "no copyrighted works for LLMs," then you are now applying a more stringent standard to AI-created works than applies to human-created works. If that standard applied to humans, no person who has ever read a copyrighted work would be eligible to write a book.
Rather, I think the best path forward to address copyright and LLMs is to apply the same standards that apply to human-created works. The issues raised by LLMs are not new; the courts have been dissecting the question of "how close does it need to be to be copyright infringement" for generations. Does using a common saying, even if you first saw it in a copyrighted work, count as infringement? Probably not. How about a single whole sentence? What about a few sentences? What if it's quoted and properly cited? What if I copy your whole book, but I make it one big quote and properly cite it? These are the issues that the courts have spent centuries hammering out the jurisprudence on. I think we're probably best just applying the same copyright standards to AI writing as we do to human writing. If something is too close for humans to legally do, it's too close for AI to do.
And ultimately, this is probably the only viable path going forward. It's already often difficult to tell if something is AI-generated or not. In time, it will become even more difficult and eventually impossible. Even if you pass a draconian law that says, "no AI generated work can be published for profit," you then have to define what "AI generated" means. Search engines are all implementing AI into their engines. If I use a search engine while doing research for a novel, does that mean that novel counts as AI generated? If AI writing is indistinguishable from human writing, how do you even begin to prove that works are AI generated? What if I write a novel completely myself, but I use an AI-translation software to translate it into another language? What if I write the novel, but I use an LLM-powered spelling and grammar checker to aid in my proofreading? What if I do all the research, writing, and editing myself, but the initial idea for the work was inspired by an AI-generated blog post I read? Etc. As LLMs become more and more integrated into everyday life, trying to create any work without at least a tiny amount of AI assistance will be near impossible. Unless you do all the research with physical books at a library, and write your novel on a typewriter, some AI will likely be involved in some small way. With how ubiquitous this stuff will likely become in every aspect of our lives, trying to ban any use of AI in written works seems folly.
So I think the best path forward may just be to treat LLM-generated works just like human-generated works. Does your LLM tool regularly output paragraphs that have snippets taken from authors that, if done by a human, would count as copyright infringement? Then your company is now liable for that infringement. Does your model output a space opera that is formed from a distillation and synthesis of the thousand top sci fi books ever written, but isn't an obvious and direct copy of any of them, and it doesn't use long direct quotes from any of them? Well, a human wouldn't get in trouble for doing that, so neither should your AI.
By applying this standard, you completely avoid the impossible problem of trying to determine whether a given work is AI-generated or not. And you also avoid having to wrangle with precisely how much AI contribution is needed before a work counts as "AI generated."
As someone with a background in linguistics, I just want you to know that you have unintentionally opened almost every possible can of worms when it comes to this part of the field in these two paragraphs lol. I actually agree with most of what you say here and elsewhere in your comment (though I think the way LLMs learn is further from what we know of how humans learn the more you get into their underlying structure and points of failure) but I just wanted you to know that I could pick out a couple sentences from these paragraphs and start a huge fight at any meetup of linguists lmao.
Ok, now I'm really curious. Give me the juicy details! Just how many academic traditions did I completely step all over? :D
The big thing is about our minds being purpose-built for language. There's a lot of disagreement within linguistics about exactly to what degree this is true -- the idea that we have something in our cognition purpose-built for language. This on its own is only a little controversial, but it dovetails nicely into the holy grail of starting linguistics arguments: Universal Grammar.
Universal Grammar is a very historically important linguistics theory originally introduced by Noam Chomsky. It's also by far the best way to start arguments, as people tend to be passionately for or against it. To try and simplify it to the fundamentals in as non-technical a way as I can:
Kids learn language fast and well. Too fast and well to be learning solely from exposure to language around them plus trial and error. They don't have enough exposure to learn so much complex grammar from nothing. This is known as the Poverty of Stimulus and isn't super controversial on its own afaik.
To account for this, UG theorizes there are some very low-level fundamental language structure rules that we're all born with. This is the titular Universal Grammar.
The Universal Grammar is pretty low-level and allows for languages to vary in a bunch of different ways (called Parameters). They're like switches you can switch on or off for different languages that result in differences in their structure. This way when kids are learning language as babies, they aren't having to learn the super fundamental underlying structure, but instead they're learning which "switches to flip".
There's more to it and it's varied a ton as a theory over time and between linguists. But that's the core idea. It is... controversial. You can start arguments by asking about it, assuming the group of linguists you're in is diverse enough in their theoretical framework. Search for Universal Grammar on r/linguistics (which isn't even principally populated by people with formal education in linguistics) and you'll see what I mean almost immediately.
Related to UG, there's a linguist named Dan Everett who claims to have found a single language in the wild that violates one of the most fundamental parts of UG. He will inevitably come up when an argument about UG gets suitably heated. Just bring up his name around linguists and you'll get a ton of strong opinions. I'm holding back on sharing my own right now for the sake of not making this comment even longer and more rambly than it already is.
That's really interesting. Thanks for sharing! I think I've probably heard of Universal Grammar, Chomsky, and such somewhere along the line. But I couldn't recall the terminology. But I've definitely heard before of the idea that the structure of the brain is hard wired for language. I've definitely heard this concept as a possible explanation for why LLMs take so long to learn even basic language.
And by "so long," I mean the amount of text they have to digest. Some searching suggests GPT3 was trained on 8 million web pages, and GPT4 an order of magnitude beyond that. Let's say each web page has 500 words on it, and 50 million web pages are required to train an LLM, for a total of 25 billion words. The average adult, let alone a young child, can read maybe 200 wpm. To read through all the text used to train an LLM would take an adult 125 million minutes, or 237 years. It would take far more than a human lifetime to read through all of the text necessary to train a modern LLM. Even if this very rough estimate exaggerates by an order of magnitude, it would still take you decades to read through all the text used to train these models.
But it's actually much, much worse. This is writing, which is something you could theoretically do continuously. I can sit down in front of my computer and spend an entire day doing nothing but reading if I really want to. But we don't learn our initial language understanding from writing, we learn it from speaking. It's very important to talk to your children and let them overhear conversation, but no parent sits their 2 year old in a high chair and reads them books 16 hours a day, everyday. (At least I sure as hell hope not.) In reality, young kids probably only have the opportunity to actively hear people talking at most what, 20% of the day? Yet they still manage to learn.
In short, we are far, far more efficient at learning language than our best LLMs. And Universal Grammar, or just the vaguer concept of the brain being "hard wired" for language, is a common explanation of this.
Oh yeah, the sheer amount of data you need to train LLMs to reach their current abilities is a great example to use for poverty of stimulus -- and the poverty of stimulus is also a great argument against people who think LLMs are the same as how language works in the human brain.
This makes me miss back when I was doing proper theoretical linguistics... but unfortunately academia isn't a great place for money OR good work-life balance 😔
I don't think that's the right metaphor here. The AI is (in the cases we care about) not re-creating the training data for distribution.
That is a problem, and AI companies probably can/should be liable for that. But from my reading of the article (I haven't looked further into the lawsuits it talks about), it sounds like the complaint is about the copyrighted content's use for training, not about the model's ability/tendency to distribute the copyrighted content.
I agree that this is a problem, but in the event that we end up with a court decision in favor of OpenAI, I'm really hoping to see someone start distributing small models trained specifically on certain texts (like textbooks) that can recreate whole pages based on a specific prompt. I would love it if the college textbook industry could be reshaped by AI.
I can imagine an AI model like "General Chemistry AI" trained on "General Chemistry: Principles, Patterns, and Applications", 3rd edition.
Facts aren't copyrightable. Anyone can write a textbook for chemistry, physics, math, language, whatever. The "flavor" text is copyrightable; the wordy examples, the introductions, the discussion of the facts and so on, those are all unique elements that are copyrightable. But the actual core of the subject, that's factual and "open source."
Modern universities aren't about teaching. They're about forcing students to buy textbooks. Which come tied to websites, which require a unique one-time use code to authenticate your student account to. So you can't copy, borrow, or share someone else's textbook. You can't use last year's textbook. The companies aren't interested in that; they want each student to pay several hundred dollars, per class, for a new textbook.
They entice professors with online homework and quizzes and so forth, which go through the website, which needs the code. So the professor says "fuck it, I'll use Pearson's 2024 Chemistry 101" and three classes of thirty students each have to pay $295. And where a professor might not be down with that, their bosses in College Administration are (thanks bribes and payoffs!) and will lean on the prof to use the newest locked down textbooks.
It's a scam. But AI can't fix that. AI, anyone, can already write a textbook. They just have to sit down and do it. Textbook companies have corrupted universities though, and fixing that requires some fundamental changes in the system. Which no one's really interested in, since the textbook companies make sure to divert a portion of their guaranteed profits into maintaining the corruption in the universities.
Yeah, I don't know how much things have changed, but I don't think I bought a single new book for my entire degree. Indeed, my professors would strongly recommend that we buy used, borrow or go to the library.
As of a few years ago, the books were basically a farce at the university my cousin went to in the US.
They didn't bother to bind them. When you bought a book, you functionally bought a stack of papers that may or may not have been proofread, along with a very important sticker on the front with a one time use code that you used to assign yourself to a class on the vendor's website.
Sure, you could go rent the book, or buy used, but the one time use code was the key to the website that the university (or at least a good chunk of the professors) required homework to be turned in through. Having the book itself was inconsequential.
Are you talking about LLMs here or image generation? The article is about LLMs, where recreation of source materials is easier I think (didn't dig into it, feel free to correct me), but with current generative image models I don't think this has ever been proven.
I know of one study that managed to do exactly that; however, it was using an early unpublished research version of Stable Diffusion whose dataset was 1 or 2 orders of magnitude smaller, was not filtered for duplicates, and suffered from overtraining, because some images were present in 100+ identical copies. Despite that, it took iirc hundreds of thousands of attempts to generate one reasonably similar copy of the original image, and that was when they knew the perfect prompt to use in advance. The results did not apply even to the oldest public version of Stable Diffusion, which did filter for duplicates and used several times more distinct images in training.
The following paragraph is (emphasis mine):
Then you should back that up, because it's a pretty strong claim that as far as I know is not true.
Literally recreating original work was afaik only shown in the study I mentioned (an unpublished overtrained version, and even there reproduction was exceedingly rare), and the only additional study I know of, which used the first public version of Stable Diffusion, claimed "copies" but in fact demonstrated only images that were similar in composition.
You asked a question of @Eji1700
I was pointing out that your question was answered in the first sentence of the next paragraph in their post.
It was not me to whom you were speaking, so I'll not be backing anything up, was just trying to help provide clarity.
Ah. You didn't add anything to the actually important part of the message, so my mind sort of assumed nobody apart from the author would see a reason to respond, my fault.
I was genuinely trying to be helpful. But had nothing to add. Sorry that this led to confusion.
What if I started a blog where I analyze and critique NYT articles? Suppose for each article I excerpt and respond to the text paragraph-by-paragraph. And let’s say I break up my content so each article’s critique is in three parts, three blog posts, so no single post contains the full content of the article. Though arguably the full text could be reconstituted from my site.
I’m reading the NYT articles for free, legally. I’m monetizing my blog with ads. Is this legitimate? How is this scenario different from OpenAI’s, legally speaking?
The biggest difference is probably in the first factor of fair use, which is that an AI model is more transformative than direct quotations would be.
It is plausible that somebody could read your blog instead of reading the original articles. However, it is not plausible that somebody could craft prompts to retrieve the latest NY Times articles. It is not a suitable replacement for the original.
See for example the judge's objection in the Sarah Silverman book case, who wrote: “There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.”
There are situations where a text is printed verbatim, typically when it's been overtrained due to duplication. Though that's a little more complicated, and generally requires prompt engineering to target those overtrained texts. In most cases, these models are wholly different products than their source material and are likely to meet the first (and generally most important) factor of fair use.
All that said, your blog may also be considered fair use under the first factor! Commentary and criticism are generally valid use cases as well. It's just for different reasons, in this case.
What if I started a blog where I read NYT articles and then paraphrase them. I'll change the text enough that it's not direct plagiarism word for word, but the sentiment and research is coming entirely from that one source that I'm not paying for.
I don't have an answer for this! It's, to me at least, a genuinely good question. If a human did what LLMs are doing in regards to processing text to respond with later, would you be able to get away with it? Is it fine if simply the words are different?
Fair use involves transformation: as long as the paraphrase fundamentally changes the product so that it doesn't undercut demand for the original, then it's fine... in theory. You'd still probably have to argue that in court, though.
Now, if you added your own personal experience and minimally used direct quotes from the article, then you'd have some ground, even if your 'experience' was 'long-time NYT reader'.
Yeah I think if something like this went to court it would heavily depend on how transformative the paraphrase actually was -- so it would, like most fair use, be very case-by-case.
Does it matter if you're paying to access the articles or snagging the paper off a neighbor's porch? Metaphorically speaking.
IANAL, but I would say it's on the line at best. If I were the copyright holder, I would make the argument that you didn't need to quote that much to make the point on any given article, and that, combined, you've sampled every bit of the work.
I’m surprised at the number of upvotes you got here, this is a very piracy-friendly site. I get absolutely roasted any time I even suggest that piracy is maybe less than ethical.
I made no such suggestions, at least I didn't intend to.
My point was "the establishment" goes after people tooth and nail for downloading entertainment without paying, BUT somehow Silicon Valley seems to be promoting a sense of entitlement that they can take whatever they want without paying. In other words, a double standard.
Err, so which is it? Should pirating be legal for everyone or for no one? Or should it only be illegal for OpenAI?
I'm not going to dive too deep into this piracy talk, but there's a distinct difference between a person pirating for personal use (for themselves and others) and a company/group (or potentially an individual) pirating for profit.
If pirating is illegal for ordinary people, then it should be illegal for tech companies.
From the article:
This is all this topic ever boils down to.
The statement from OpenAI about needing copyrighted material is technically neither surprising nor controversial. IANAL, but based on my limited experience with fair use for my own art, it seems like AI models are OK here as far as lawsuits from content creators go (assuming they're paying for access to the copyrighted material where applicable).
The open issue is if fair use or copyright law should be amended, and that seems more likely to play out in legislation than in the court.
It still seems not to be completely settled how to interpret this within fair use. But even so, I personally think it is fair to argue that this doesn't seem to be the original intent or "spirit" of the fair use term. It was made for things like being able to quote a book in a review, or use small samples from movies in a song, or whatnot. Before AI models, such use was limited in scope and impact. Now we have machines that just process practically everything every human has ever created, enabling big companies to resell and centralize it all for fun and profit. That is pretty far from the original goal of copyright: protecting individual artists and their work.
I agree.
In (my understanding of) the intent of fair use, party B wants to use the work of party A in some minor and creative/derivative way. The law wants to protect the ability for party A to be paid for their work, without limiting the creativity or potential for party B to create value.
Medium/long term, AI obviously hurts artists, as you said. But short term, it's still early days and artists aren't seeing damage yet. So I think right now it makes sense for lawmakers to write out what we want to happen from here, rather than for courts to try to make case law while reality pulls the intent of the laws ever further from how they are written.
The issue is that copyright is far too all-encompassing and is fundamentally at odds with the way human culture reproduces itself. Nothing that goes into training these models is different from how human beings learn about the same things. We just don't pretend that a copy of something in a human brain is infringing.
At the moment. I'm sure some lawyer at Disney just got a boner and doesn't know why.
I really wish this all ends with saner laws around intellectual property, but I have to assume in reality we'll be left with an even worse state than what we started with.
I'm not sure I agree with this analogy. If a human views some art, it's only consumption. If a human is inspired to make a unique work based on the art, that is certainly legal (thinking about all of the Van Gogh style objects for sale as an example, even though copyright doesn't apply there).
If a human saves some identical representation of the art (screenshot or picture), even if only for personal use, that would technically be pushing the boundaries of legality. If a human used pieces of that saved image to create their own work, I also don't believe that would be legal, especially if they're profiting from it.
So my personal view on this is inspiration vs copy-paste. I suppose it's up for debate on which of these executions AI uses, but I'm inclined to believe it's somewhere in the middle (or perhaps a bit of both).
Another example could be a video game. Let's say I (legally, even) have access to the source code and assets from a video game. Would I be allowed to change the colors of some of the assets and release it as my own game? Even if I was, is that even remotely ethical?
This isn't really true. By and large, it can be legal if the resulting work is transformative; this is part of what fair use is intended to cover. Whether a specific case is fair use (and thus legal) or not is decided on a case-by-case basis by the courts (and whether they're profiting from it or not is only a small piece of how this is assessed). But collage art and blackout poetry definitely aren't illegal per se, which this would imply.
I made a much longer comment elsewhere in this topic on whether AI training is fair use of the copyrighted training data. The short answer is we don't know yet because there are arguments either way and it hasn't been decided in any courts yet. But it is noteworthy that it exists when we draw comparisons to human creativity, even when we note differences between human and AI creation on both a legal and philosophical level.
In any case, I don't think that human inspiration and LLM training are analogous for other reasons. They're simply not that similar underlyingly. This is especially true when one understands LLMs on a technical level and the differences in how they acquire language compared to human children. But even without getting into any technical details, on the most abstract/philosophical level, a human mind exists without exposure to any particular inspiration. You could never expose a human to any creative work and they would probably still be capable of creating something inspired by their own experiences if nothing else. Whereas these models are comprised of what they've learned from this training data and do not exist without it. Even if using copyrighted works to train AI is totally fine, I think comparing it to human creativity betrays a poor understanding of both.
I was with you until the last paragraph, where I think you're falling into a common pitfall of indirectly underestimating AI by overestimating humans.
Humans, like AI, are the sum total of their experiences. The way these models work is fundamentally much more similar to how our brains work than even most people with a better-than-average understanding of how the AI works tend to think, because they don't understand as much as they think they do about the way the human brain works. The similarities are, frankly, concerning. We're dealing with forces that are more powerful than the people working the most closely with them tend to realize.
Now, do we have a fully working general purpose AI brain yet? No, but we have a pretty good visual and language cortex for it.
A much better one than people who have a good understanding of AI but a weaker understanding of the brain, or vice versa, tend to think. We're making the building blocks of a true general AI and the people who should be most conscious of that are the most likely to dismiss it, because they know how simple the building blocks -- if not the connections they make -- are in the AI, but don't realize the same is true of the human brain. The simple fact that we don't really understand what these models do with the training data should be concerning enough, but there's this attitude that brushes aside that uncertainty about the emergent complexity of these systems just because the starting point is simple.
I honestly had to click through to see what I'd originally written because I largely agree with what you say here! The one thing I would push back on is the notion that humans are exclusively the sum total of their experiences -- I don't think that's something that's really well-established on a scientific or philosophical level. Certainly the sum of our experiences is an absolutely huge chunk of what makes us who we are, but the nature vs nurture debate has not been solidly resolved as 100% nurture.
My point in that last paragraph was not to undermine the emergent complexity of these large models -- they are crazy interesting and complex, and I think we should try to take an interest in what structures emerge within them (despite how difficult it is to actually observe those structures, unfortunately... I'm very pro-AI explainability research!) But just because the human mind and AI both contain extreme complexity emerging from simple building blocks, that doesn't mean either those simple building blocks OR the resulting complexity work the same way underlyingly. Despite the name, neurons in a neural net are not actually so similar to human neuron cells -- they were inspired by the concept of how human neurons work, but the comparison only works on a high level and breaks down the more you know about human neurobiology (or so I'm told -- I've had to rely on other people who know more about that telling me so!)
But in any case, I think you're extrapolating past the point I was trying to make in my last paragraph, which doesn't rely on any argument about AI's fundamental complexity. It's merely that humans and their creativity can demonstrably exist without any relevant input data -- unfortunately, we know this for certain. The unfortunate Genie, a highly abused and potentially intellectually-disabled child who spent the first 13 years of her life without any exposure to language, beaten whenever she made noise, could still tell sophisticated stories using pictures. This stands in contrast to any ability an LLM has to tell a story or a visual model has to create art, which is 100% based on their training data (and indeed, requires FATHOMS more of it than a human does to learn the same skills), even if it results in fascinating emergent complexity.
I don't say this to diminish how capable and complex these models are, but just as an example of ways in which human creativity differs from the output of an AI underlyingly -- there may well be something deeper underlying human creativity that extends beyond our own internal equivalents of a language model. I'll leave studying that up to those in that field. Perhaps someday we'll make a more general AI that models cognition in a way that parallels this, allowing it to create even when it has been deprived of input. But I don't think it's disputable that current AI models are not like that.
IANAL, but I like to poke at these things as if I am self-defending. Would love to have a real lawyer evaluate my thoughts.
From what we've seen so far....they're not. They're banking on fair use to shield them from needing to pay for the copyrighted material, the way it was granted for search engines.
My understanding is that copyright law by and large is a creators-first law: Unless covered under exemptions like fair use and first sale, the copyright holder has all the rights. It doesn't forbid training, but it also doesn't allow training. Hence why they're banking hard on fair use...it's the only way they could get away with it without purchasing at least 1 copy of everything.
The real critical thing: fair use in one context does not necessarily translate to fair use in another context. The four main factors of fair use (in the USA) are the primary determinants, and while case law is often cited... everything is fungible. And it's entirely possible that we may find text-based generation to be fair use while finding image and audio generation not to be. That's how messy copyright is. Below are the four factors (statutory text quoted), with my commentary on how I think they'll be applied.
> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

So, this is certainly a commercial use, making it much less likely to be fair use. However, I'd say it's definitely a transformative use, which makes it more likely to be seen as fair. I think it will come down to whether or not it substitutes for the original use of the work. And I think there's a solid case to be made, especially by the New York Times, about how an AI trained on its data could reduce demand for NYT copyrights.
Diverting to art models for a moment: if the training data consists only of copyrighted cartoon frogs (CCF for short), and someone says "give me a cartoon frog", that almost certainly reduces the demand somebody might have had to license a CCF. While traditionally the original artist would have had to show harm directly to their own work, the companies' argument that they need everything might give an opening: since they need all CCFs, any generation of a CCF reduces demand for the collective input CCFs. And expanding that to the likenesses of actors and musicians (and the works they create)... there is a messy unsolved space that OpenAI may well have left an opening to. That's how I would try to attack it if I were self-defending... and I'm betting the NYT legal team is gonna as well.
> the nature of the copyrighted work;

This one is almost a direct strike against OpenAI (and others), and I think they've basically just conceded it.
> the amount and substantiality of the portion used in relation to the copyrighted work as a whole;

So, we know they consume everything. That on its own is worth mentioning, but we know they generally won't spit out an entire copyrighted work. Still, the "heart of the work" aspect of this factor could well be a downfall, if eventual (not necessarily current) models are able to spit out assorted "hearts of the work" at the will of the user.
> the effect of the use upon the potential market for or value of the copyrighted work.

And this one, particularly the "potential market" language, is very interesting. I'd say it's reasonable to argue that it's probably not harming the existing market (for now)... but I do think it potentially has massive impact on future work. It's easiest to think about this for musicians and actors. If someone can just spit out Matt Damon's voice saying whatever they want, that very much harms Matt Damon's future prospects for work. Since the models would not be able to imitate Matt Damon without consuming works Matt Damon certainly has some degree of copyright claim on, it follows that he may be entitled to royalties for those source works whenever someone prompts for a clip of "say this like Matt Damon".
And again, because this is now taken in the aggregate (because they can't do this without it all), the question I would posit to the court: does consuming all copyrighted works reduce the demand for any copyrighted works? Because if so, there is provable harm; it just becomes hard to prove who is being harmed and by how much.
Fair use is almost entirely court proceedings. Law being amended might be legislation, but that's not gonna happen until a lot of court proceedings shake out.
Somehow, I suspect, all the people eager to flame AI and see it banned would still want it banned even if it could be 100% documented that the model was 100% open source.
It is not a violation of copyright to read something, consider and think about it, read other works and consider/compare/think about them in growing total, and then (sooner or later) use your conclusions to create something of your own. Replace "read" with "draw", "paint", "perform", "sing", "compose" or whatever else you want.
You can't ban technology from evaluating and analyzing work (be it written, drawn, painted, sung, whatever) without also banning that for humans.
I asked before, and no one responded. Write the law that says "technology can't analyze art." Go ahead, someone write that law. Let's see the text of the law. But ... the law can't make it illegal for a human to consume art.
And it's not as easy as saying "technology bad, human good."
What if I run a search algorithm through the text of a book? Did I just infringe, simply because I wanted a word or phrase distribution or some other data array generated? What about search engines, do they infringe when they analyze text, images, sound? What about an ereader or a web browser; do those infringe?
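To make that concrete: the sort of "data array" I'm talking about is a few lines of code, the kind of thing any student writes. A minimal sketch (where "book.txt" is a placeholder for any legally obtained text):

```python
# Minimal sketch: word-frequency distribution over a text file.
import re
from collections import Counter

with open("book.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Print the ten most common words and their counts.
for word, count in Counter(words).most_common(10):
    print(word, count)
```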
What about a professor, or a self-learning artist seeking to improve her craft and her understanding of it; do they infringe when they sit down to read (or view, listen, whatever) many, many, many works in the artistic field they're most interested in? Does it mean they can do that, but only longhand on a pad of paper? What if they want to use Excel? Or some other software?
Does the housewife who loves romances, and dreams of writing one of her own and becoming the next Danielle Steel, infringe when she obsessively reads every romance she can get her hands on? When she takes detailed, meticulous, indexed notes of everything she can figure out about the books? She tracks every detail she can about characters, settings, about pacing and conflicts. Monitors how long the books spend before the would-be couple meets, when they first fight, how long they stay broken up, how close to the end before they get back together?
What if she does all that on a computer? What if she has the computer help her search for keywords or phrases? What if she's a programmer by day, and codes up some analysis routines to help her learn just how these other romance writers did it?
Are the stories she writes, striving with every bit of mental acuity she can bring to bear to emulate what she's learned and concluded from her research, infringing on those she draws upon?
And she can do all of that without paying the authors a single cent. Libraries. She could be one of the best customers of her local library, and those authors would never have gotten any payment from her while she reads and researches their books.
Hmmm, I wonder, can an AI researcher check books out? Can an AI researcher take a camera to the local museum that allows cameras? A rock concert? A spoken word performance? Can an AI researcher hook their computer up to the same internet feeds you and I use, daily? I can read the New York Times if I want, same as you. I can scroll through pictures and anything else available on the internet.
So can the computer.
What I can't do, what you can't do, what the computer can't do, is sell copies of what it reads (or views, listens to, whatever). That's a violation of copy right. The right to copy. But that right only applies to copies. Exact copies. Identical copies. Not "sounds like" copies, not "kind of similar" copies. That exact story, that exact image; that's what the copyright's on. A specific tangible work.
But ... I can do exactly what that housewife does. So can you. So can anyone, including the computer. I can read my favorite genre until I can quote the top twenty books by heart. And rattle off the specific details of the next eighty to round out the top hundred. Then I can sit down and write my own story in that genre. And, if they're words I came up with, words I generated and put together in the order I did on the page ... it's my story.
It has no legal weight or relevance that I sat there with those other authors' stories strong and fresh in my mind while I wrote. The story I wrote is one I wrote. The computer can do the same thing; generate words that form a story. Or put pixels together in a 2- or 3-D space to create an image, a sculpture. Notes on a field to create music or a voice.
Name an artist who learned in isolation. Who never, ever consumed someone else's art. Who never looked at another painting, listened to someone else's song, who never read stories. An artist who just sat down in the void they inhabit, and busted out art.
Oh wait, all artists consume art. The art of others. That art influences them. Well gee, those assholes must be stealing from those other artists. They should be sued.
AI does nothing humans don't do already. It just doesn't.
Unless she pays for it or otherwise legally obtains it via first-sale rights (a la libraries or used book sales)... yes. If she scans that physical book in, then sells or returns the book, that scan is a violation. If she borrows a digital book from the library... the library is paying that license on her behalf.
The computer cannot check out books from the library. It does not have the ability, of its own volition, to apply for a library card. The AI researcher could, but transcribing the books into another format or breaking the DRM to add them to a training dataset would be a violation.
These are not intelligent beings. They are computer programs, doing exactly what their programmers told them to do. It was the programmers (OpenAI) who potentially violated copyright... nobody in their right mind is arguing that the LLM itself is doing it.
OpenAI (almost certainly) took a bunch of illegally obtained books and fed them to an LLM. OpenAI is claiming this is fair use. It's not fair use if I download books3 and read nearly every copyrighted book in existence.
It's fair use for Microsoft to crawl the NYT to index everything. It's questionable whether these perfect copies, used outside the original scope of providing search results, are fair use.
I suspect that OpenAI is saying this to set up their fair use arguments. In particular, arguing that the public interest in the existence of LLMs outweighs the harm to rights holders.
Here is a longish article going into some of the legal details of fair use with AI training data:
https://www.thefashionlaw.com/ai-trained-on-copyrighted-works-when-is-it-fair-use/
I also suspect that future legislation will deal with these issues eventually. I believe the EU is looking at laws that would allow training on copyrighted materials for research and education, but not commercially, though who knows how things turn out.
Personally, I'm skeptical of OpenAIs claims, and how they have downplayed the extent to which these parameter sets can act as a sort of compressed version of the data. I don't see why, on its face, they should be able to keep a complete encoded store of other people's works against which to charge money for queries by an inferential algorithm. Even interpreting the technology as generously as possible, it is charging money for queries against a very detailed map of latent variables expressed in the works of others, without any prior consent.
Coupled with the fact that OpenAI almost certainly spoofed search engine indexer user agents to get the data in the first place to bypass paywalls, it's hard to argue that they have clean hands in this.
Public interest has been thrown to the wayside by rights holders through special-interest copyright legislation. In the last century it's most commonly associated with Disney, though there were earlier copyright laws that served people with special standing, wealth, privileges, etc. At one point in the past, a copyright was 14 years, with an option at the end to extend it for another 14 years, and that was it. The whole point of copyright should be the public interest, not rights holders.
Copyrights should last something like 20 years, give or take a few if you want to argue, not what they have become now. I'd find it interesting to see what an LLM could and couldn't do if the training data were all 20 years old. If our copyright legislation weren't so totally broken, it would certainly narrow the scope of what OpenAI could argue is fair use for this use case; as it stands, they happen to have a valid point behind their argument, even if their usage may not fit the spirit of it.
The logic behind patents lasting 20 years while copyrights last longer than most individuals' lifetimes is mind-boggling to me. The ability to copy something for free, at no cost to anyone, should be seen as a massive win for humanity. Imagine if we could replicate food at no cost, whatever amount of food we needed. Obviously, if you consider externalities, such as increasing overpopulation even further, there are possible downsides, but my point is that the actual ability to do it would be a miracle. You could solve for the external factors in other ways. Simply denying people the ability to do it because it might prevent someone else from making gobs of money farming is just nonsense.
Now, when it comes to something that isn't seen as necessary for survival, like various intellectual "properties", then sure, it makes sense to balance public interest now against public interest in the future, because that's what it is in the end; it's all public interest. Restricting someone from doing something that doesn't harm anyone needs to be justified by public interest. We're using public resources (the government) to enforce such laws. Only in an extreme capitalistic worldview would people assume there's some inherent right to monopolize something to extract maximum profits regardless of whether there is a public interest in it or not. Limited copyright terms are in the public interest, but lifelong or century-long copyrights are not.
I think it would be difficult for OpenAI or others to build the same level of product in a world where copyrights were 20-year terms; they'd be going back to pre-2004 mining of forums and whatnot to build a base dataset, which in many ways wouldn't be super useful today. But it would give more than enough data to make the system functional and logical, meaning it could still string together coherent sentences even without copyrighted data, and it would potentially make paying for specific datasets actually feasible, because they'd have a 20-year gap to fill to make the model more relevant.
This is a really interesting point. A lot of the dialogue is around creators and their rights: an artist is entitled to their work for life (plus additional years), and OpenAI is violating their rights by using their work without permission. But are those rights good for society as a whole?
(I suspect no. Ideally, we would return to short copyrights, but I don't have much faith that Congress will do that.)
But supposing we did have short copyrights again… personally, an LLM trained only on 20-year-old material would still be tremendously useful to me: maybe it wouldn't know the latest slang, or what a "smartphone" is, but it would still know enough of the English language to function as a writing assistant.
Copyright creep and questionable lawsuits are absolute poison to music. Music has fundamental laws, and concepts like genres depend on intentional similarities. But we now have high-profile lawsuits over things like chord progressions and ostinatos obvious enough that they exist in classical music.
In the 80s and 90s, the advent of sampling hardware created an explosion of creativity (the dawn of house, techno, hip hop, etc.) that was rapidly clawed back by lawsuits over musicians doing what music is about: respinning ideas into new ones, the same as any form of communication. And we are that much culturally poorer for the blatant theft from us that the chilling effect of copyright created.
Copyright gives a few a fiefdom and steals rights from everyone else. Your basic right to expression is being infringed upon, so the 1% of artists (and their grandchildren) can rent-seek forever.
On a similar train of thought: Has OpenAI (and the protectionism of its inevitable profitability) done more to damage the copyright and likeness rights than piracy ever has? And what is the real fallout of that damage?
I think there are a LOT of things that are bad about current copyright law in ways that benefit huge corporations more than small creators whether the huge corporations are acting within the law or not. I'd honestly be for a HUGE opening-up of copyright law if not abolishing it altogether. But I think when it comes to cases like this, people are mostly speculating on how they'll turn out and their impacts on the current landscape, which relies on current copyright law and its interpretations.
Right; my entire post was speculating about how things will play out with the courts with the laws as they are today. Not that I'm opposed to a philosophical conversation that other replies seem to want, but how I think things should be in an ideal world is not the same as how I think things will play out in case law.
People also seem to love being obtuse about my user agent reference rather than engaging with the statement of intent to bypass content protections. 😂 Internet's going to Internet.
Have a great day!
I suspect this will be a contentious stance, but by no means would I consider user agent spoofing to be “dirty hands.” There’s nothing unethical about crafting your own HTTP headers.
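To illustrate how mundane "crafting your own headers" is, here's a minimal sketch (the URL and agent string are hypothetical placeholders; `requests` is the usual Python HTTP library):

```python
# Minimal sketch: sending a request with a self-chosen User-Agent header.
import requests

headers = {"User-Agent": "MyCrawler/1.0 (+https://example.com/bot)"}
response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```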
Intentionally bypassing paywalls to harvest others content for commercial use doesn't sound clean to me.
Spoofing your own for non commercial purposes seems entirely different.
Eh. User Agents have been considered completely unreliable for a long time. There’s enough crawlers reporting as users and users reporting as crawlers and other misrepresentations at this point that it’s a lost cause.
Chrome the browser, for example, reports itself along these lines (a representative string; exact version numbers vary by release):
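```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
```

Note that it claims to be Mozilla, AppleWebKit, and Safari all at once.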
Then there’s all of the user-initiated changes with privacy extensions or manual modifications.
The robots.txt file is also simply a suggestion.
Whatever the means used, the point is that intentionally bypassing paywalls to copy content for commercial purposes doesn't exactly paint them in the best light.
They're wrong. The MIT license is the most popular software license in the world, and it is the most common license for projects on GitHub. Content on Wikipedia and the Stack Exchange network is CC-BY-SA, and images on Wikimedia Commons are public domain or freely licensed. These all permit usage by essentially anyone for any purpose, so long as the license terms are followed (attribution, and for CC-BY-SA, sharing modifications under a compatible license). And these are massive datasets.
What OpenAI has done to get their hands on their datasets is explicitly in the wrong: Twitter, Instagram, the New York Times, Getty Images, etc. all have provisions in their terms of use stating explicitly: "you are not allowed to scrape content without prior consent [of the company]". I don't know what legal backing websites' "Terms of Service" have, but OpenAI very, very much broke them.
Tangentially, I find it to be such a shame that much of the individual pushback against these soulless corporations sucking up people's life work for profit has focused on reinforcing our existing bullshit intellectual property laws rather than accepting that they are harmful and cut us off from our own culture, and not put thought into shattering the copyright system and embracing the better, already successful ways to fund art and artists (Patreon, Bandcamp, Itch.io, and the like).
Well, that's been a long time coming, since we transitioned from a morality-based society to a law-based society eons ago. Weasel words like "but is this a crime, really?" or "does it really say....?" or "surely it doesn't include...." have been around probably since the time we could reach for fruit that wasn't allowed.
What we COULD do, though, is to fix existing laws to make them less BS, and I believe that is being done all the time :)
(False dichotomy alert)
The alternative is a society with super-vague "laws" like "you can't perform actions that may hurt the country" or "it's illegal to act on what may be foreign influences": they are so broad they can only be enforced through fear and bullying, and the fear is part of it all the time. Everyone could be found guilty at a whim, because even if it's not written down, "you should have known better" is a guilty verdict.
Imagine if any other defendant tried to justify their infringement of a plaintiff's property rights on the basis that the infringement was "necessary" so that the defendant could secure a commercial advantage!
Tech companies are really really good at rhetoric and dressing old ideas up as new ones, which helps to mitigate negative pre-conceived notions people hold about the old ideas.
Your honour, only the tiny hands of malnourished orphans working in dark caves can weave these rugs with such intricacy: it's a necessity literally woven into the fabric of these rugs. Feel how luxurious this one is; it blinded seven orphans, after all. Or do you prefer this one, which blinded ten? (Idea from Margaret Atwood's The Blind Assassin)
What rubbish. If it can't exist without violating the rights of others, then go back to square one and start paying for training datasets. That would solve all kinds of other problems too, like the naked-Asian-women problem and hate-speech bias, and you could even get hands right if you had a large dataset trained specifically on paid models' hands, captured with tracking dots and tagged with specific gesture, angle, and lighting settings.
https://huggingface.co/Mitsua/mitsua-diffusion-one is a generative AI model being trained from scratch using only public domain and voluntarily opted-in material. It's still a work in progress and not yet very good, but I think it's great that they're at least trying to create an ethically sourced AI art dataset.
I was in one of the closed beta tests of OpenAI's DALL-E 2. During onboarding we were told that the AI art generator was strictly "for research and non-commercial use". A few short months later, it became a fully commercial product.
Corporations using the "fair use" defense to take creative work en masse, without consent, and then using it to create commercial products that displace the very creators of that work isn't very "fair use" at all. Fair use is meant to benefit the public. But what's happening with AI is the reverse: corporations extracting value (data, creative work) from the public, then reselling it back to the public and privatizing the profits.
Info law scholar Ben Sobel has an excellent paper about the inadequacy of the fair use doctrine for handling AI art issues (because AI issues involve not only copyright but also data privacy), and in it he explains how AI training differs from the Google Books fair use defense:
In the same paper, he wrote about the distortion of fair use in today's digital economy:
AI that is trained on the commons should give back to the commons as well. I would support laws requiring commons-trained AI tools to be free and open-source.
I would also support laws requiring commercial, corporate-owned AI to properly license material used for training. If corporations consider copyrighted work to be valuable enough to be necessary for training, then they're valuable enough to pay for. This would respect artists' consent, but also privilege corporations that already own large IP collections. So I would also like to see copyright law adjusted to prevent corporations from hoarding IP, especially for a long time after an artist's death, as they do now. The default copyright term can also just be shortened in general. I don't see why it needs to be so long after the artist's death, since copyright's main purpose should just be incentivizing and rewarding the artist.
Going beyond copyright, I would support laws strengthening our rights to data privacy, and reversing the current trend of corporations harvesting our data and using it in ways that maximize their profit, to the detriment of the public good. Here's an article about how data harvesting perpetuates systemic oppression. For example, RELX’s LexisNexis products have been used to surveil protesters and immigrants. Btw, Stable Diffusion's lawyer is Mark Lemley, co-founder of legal data analytics company Lex Machina which was acquired by LexisNexis in 2015.
IANAL, just an artist with personal experience pursuing infringement cases.
Often what's considered impossible today becomes possible in the future. This is a pretty childish and laughable defense. Hope NYT wins.
Worth mentioning that the whole conversation around LLMs seems to neglect the 50+ year history of technologies being labeled "AI" in their early stages and no longer being considered AI once they become well-trodden and better understood (e.g., speech recognition, natural language processing).
'Impossible' to run this fast without steroids, Ben Johnson says.
Sorry to necropost, but I feel like I'm alone in my opinion: I don't really care. I think existing copyright systems for the most part don't actually benefit society, just the copyright holder, so to me all the discussion about whether it's legal just isn't interesting. Is it moral? That's certainly a question to answer, but in my mind, if it furthers the development of a technology like this, I'd be inclined to give it a pass. For research, I think this should be allowed.
And an additional thought: even if it’s not legal or moral, people had to have known that it could happen. I mean, the internet is open to anyone.
Still hoping someone will go the IAN route, I want to talk to Elohim or Milton.