62 votes

The New York Times sues OpenAI, Microsoft over the use of its stories to train chatbots

42 comments

  1. [31]
    updawg
    Link
    Gifted link to the NYT's story, in case you're interested in what they have to say:...
    19 votes
    1. [30]
      rish
      Link Parent

      In one example of how A.I. systems use The Times’s material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times’s product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.

As per this, the Bing AI bot has actually caused financial loss to the NYTimes by removing their referral links.

      47 votes
      1. [28]
        arqalite
        Link Parent

        Would a good lawyer be able to make a convincing case out of this? My layman brain seems to think so.

        Personally I'm split on this for a myriad of reasons, and honestly don't want either party to lose - but under the current copyright system this is very much blatant copyright infringement.

        In my ideal world, content creators and AI firms would figure out a solution to have the authors compensated fairly while also allowing models to scrape data freely, but that's never going to happen.

        10 votes
        1. [22]
          DavesWorld
          Link Parent
          • Exemplary

          It's not going to happen because there are a lot of "authors" who seem to think the fact their text was read and analyzed seems to guarantee them instant royalties. That the fact that (insert some AI here) analyzed their text means the (AI firm) owes them money.

          Which is not, at all, how it works! It just isn't.

          Do professors owe writers royalties for sitting down and painstakingly analyzing a body of works? Does a housewife who obsessively reads a genre, who begins taking notes, and then uses those notes and her experiences reading hundreds of those books to write one of her own owe royalties?

          Reading text, even reading it closely and carefully for the purpose of analysis for later use in your efforts to write like that text ... is not a copyright violation. You're not copying it. You're studying it. Study is not a copyright violation. Neither is reading.

          Art must be consumed if it's released. That's the point of releasing it out of the author's control; for others to consume it. The act of reading it, viewing the picture, listening to the song, whatever ... that's consumption.

          The fact that someone is obsessively consuming someone's art so they can go on to create their own art like it is not a copyright violation. It's what humans have done since the first caveperson scratched something in the dirt that some other cave dweller grunted favorably at.

          The modern attitude seems to be that the first caveperson who scratched a picture on the cave floor owns the concept of cave drawings. And had every right to seize a club and bash in the skulls of the other cave people for daring to violate the ownership of cave drawings. Only that first caveperson has the right to scratch in the floor dirt and make others smile.

          The Times doesn't own the concept of news, or product reviews. They own the license to the specific articles on their site. Not articles "like" those. Not articles "similar" to those. Not articles "in the genre" of those. But, in fact, those specific articles. Not any others. And certainly not any that someone who might have analyzed the shit out of the Times before writing might have written.

          It has occurred to me, as I've watched mostly European countries levy huge fines against international tech companies ... it's like a new form of taxation. They don't want to actually pass laws and revise their tax codes; they just want what they perceive as their cut. So they find a reason, levy charges, and demand payment in what's quickly become the billions from these global conglomerates, and enjoy fresh revenue they didn't have to piss off their citizens to get at.

          This is the Times doing the same thing. Except the Times doesn't have armed forces standing by to enforce a judgement. Thankfully, we're not quite to the cyberpunk future yet.

          Yet.

          Oh, wait, I just violated William Gibson's copyright probably. He owns cyberpunk doesn't he? Or is it Mike Pondsmith? Philip K Dick? Bruce Sterling? Guess I'll find out when one of them sues me.

          21 votes
          1. [9]
            sparksbet
            Link Parent

            That the fact that (insert some AI here) analyzed their text means the (AI firm) owes them money. Which is not, at all, how it works! It just isn't.

            You've framed this as though this is a settled legal question, but it absolutely is not. Whether using a creative work as training data for an AI model is fair use or not IS NOT SETTLED LAW, and unless it is fair use or otherwise found to not be infringing on the work at all, there is indeed some form of copyright infringement happening. Especially given that OpenAI provably acquired these materials without purchasing a single copy of any of the books, because they sourced their books training data through piracy.

Your comment is framing this as though the New York Times is suing over the AI being able to generate news articles, but that simply is not true. This case is particularly interesting because OpenAI's models were verifiably trained on articles from the New York Times, and people are now using the resulting models instead of the Times when looking up information -- maybe before I'd google something and read the relevant Times article, but now I just ask ChatGPT and it summarizes essentially that same article. This makes it interesting from a fair use perspective, since the effect on the original market for a work is one of the factors in a fair use analysis, and imo that makes this a much stronger legal claim than the authors suing over their books being used as training data.

            39 votes
            1. [4]
              vord
              (edited )
              Link Parent

              IANAL. Like, if a photographer approaches me, and asks to take my picture, I'd say yes.

              If said photographer then used that photo of me to then apply my face on many other things, say hardcore pornography or on billboards, I would be quite displeased. I may or may not have a legal case, but at the very least the photographer is an asshole...and if enough assholes piss enough people off...that's how laws change.

I think copyright holders stand a chance, for one simple reason: If I purchase a DVD, me playing it for some house guests is fair use. Me taking that DVD and then playing it in a high school auditorium and charging a $5 admission fee is not. Typically speaking, purchase for individual use does not translate to purchase for widespread use. Libraries are not permitted to buy one book, scan it in, then distribute the scan freely.

The companies can try to wash away what they're doing, but it's essentially mass copyright and licence violation in service of selling a derivative product.

Stuff licensed under Creative Commons non-commercial licences, IMO, wouldn't have a leg to stand on if the derivative product was non-commercial. But once they start charging fees for use, those CC licenses might be able to make a case.

              I would say if a copyright holder, in any way, is able to create a prompt or series of prompts which dumps a non-trivial amount of a copyrighted work....they've got a case.

              I think it'd be good for copyright enforcement to be applied to AI training. Worst case, creators get paid. Best case is some copyright reform comes out of it.

              12 votes
              1. [3]
                sparksbet
                Link Parent

                I think the main legal argument for training using copyrighted work isn't that it's not copyright infringement (because I do think your parallels hold there) but that it's fair use for the same reason research or a search engine is. But IANAL so it's possible people are making other arguments as well.

                I may or may not have a legal case, but at the very least the photographer is an asshole...

                Fwiw, you definitely don't have a copyright case in that instance. Photographer unambiguously owns the copyright. There are other laws out there designed to be used in scenarios like this though, and that's the direction I see stuff like this going -- rather than an overhaul of copyright law, addition of other more specific laws surrounding the issues at hand.

                6 votes
                1. [2]
                  vord
                  Link Parent

Oh I wasn't suggesting copyright wrt the photographer, exactly... more to just demonstrate there is a path to be formed between "consent to use in one context does not necessarily translate to another," which definitely exists for copyrights.

                  Search engines get a pass, in part because they are also forced to comply with fair use rules when outputting results...there was a time Google News just gave you the entire news article rather than a short blurb and a link to source. That is the result of a lawsuit.

Here are some examples.

Generally medium-length excerpts from books, like a few paragraphs, are ok, but a whole chapter is not. In the case of news articles, I could see any more than a handful of sentences violating the principle.

                  A common theme is that it's not fair use if it reasonably would diminish the ability for the copyright holder to profit from their copyright. Publishing half a news article can definitely cross a threshold much easier.

                  4 votes
                  1. sparksbet
                    Link Parent

                    Yeah one of the principles of fair use is only using as much of the material as you need, but I'm not sure how well that'll pan out legally in this case... definitely interested in the legal conclusions here

                    1 vote
            2. [4]
              kru
              Link Parent

              maybe before I'd google something and read the relecant Times article, but now I just ask ChatGPT and it summarizes essentially that same article.

Isn't this what the news does, though? I see all the time on local stations, stuff like: "The NYT published a bombshell piece today. Here to talk about all the juicy details is our own so-and-so." <cut to talking heads for the summary and commentary>

              4 votes
              1. [3]
                sparksbet
                Link Parent

Yes, but the difference is that the local stations are generally quoting and crediting the NYT, and then potentially adding their own commentary. This is not the same thing as what a generative AI model is doing (especially in Bing's case where they're alleged to remove attribution/links back to the NYT iirc?) and it could definitely be ruled that the way these AI models are trained on NYT articles and then essentially regurgitating the content when asked falls outside the bounds of fair use. It also could absolutely go the other way, though; this is a very unsettled area of law, and fair use is deliberately very case-by-case anyway. I just think this suit has a better foundation than the authors suing them for training on their books, because no one replaces reading a particular book with asking ChatGPT to generate a book (even if it can make one in the style of that author only because it trained on their work).

                11 votes
                1. [2]
                  DefinitelyNotAFae
                  Link Parent

                  And, those local stations also generally are paying for a news source like the AP or Reuters which is licensing that work (IDK about the Times). If the NYT does something newsworthy they'll report on that too but they're not just rephrasing an article.

                  7 votes
                  1. sparksbet
                    Link Parent

                    Yeah I think it's a fundamentally different issue than another news source reporting the same story and more similar to a search engine displaying a summary of an article before you click on the link to the news site.

                    3 votes
          2. [4]
            trobertson
            Link Parent

            They own the license to ... those specific articles. Not any others.

            My understanding of the case is that the NYT have produced as evidence cases where ChatGPT has, very literally, copied their articles word-for-word, as a result of a prompt to get around the NYT paywall. If that evidence is legit and not doctored, then ChatGPT has blatantly plagiarized NYT articles. That is illegal, according to contemporary copyright law.

            As for this stuff:

            The modern attitude seems to be that the first caveperson who scratched a picture on the cave floor owns the concept of cave drawings.

            The Times doesn't own the concept of news, or product reviews.

            Oh, wait, I just violated William Gibson's copyright probably. He owns cyberpunk doesn't he? Or is it Mike Pondsmith? Philip K Dick? Bruce Sterling? Guess I'll find out when one of them sues me.

            Nobody is accusing them of plagiarizing a genre of writing. That's a nonsensical refutation of the case.

            20 votes
            1. [3]
              balooga
              Link Parent

              My understanding of the case is that the NYT have produced as evidence cases where ChatGPT has, very literally, copied their articles word-for-word, as a result of a prompt to get around the NYT paywall. If that evidence is legit and not doctored, then ChatGPT has blatantly plagiarized NYT articles. That is illegal, according to contemporary copyright law.

              I would like to better understand this claim. I’m only familiar with the free version of ChatGPT so maybe there’s additional functionality available in the paid service that allows users to craft and execute a real HTTP request, and parse the response? Otherwise that’s beyond the purview of an LLM on its own.

              If that’s the case, that would mean the full article text is originating from that external request (i.e., the NYT’s own web server) and not from the GPT training data. That would categorize this action as paywall circumvention, not copyright infringement. I’m sure that’s frowned upon by the NYT, but it’s not actually illegal as far as I know.

              On the other hand, if the claim is that the full article text was scraped as part of the training data, I don’t understand why a specially crafted prompt would need to be used, to circumvent a paywall which wouldn’t exist in the model. IANAL.

              1 vote
              1. [2]
                vord
                Link Parent

                There is direct API access to these models, yes.

                That's essentially how all of the other "built on GPT-3/4" stuff works.

                But if I can hone ChatGPT to violate copyright by refining prompts, it kind of refutes the claims that the original training data is "lost" in training.

                5 votes
                1. Handshape
                  Link Parent

                  https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html

                  Almost exactly a month ago, a team of Google researchers published a class of attack that reliably makes LLMs poot slabs of training data back out.

                  In my professional life, I sounded the horn on what I called "localized overfit" in spring of this year. During my early experiments with some of the big commercial (and open-source) models, I was able to get them to emit chunks of their training sets verbatim, but my technique had a much lower success rate than Google's.
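This kind of memorization can be sketched in miniature (a toy illustration of the failure mode, not the actual Google attack): a maximally overfit bigram model trained on one short text has exactly one continuation for most prefixes, so greedy generation from a prefix it has seen regurgitates the training text verbatim.

```python
from collections import defaultdict

# Toy illustration of "localized overfit": a bigram model trained on a
# single short text memorizes its training data, so greedy generation
# from a seen prefix reproduces the source verbatim -- the behavior
# that training-data extraction attacks exploit at scale.
def train(tokens, n=2):
    model = defaultdict(list)
    for i in range(len(tokens) - n):
        model[tuple(tokens[i:i + n])].append(tokens[i + n])
    return model

def generate(model, prefix, max_tokens=50):
    out = list(prefix)
    for _ in range(max_tokens):
        continuations = model.get(tuple(out[-len(prefix):]))
        if not continuations:
            break
        out.append(continuations[0])  # greedy: emit the memorized token
    return " ".join(out)

article = "the quick brown fox jumps over the lazy dog near the river bank".split()
model = train(article)
# Prompting with a prefix from the "article" dumps the rest verbatim.
print(generate(model, ("the", "quick")))
```

Real LLMs are vastly larger and usually generalize instead of memorizing, but text seen many times (or with little competing signal) can sit in this same regime.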

                  7 votes
          3. Eji1700
            Link Parent

            Do professors owe writers royalties for sitting down and painstakingly analyzing a body of works?

            And herein lies the problem. We know humans analyze, and no, you wouldn't be owed something for interpreting someone else's works.

You ARE owed something if I, for example, copy all your works and keep them in my basement, which is part of the argument about what AI is doing, especially because they then use that to create value, arguably without any analyzing being done.

            15 votes
          4. [2]
            arghdos
            Link Parent

            I mostly agree, but then, the ethics of Meta et al purportedly using the “books3” database, which contains the entirety of bibliotik (book torrent site) to make a massive commercial operation is quite a bit different from your hypothetical professor doing a literature review. Honestly if it’s true, I wonder how their lawyers didn’t have a conniption.

            I generally think copyright is far too long and broadly enforced, but I have lots of bad feelings about large corporations using other people’s work without consent to create value for themselves. Particularly when they’re likely to have large influence over any regulatory frameworks, and use that to “pull up the ladder” after themselves. See: Sam Altman ‘begging’ for regulation this year

            Maybe if these datasets were under a Creative Commons non-commercial license (though how that would ever happen I have no idea), I’d be more supportive?

            14 votes
            1. nosewings
              Link Parent
              • Exemplary

              I wonder how their lawyers didn’t have a conniption.

              Their lawyers are likely well-aware of the potential risks. As always, the ethos is "Move fast and break things." In this case, the spectacle of a new technology and the adoption from major industry players serves to create the impression of a fait accompli, which makes any judge less likely to rule against OpenAI in a country that views itself as an engine of enterprise and innovation.

              10 votes
          5. psi
            (edited )
            Link Parent

            Do professors owe writers royalties for sitting down and painstakingly analyzing a body of works? Does a housewife who obsessively reads a genre, who begins taking notes, and then uses those notes and her experiences reading hundreds of those books to write one of her own owe royalties?

            I don't find this argument convincing since it ignores the scales involved. A human being cannot ingest hundreds of thousands of books, millions of news articles, and billions of internet comments and then spout off a summary of virtually any topic as quickly as that human being can physically communicate.

            But even if they could, if they pirated the materials, they would obviously owe the content creators money.

            Which brings me to my second point:

            The Times doesn't own the concept of news, or product reviews. They own the license to the specific articles on their site. Not articles "like" those. Not articles "similar" to those. Not articles "in the genre" of those. But, in fact, those specific articles. Not any others. And certainly not any that someone who might have analyzed the shit out of the Times before writing might have written.

Even if we sidestep the issues I mentioned above and assume the theory of fair use you're suggesting, it neglects the other half of the Times's lawsuit: ChatGPT sometimes quotes the Times verbatim without citation or compensation. This is a rather straightforward lawsuit alleging copyright infringement, especially if one can show that OpenAI profited from the infringement (e.g., by charging $20/mo for their product).

            11 votes
          6. CptBluebear
            Link Parent

            They don't want to actually pass laws and revise their tax codes; they just want what they perceive as their cut. So they find a reason, levy charges, and demand payment

            Well that's not entirely true now is it? Usually these fines happen because these tech companies do not follow the rules of privacy, not because of some arbitrary reasoning made up on the spot. These laws and regulations are revisions based on emerging technology.

            Tax codes are constantly in flux too, I don't think that's happening the way you say it does.

            I won't argue that it's a nice chunk of change they're probably happily accepting, but they only happen because the tech companies violate the law. If they didn't, no fines would happen. See Meta's Threads: it took months to launch in Europe because Meta is afraid of the huge fines. And that's a good thing!

            8 votes
          7. [3]
            PancakeCats
            Link Parent

            I think the real issue with your take here, for me, is that you're treating AI like it's a human in these examples, with the implication that AI can actually interpret and make original texts, art, analysis etc. I admit I'm not super well versed with the most recent developments in AI scene, but at the end of the day, it's all just 1's and 0's, and I think it lacks the actual ability to truly create something.

A human being will generally still put something of themselves into the art they produce, even if they obsessively analysed a book series before writing their own. AI cannot do that, as they have no experiences of their own to draw from, only the works of other humans being fed to it. Therefore, any perceived similarities go from fair use to theft pretty quick imo. I think the difference between a human who obsessively consumes media in an effort to produce their own work, and an AI obsessively consuming media to do a job it was asked to do, is a pretty big one that invites these discussions about whether this is truly legal or not. Just my two cents.

            I'm certainly no expert though and I'm writing this just after waking up, so maybe it's all a bunch of hooplah. Forgive me if that's the case.

            7 votes
            1. vord
              (edited )
              Link Parent

              But also, humans are creating the AI. It's not like this is springing forth spontaneously from the cesspool of creation.

The AI is an extension of what the programmers creating it are doing. There is, somewhere in the cogs of the tech giants, a human that made the choice to include copyrighted material and chose to ask forgiveness rather than permission, which gives a bigger advantage to incumbents. Which is part of why so many companies play fast and loose with the law...but that's a much bigger discussion.

Which is not the wrong choice per se...punishments tend to be minuscule, and it lets you help set the rules of the game rather than sitting down to ask permission.

Like if Microsoft sat down with the publishing companies and said "let's work out copyright licensing for this," it almost certainly would lay the legal framework for forcing trainers to obey existing law, while being potentially much more costly than a punitive lawsuit.

Especially wrt Github and Copilot. Microsoft is deathly afraid of having to follow copyright and licensing rules for training models. They'll probably ban any copyleft license in short order from Github if that starts becoming the rule. They probably have or will have somewhere in their terms that you grant them a separate license in perpetuity for data mining and AI training, as the cost of them hosting your code.

              3 votes
            2. winther
              (edited )
              Link Parent

The issue becomes rather philosophical quickly. Because I agree that humans do more than simply reusing existing works by others. Even if you are heavily inspired by another artist, there is still a person that has a whole life's worth of experiences, feelings and thoughts to add to the mix. AI models are still only built from vast amounts of data produced by others. Some would likely say that it is simply a matter of time before the AI models are fed with enough data to replicate a whole human life. Whether a human's personality can be reduced to just vast amounts of data is a tough question to answer in the courtroom.

              2 votes
        2. [3]
          Eji1700
          Link Parent

          Would a good lawyer be able to make a convincing case out of this?

          Absolutely, but the whole thing is new ground, so there's a ton of cases to be made in all directions. We're not great with novel situations because previous case law doesn't map well at all, but that's often what we fall back on.

          In my ideal world, content creators and AI firms would figure out a solution to have the authors compensated fairly while also allowing models to scrape data freely, but that's never going to happen.

It's not even likely possible. First you need to identify who the original creator was for every piece of training material and track usage. Assuming you could somehow do both to any reasonable standard, what's a reasonable amount to be paid for your content influencing a result? Do you get paid if it doesn't appear in the result? What if it used your article's use of the word "the" rather than a free article's use of the word "the"? Don't they then just weight the entire system to get essentially the same answers without using non-free content?

And on and on and on. It's a massive rat's nest that I don't think has a good answer even if everyone was on the same page of "yeah, we want to pay them".

          6 votes
          1. [2]
            Jordan117
            Link Parent

            What if it used your articles use of the word "the" rather than a free articles use of the word "the"? Don't they then just weight the entire system to get essentially the same answers without using non free content?

            Maybe you're just simplifying things, but this really isn't how these models work at all. They're not mixing and matching pieces from a corpus of training data; in fact after training they make no reference to the training data at all.
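A minimal sketch of that distinction (a toy frequency model, nothing like a real transformer, but the structural point carries): what survives training is only learned statistics, and the corpus itself can be thrown away without affecting generation.

```python
from collections import Counter, defaultdict

# After training, the "model" is only learned statistics (its weights),
# not a stored copy of the corpus. Here the weights are next-character
# counts; the training text is discarded entirely and prediction still
# works from the counts alone.
counts = defaultdict(Counter)
corpus = "abracadabra"
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

del corpus  # the trained model holds no reference to the training data

def next_char(c):
    # most frequent continuation learned during training
    return counts[c].most_common(1)[0][0]

print(next_char("a"))  # 'b': the most common successor of 'a' in training
```

The tension in the thread is that "no reference to the training data" and "can still emit training data verbatim" are both true: the statistics can encode a specific sequence so strongly that generation reproduces it.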

            6 votes
            1. Eji1700
              Link Parent

              Yeah, I'm simplifying a ton, because even in the simplified state there's no easy solution.

              In the much more complex reality it gets even harder to figure out something reasonable, other than "you shouldn't have made it viewable if you didn't want someone to be able to copy and profit from it."

              5 votes
        3. [2]
          vord
          Link Parent

          Would a good lawyer be able to make a convincing case out of this? My layman brain seems to think so.

          The NYT almost certainly has a well-staffed legal team. I really doubt they would attempt to file suit against some of the biggest money out there if they didn't think they had a good shot.

          3 votes
          1. psi
            Link Parent
            In addition, filing a lawsuit gives them access to discovery (assuming the lawsuit isn't immediately dismissed, which it won't be). If OpenAI really did train on the "book3" dataset (a corpus...

            In addition, filing a lawsuit gives them access to discovery (assuming the lawsuit isn't immediately dismissed, which it won't be). If OpenAI really did train on the "Books3" dataset (a corpus containing a couple hundred thousand pirated books), that is the sort of fact that could come out of discovery.

            4 votes
      2. qob
        Link Parent

        Does the same apply if I use a browser extension that removes referral links? Who is the guilty party in that case, me or the extension creator? And what if I don't use the extension and remove the referral codes manually from the URL?

        I'm not taking any side. I hope they both lose somehow. To me, this just shows how broken the copyright industry and the ad-driven internet economy are. One side pretends it can produce completely original content after millennia of shared thoughts and ideas, and the other side is a bunch of rich people who invest their pocket money to become even richer.

        2 votes
  2. [3]
    psi
    Link
    • Exemplary

    Personally I doubt that OpenAI's use of material could be considered fair use.

    However, it also seems obvious to me that the cat's already out of the bag.

    And so I don't think it would be wise to ban all text-based generative AI tomorrow on the basis that these projects are large-scale acts of copyright infringement (in my opinion, visual and audio generative AI are a separate matter). For one, even if we were to ban them, they would still exist online: Llama and the like have been widely disseminated. Moreover, a ban in the Western world won't stop China, Russia, and other authoritarian states from training their own models.

    But more importantly, at this nascent stage of generative AI's development, we have no idea how useful this technology could be. It could potentially herald a new scientific revolution. Or it could potentially lead to an age of misinformation that will send us back to the dark ages. But it would be a shame to kill a potentially world-altering project prematurely on the basis that it engaged in copyright infringement.

    I think discussions about generative AI present a false dichotomy that pits technology against content creators. We shouldn't have to pick a side -- we should be trying to imagine what a solution looks like that balances the interests of both groups as well as the public interest. I don't know what those laws would look like, but in the interest of tearing down this false dichotomy, I would like to suggest some principles.

    1. There should be separate regulations for visual, audio, and textual media. For example, text-based generative content is more likely to be "useful", while visual content is more likely to be "fun". Correspondingly, for visual content the equities should be balanced more favorably toward the content creators.
    2. Similarly, if your goal is to build a "useful" text-based generative AI model, you don't need fiction from the last few decades. If you insist on including fiction that isn't in the public domain, the equities should be balanced in favor of the artists.
    3. When considering non-fiction, we should balance the equities towards the public interest. But at a bare minimum, these generative models should attempt to cite their sources.
    4. To mitigate the economic conflict between content creators and technology owners, the organization owning the technology should either be publicly-owned or non-profit.

    What other principles would you folk suggest? Or do you think this balancing game is a fundamentally hopeless enterprise?

    6 votes
    1. [2]
      nosewings
      Link Parent
      Funny, this is nearly exactly what I said OpenAI's meme-space strategy is.
      1 vote
      1. psi
        Link Parent

        I'm pretty sure Microsoft/OpenAI's position is that everything's fair use and they shouldn't have to compensate anyone, which isn't what I'm advocating for at all.

        4 votes
  3. [6]
    Sodliddesu
    Link

    I'm no lawyer, so this may be standard practice but

    The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.”

    Don't lawsuits usually come with a dollar amount? Can you just say "I'm suing you for... I'll figure it out later," or is this because it's so difficult to quantify the losses (though with the verbatim text it's pretty easy to prove they're at least copying it)?

    Still, can't help but wonder if this'll be an AvP situation or if this is going to be what causes the Government to accidentally rule in favor of regulation.

    6 votes
    1. [2]
      kru
      Link Parent

      Lawsuits don't always have to result in a monetary penalty. When money changes hands to settle a suit, it's called damages (legal relief), and it's common because it's easy to do: figure out the value of the damage that's been caused, then make the aggrieved party whole via cash. There are other forms of relief, such as specific performance, where you're asking the court to force the other party to take some specific action, usually following the terms of a contract. In this case, the NYT is probably looking for injunctive relief, which is asking the court to just force Microsoft and OpenAI to stop using any product made with their stuff.

      18 votes
      1. vord
        Link Parent

        Which will probably cause untold monetary damages, because to properly do that they'd have to retrain the models from scratch.

        1 vote
    2. [2]
      updawg
      Link Parent

      It's especially interesting because this is what I get when I ask ChatGPT 4 to summarize the NYT's article on a specific newsworthy event that I know falls within even GPT-3's training window:

      https://chat.openai.com/share/f8d1e3a3-883c-4f6b-b499-fbc901ed0206

      7 votes
      1. arqalite
        Link Parent

        I think OpenAI definitely put safeguards in place to prevent users from directly requesting information from major publications. I'm sure you can engineer a prompt to give you stuff from the article (verbatim, even) because this stuff is never perfect.

        AFAIK we don't know the exact details of the model used by Bing, but I'm inclined to believe it has different restrictions and instructions than ChatGPT, so it will behave much differently. Unsure if it's Microsoft or OpenAI that makes the decisions on which prompts to block on Bing, but it might be Microsoft, so unless they approve it, OpenAI can't change Bing's behavior.

        9 votes
    3. sparksbet
      Link Parent

      Generally speaking, the specifics of the monetary side of things would be determined after it's established that there actually was a copyright violation, so I don't think it's particularly strange for them to go with a vague "billions" atm. Just saying you're seeking statutory and actual damages in such-and-such a ballpark seems fairly normal at this early stage, especially since the amount for statutory damages could vary by huge orders of magnitude depending on how the court characterizes a single infringing use when it comes to training a generative AI model. This isn't a well-settled area, so if OpenAI is found to be on the hook for copyright infringement, how the court calculates damages is gonna be fascinating.

      6 votes
  4. DonQuixote
    Link

    The larger story is the breakdown of our current economic system with absolutely nothing better to take its place.

    3 votes
  5. patience_limited
    (edited )
    Link

    TechDirt analysis here, from someone who's pretty knowledgeable about copyright law.

    Masnick makes the case that even if OpenAI queries can be constrained to the point that original training material spews out by the paragraph, it's not really different from twisting a student's arm until they yield both their homework and all their textbooks.

    There are important points here:

    Part of the Times complaint is that OpenAI’s GPT LLM was trained in part with Common Crawl data. Common Crawl is an incredibly useful and important resource that apparently is now coming under attack. It has been building an open repository of the web for people to use, not unlike the Internet Archive, but with a focus on making it accessible to researchers and innovators. Common Crawl is a fantastic resource run by some great people (though the lawsuit here attacks them).

    But, again, this is the nature of the internet. It’s why things like Google’s cache and the Internet Archive’s Wayback Machine are so important. These are archives of history that are incredibly important, and have historically been protected by fair use, which the Times is now threatening.

    (Notably, just recently, the NY Times was able to get all of its articles excluded from Common Crawl. Otherwise I imagine that they would be a defendant in this case as well).
    [...]
    Anyway, the larger point is that if the NY Times wins, well… the NY Times might find itself on the receiving end of some lawsuits. The NY Times is somewhat infamous in the news world for using other journalists’ work as a starting point and building off of it (frequently without any credit at all). Sometimes this results in an eventual correction, but often it does not.

    If the NY Times successfully argues that reading a third party article to help its reporters “learn” about the news before reporting their own version of it is copyright infringement, it might not like how that is turned around by tons of other news organizations against the NY Times. Because I don’t see how there’s any legitimate distinction between OpenAI scanning NY Times articles and NY Times reporters scanning other articles/books/research without first licensing those works as well.

    Or, say, what happens if a source for a NY Times reporter provides them with some copyright-covered work (an article, a book, a photograph, who knows what) that the NY Times does not have a license for? Can the NY Times journalist then produce an article based on that material (along with other research, though much less than OpenAI used in training GPT)?

    1 vote