11 votes

An experiment to test GitHub Copilot's legality

22 comments

  1. [17]
    Macil

    I don't think the concept of finding Copilot generally legal or illegal to use makes sense. It's a tool and intent would be judged in cases using it. I'd expect most snippets it spits out even if they are rote copies aren't significant enough to be copyrighted, but someone that prompts it to get a significant rote output and passes it off as their own work will probably get in trouble. It probably would be a good improvement though for Copilot to detect when it might be too closely matching a single instance of some code and link the user to the source of the code, so the user could make the judgment about whether the result was safe to use in a similar way the user would judge whether to use some code they found from googling.

    I don't really relate to the goal of trying to stop Copilot. What I like about open source is sharing and enabling things like Copilot. I mostly prefer permissive licenses like MIT, though I've considered GPL for more weighty things, but even then I don't think a significant part of the value of my work comes from snippet-sized bits that Copilot might copy or rework. Seeing developers get territorial over that makes me imagine an author going on the warpath to make sure nobody ever, knowingly or not, used the same combination of words as one of their sentences.

    7 votes
    1. [16]
      mtset

      The problem is that Microsoft's contention is that training a neural network on something is fair use, so the licencing is literally irrelevant. Under their theory, source-available software that says "you may not reproduce this source code" can be (and is) included just like MIT-licensed code. If that's proven true in court, big companies will be able to launder the copyright of just about anything. It'll be copyright for me but not for thee.

      11 votes
      1. [4]
        Macil

        It could be fair use to train a neural net on code, but not necessarily fair use in all cases for a user of it to pass off the output of the neural net as their work. Conflating these is like trying to determine if Google results are legal to use in projects just because Google itself is legal.

        6 votes
        1. FlippantGod

          I think the issue is that Copilot is capable of outputting code that is effectively under a copyleft license without retaining that license, which is against the license's terms. So long as the end user publishes (uses?) the code without appending the correct license(s), the end user and/or the neural network itself has violated the license.

          If the fault lies with the user for not manually checking for violations, then Copilot is rendered useless.

          If the fault lies with Microsoft, then they have to change how Copilot works.

          6 votes
        2. mtset

          That's a really good analogy!

          5 votes
        3. vord

          But.. isn't that basically the entire point of why it was created? To have the AI populate as much code for you as possible?

          It's like saying "Don't drink the coffee" after handing the user a coffee.

          2 votes
      2. [11]
        teaearlgraycold

        I think they’re right. The license is irrelevant unless it was never meant to be publicly available. It would be like claiming knowledge gained from reading public code is licensed information and you need to consult the license to derive content from it.

        The difference here is Copilot doesn’t understand what it’s doing like a human would. But the end result could be exactly the same as a human writing code from learnings based on open source code.

        I see this as a step towards the destruction of software IP. And I’m an anarchist when it comes to software IP. People and companies will still write software without the protection of copyright. They’ll just have to accept it’s all de facto free software. Different software will get written. One set of problems exchanged for another.

        3 votes
        1. [5]
          helloworld

          Unfortunately no, it will not be 'de facto free'.

          All the open source and libre code will be laundered, as that is in the training set. Proprietary software will not be, as Windows and Office code is not in the training set.

          In other words, corporations will be able to pawn off the work of volunteers, but their own codebases will remain behind an iron curtain, never contributing back to the world from which they benefit.

          4 votes
          1. [4]
            vektor

            Meanwhile, these same companies will keep these models close to their chest. These models will not be released under the same licensing terms as the data that was used to derive them. IMHO, they are derivative works of GPL software, and should be treated as such. The GPL applies. Give me my copy of Copilot!

            State-of-the-art neural networks can be used to arbitrarily whitewash any input data. If we treat the outputs any differently than we would the inputs, this would blow a very inconvenient hole into copyright. Imagine if reading and writing (i.e. copying) a file (in a complicated way) were considered your own work. Imagine the field day patent trolls would have with this. This legal loophole gives large corporations the power to fuck with FOSS code as they please, while giving nothing back to FOSS.

            ML models, as they exist today, should be treated as copying machines unless proven otherwise. That means the models and their outputs are subject to the copyright of the input data. If an exception such as fair use exists, then that copyright might not apply, but such exceptions have to actually exist. EU copyright law has some provisions for data mining and such that can be used. Where they don't apply? Tough luck. Play by the same rules as the rest of us!

            6 votes
            1. [3]
              skybrian

              This seems to be about "transformative use" not "fair use." Fair use is about quoting small excerpts. For example, if you're writing a book review you can quote a long passage from the book, giving appropriate credit. Or consider writing a search engine that displays small snippets of the original in search results. Those are partial copies, but they are allowed.

              Transformative use is more like when you read something and rewrite it in your own words. If you don't copy then that actually does get rid of the copyright violation and you don't need a license. It might still be patent infringement.

              (But I'm not a lawyer. Also, I wonder what the relationship is between "transformative use" and "derivative works"?)

              1 vote
              1. [2]
                vektor

                From what you say, I think you could argue fair use or transformative use, depending on how you view the outputs of Copilot. Personally, I'm very firmly in camp fair use (again, if I understand you and US copyright law correctly...) as I don't think any ML model these days has the capacity to truly "rewrite it in your own words", at least not reliably. There's no reasoning behind it, no understanding of semantics, just syntactic patterns. That's not an actual transformation, that's just a copy.

                Maybe when models have advanced substantially: Proper reasoning and smaller models - small enough you can't overfit them to a significant fraction of your training set. Then we can talk about transformative use again.

                2 votes
                1. skybrian

                  Can it help the programmer "write in their own words" though? I mean, it's autocomplete. We generally don't think of ourselves as plagiarizing when we use autocomplete, even though it has been trained on lots of other people's text.

                  2 votes
        2. [4]
          vord

          "I think they’re right. The license is irrelevant unless it was never meant to be publicly available."

          Pretty sure I can't train an AI on Google Maps data and then ask that AI questions to populate OpenStreetMap. That would be a violation of Google's licensing of their Google Maps data.

          Copilot is doing the same. I'd wager most of the internal neural network ended up being 'search title and comments for matching pattern, and spit out generic form of it.' It's training on data, regardless of the source's license, and then powering a tool explicitly designed to pump out generic forms of that code. The only grey area is that code developers likely didn't think of this use case decades ago.

          They might not be violating the letter of the law, but they sure as hell are violating the spirit. If they wanted to go above-board, it would be an opt-in feature, only accessible to nonderivative works...giving time for devs to modify their licenses accordingly (to ban or allow it).

          Tell you what, MS: open source Office and Windows under MIT, and maybe I'll believe for an instant that you'd stand by your legal arguments if they had to apply to you.

          4 votes
          1. [3]
            skybrian

            That's a different task, though. To fix up the analogy, if you wanted to create maps of fantasy worlds then you might be able to train it on Google Maps data, since you're not copying. The output is arguably a transformative use.

            But if the fantasy worlds sometimes ended up with regions that look just like real places, you might be in trouble.

            As far as the "spirit of the law" goes, writers often learn things by reading other people's books. You are allowed to write your own books that are based on other books you read, as long as you use your own words (or quote accurately, under fair use).

            That's what these systems are trying to do. They're not trying to copy code word-for-word, they're trying to learn from it. However, sometimes they don't work right.

            2 votes
            1. [2]
              vord

              "That's a different task, though"

              Is it really? The AI isn't learning. It's a fancy I/O filter. If I train the AI on a single complicated code base, it'll only spit out variations from that code base.

              The only difference is that feeding it millions of examples just lets it spit out millions of variations and not include the original sources, making it harder to trace.

              It's the equivalent of high school essay cheating by copying an article verbatim, then running it through a thesaurus.

              2 votes
              1. skybrian

                From the examples I've seen, I don't think it's that simple.

                1 vote
        3. mtset

          "I see this as a step towards the destruction of software IP. And I’m an anarchist when it comes to software IP."

          You are much more optimistic than I am when it comes to equal application of the law, then.

          "It would be like claiming knowledge gained from reading public code is licensed information and you need to consult the license to derive content from it."

          People do that, at least with patented code; hence the practice of clean-room development.

          4 votes
  2. [2]
    Comment deleted by author
    1. skybrian

      It seems like it would be fairly difficult to write any code in this way that infringes on Microsoft's copyright in a way they care about? For the most part, it would just look like random code that conforms to Microsoft's internal style. Writing code in the same style as Microsoft isn't a copyright violation.

      More generally, you're making a consistency argument. I think many people are often inconsistent about copyright and they hardly even notice. (I won't say hypocritical because I'm hardly pure here, and besides, the arguments are often made by different people.)

      For example, I think most people know that Sci-Hub is illegal, but they also don't mind that kind of copyright violation very much, even though it's verbatim copying of articles. We can distinguish between what's illegal and what's immoral. From a moral perspective, most people think it's kind of cool to violate copyright in this way, because a free repository of scientific papers should exist. They don't agree with the law and it's considered civil disobedience.

      On the other hand, consider Google Books. That was doing wholesale copying of books and that part got shut down. Now it's just a book search engine, which will show pages from books (under fair use, I guess) but not the whole book. It seems like it's a tragedy that no "free library in the cloud" exists? But people mostly decide whether it's okay or not (morally, rather than legally) based on who was doing it.

      Also, just today someone posted a fork of a tool for downloading YouTube videos, which, although it can have legal purposes, is mostly a tool for copyright violation. (Not to mention that many videos uploaded to YouTube are copyright violations, and we watch them anyway.)

      Anyone going around saying that you should stop copying music because it's morally wrong would probably be considered a prig, as lampooned in Weird Al's Don't Download This Song.

      Even on Tildes, we often link to Twitter loop unrolling services and news archiving sites to avoid paywalls.

      And I, personally, often include long quotes from articles. I don't quote the whole thing (I think that would be wrong) but it's pushing the limits. (I won't make any claim as to when it goes too far.)

      And I never get pushback for that for legal or moral reasons. Nobody says I should stop doing it to avoid possibly getting Tildes into legal trouble, or that it's morally wrong for me to make long quotes from copyrighted articles. The only pushback I've occasionally gotten is from people who think that long quotes don't add anything to the conversation.

      So I get that some people feel strongly about Copilot possibly copying their code. The emotions are real. But I wonder if it's part of a coherent philosophy, or if it's just based on who's doing it. Meanwhile we're all looking for useful code snippets from the Internet, from Stack Overflow or GitHub or elsewhere. It's rarely copied as-is (for stylistic reasons if nothing else) but it seems as likely to be a "violation" as what Copilot does?

      Incidentally, there is apparently a setting in Copilot for "duplicate detection." The FAQ says:

      We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment.

      It seems like this should be on all the time? (And that quote is more than 150 characters. I hope I don't get into trouble!)
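
      To make the quoted description concrete, here is a rough sketch of how a filter like that could work. This is only a guess based on the FAQ wording, not GitHub's actual implementation: the function names (`normalize`, `build_index`, `should_suppress`) and the in-memory set of snippets are hypothetical, and it only handles exact matches after whitespace is removed, not "near matches".

```python
import re

# Approximate match length mentioned in the quoted FAQ (assumption: exact value unknown).
WINDOW = 150


def normalize(code: str) -> str:
    """Drop all whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", "", code)


def build_index(public_snippets, window=WINDOW):
    """Collect every whitespace-free window of `window` characters from the corpus."""
    index = set()
    for snippet in public_snippets:
        text = normalize(snippet)
        for i in range(len(text) - window + 1):
            index.add(text[i:i + window])
    return index


def should_suppress(suggestion, index, window=WINDOW):
    """Return True if any window of the suggestion also appears in the public corpus."""
    text = normalize(suggestion)
    return any(
        text[i:i + window] in index
        for i in range(len(text) - window + 1)
    )
```

      A suggestion shorter than the window can never match, which fits the idea that tiny snippets are not the concern. A real service would need something far more compact than a set of raw strings to cover all public code.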

      4 votes
  3. [4]
    Eric_the_Cerise

    Just as a side-note, I kind of have a problem with someone writing a long, detailed, potentially viable proposal to set legal precedent, and then prefacing the whole thing with an extensive "this is supposed to be satire" content warning. If you feel the need to "explain the joke", you probably aren't telling it right.

    And yeah, I know, satire and the Internet don't seem to mix well, but as Al Franken and others have noted, it's most often the object of the satire that "doesn't get it". In this case, I'm hard-pressed to even identify the object of the satire.

    4 votes
    1. [2]
      vektor

      Yeah, that stood out to me as well. What parts of it are satire? Why would that be wrong? Dude seems to have a strong opinion about what's correct and false here, but doesn't say.

      I get that the attempt at creating precedent would probably not work, but would the author still consider it a reasonable illustration of the problem? The "either it's valid and we can whitewash copyrighted material this way, or it isn't, and the GPL of works in the training set holds" part I mean.

      3 votes