39 votes

Can coding agents relicense open source through a “clean room” implementation of code?

16 comments

  1. [4]
    teaearlgraycold
    I don’t think you can call this a clean room implementation. The original code was almost certainly fed into the LLMs as training data.

    20 votes
    1. rich_27
      What I got from the article was that it wasn't claiming to be a clean room implementation, just that it was claiming to be a verifiably distinct work.

      6 votes
    2. [2]
      d32
      Yes. Hard (impossible?) to prove though.
      Additionally, from the committed logs, the coding agent had at least some access to the original sources.

      4 votes
  2. [3]
    skybrian
    From the article:

    chardet was created by Mark Pilgrim back in 2006 and released under the LGPL. Mark retired from public internet life in 2011 and chardet’s maintenance was taken over by others, most notably Dan Blanchard who has been responsible for every release since 1.1 in July 2012.

    Two days ago Dan released chardet 7.0.0 with the following note in the release notes:

    Ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x. Just way faster and more accurate!

    Yesterday Mark Pilgrim opened #327: No right to relicense this project:

    [...]

    There are several twists that make this case particularly hard to confidently resolve:

    • Dan has been immersed in chardet for over a decade, and has clearly been strongly influenced by the original codebase.

    • There is one example where Claude Code referenced parts of the codebase while it worked, as shown in the plan—it looked at metadata/charsets.py, a file that lists charsets and their properties expressed as a dictionary of dataclasses.

    • More complicated: Claude itself was very likely trained on chardet as part of its enormous quantity of training data—though we have no way of confirming this for sure. Can a model trained on a codebase produce a morally or legally defensible clean-room implementation?

    • As discussed in this issue from 2014 (where Dan first openly contemplated a license change) Mark Pilgrim’s original code was a manual port from C to Python of Mozilla’s MPL-licensed character detection library.

    • How significant is the fact that the new release of chardet used the same PyPI package name as the old one? Would a fresh release under a new name have been more defensible?

    16 votes
    1. [2]
      papasquat
      More complicated: Claude itself was very likely trained on chardet as part of its enormous quantity of training data—though we have no way of confirming this for sure. Can a model trained on a codebase produce a morally or legally defensible clean-room implementation?

      I don't understand how it could possibly be argued that it could. A human being who has seen the original implementation cannot produce a legally defensible clean room implementation. How could you possibly argue that a machine that is explicitly designed to encode its training data within its weights as accurately as possible, and that has been trained on the original implementation, is designing something novel?

      At that point, what's the difference between using an LLM and just photocopying the source code?

      11 votes
      1. skybrian
        That's going too far. The "clean room" stuff is a sideshow. Nobody is required to do a clean room implementation to avoid copyright infringement. Another way to decide this is to compare the two codebases and see how similar or different they are. The trouble is, that's more of a judgement call.

        I think they should offer to rewrite anything that's too close to the original.

        4 votes
  3. Chiasmic
    I think the name staying the same is really significant: it’s the same project in everyone’s mind. It would have been more defensible (but even then still not defensible in my mind) if the name had changed.

    10 votes
  4. [3]
    sparksbet
    This is actually super fascinating to me from a copyright law perspective, which means I'm in the unfortunate circumstance of really wanting someone to sue someone and go to trial so that this can be litigated publicly, even though that's extremely unlikely here.

    9 votes
    1. [2]
      d32
      As the author notes, this approach is bound to be replicated in a higher-stakes corporate environment, à la the 1980s BIOS case they reference. I think we can look forward to litigation.

      7 votes
      1. sparksbet
        True, but in a higher-stakes corporate environment we also have a much higher risk of settlement, which usually means I don't get interesting decisions to read 😞

        8 votes
  5. qob
    Copyright has been dead since copying data became free. It is a dead horse and we keep riding it. Engineers are forced to dance around legal intricacies when they could do actual work. They would literally be more productive if they played with a puppy or watched wallpaper dry.

    My apologies to any law nerds who enjoy this sort of stuff. But please don't force normal people to play your weird games.

    6 votes
  6. [4]
    vord
    This is why the MIT/BSD adherents drive me nuts. They feel entitled to build on the backs of others while not giving their users the same benefit the original author wanted. And they feel that their entitlement justifies trying to undermine GPL projects as much as possible.

    And the kicker is that this isn't even the strong GPL or AGPL where the consequences of embedding are riskier. It's the LGPL, which only covers the modifications to the library itself, not the app that uses it. Literally all you'd have to do is have a customer-facing repo with just the changes to the library.

    The BSDs are as old as Linux. Yet Linux has come to dominate vast swaths of computing stacks, large to small, in a way the BSDs did not. It's not a coincidence.

    2 votes
    1. unkz
      This is why the MIT/BSD adherents drive me nuts. They feel entitled to build on the backs of others while not giving their users the same benefit the original author wanted.

      I consider myself an MIT/BSD adherent, and I don’t quite understand your position. When I release permissively licensed code, I want my users to have near total freedom to use it. That’s why I dislike the GPL, because it is too constraining.

      1 vote
    2. [2]
      skybrian
      (edited)
      The GPL is a licensing hack to get companies to release their source code, from back when source code was scarce and difficult to write. But where we're going, maybe you won't need their source code? Why copy when you can start fresh with less effort, leaving behind their legacy cruft?

      Open source is going to be different in a world of source code abundance. High-quality open source projects with a trusted brand will still be used more, particularly because AI knows how to use them. But any restrictive license, commercial or not, is just a speed bump that the Internet will route around with a not-fork. Maybe they'll also rewrite it in Rust?

      This could be seen as a more radical version of the free software movement that snowballs everything before it with a big ball of freely licensed slop. The kids will wonder why we ever cared about licenses. It's unclear if you can copyright machine-generated source code anyway.

      1. unkz
        Maybe they'll also rewrite it in Rust?

        Brilliant! Pretty hard to complain about that as a copyright violation I suspect.

        2 votes