They seem to be going for an easier case here (compared to the harder case of arguing that LLM companies are engaging in copyright infringement by using copyrighted content to train their models without permission), which is when the model regurgitates copyrighted content verbatim.
Universal and two other music companies allege that Anthropic scrapes their songs without permission and uses them to generate “identical or nearly identical copies of those lyrics” via Claude, its rival to ChatGPT.
If they win, my guess is this will throw a spanner in the works of a lot of LLM projects, because there will always be the threat that media companies will sue for copyright infringement.
It seems that this should be avoidable by various technical means, without interfering with the LLM itself. For example, a set of fingerprints can be generated for the token strings (pieces of lyrics) used during training. Then LLM output can be compared with these fingerprints, and output that is too similar in the legal sense can be discarded.
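A minimal sketch of that fingerprint idea, using word n-gram "shingles" hashed into a set (the 5-word shingle size and the toy corpus are illustrative assumptions, not anything from the article):

```python
import hashlib

def shingles(text, n=5):
    """Split text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def fingerprint(text, n=5):
    """Hash each shingle into a compact fingerprint set."""
    return {hashlib.sha1(s.encode()).hexdigest()[:16] for s in shingles(text, n)}

def overlap_ratio(output_text, corpus_fingerprints, n=5):
    """Fraction of the output's shingles that appear in the protected corpus."""
    fp = fingerprint(output_text, n)
    if not fp:
        return 0.0
    return len(fp & corpus_fingerprints) / len(fp)

# Build fingerprints once from the protected lyrics (toy example).
corpus = fingerprint("never gonna give you up never gonna let you down")

# Verbatim regurgitation scores 1.0; unrelated text scores 0.0.
print(overlap_ratio("never gonna give you up never gonna let you down", corpus))
print(overlap_ratio("the quick brown fox jumps over the lazy dog", corpus))
```

An output scoring above some threshold would be discarded before it reaches the user. Of course, picking that threshold is exactly the "too similar in a legal sense" problem raised below.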
"Too similar" in a legal sense is the multi-million-dollar question/problem.
There’s always a ton of awkward grey area with similar-sounding songs, and where creativity begins and copyright ends, and that’s before you can legally prove beyond a shadow of a doubt that the possibly infringed material was even used in the training data.
Ah yes, an unstoppable force (tech) meets an immovable object (music) lmao
Link without paywall: https://archive.ph/LcdWl