I don't think this database clears my threshold for a clean LLM.
The "openly licensed" section of the paper requires that the source's license must allow for use, retribution, modification, separation (the work is allowed to be used in smaller parts), compilation (The Pile is a compilation of openly licensed data) for any purpose without charges. Which they wrote that CC-BY, CC-BY-SA and the MIT license are part of it. However, those license requires that the the distribution of the work must retain the copyright & license statement, in which the paper wrote
"Finally, we note that while it is relatively straightforward to obey attribution requirements when redistributing data, attributing model predictions back to the training data points that influenced them remains an active area of research."
So if you use the model and it happens to generate MIT-licensed code or CC-BY content, you are still required to work out the correct license notice to add.
Do you mean that you don't think any LLM could be cleanly trained off this dataset at all, or that proper implementation is required to keep an LLM clean beyond simply training on this dataset, and that the paper authors have not created clean LLMs here?
I can imagine an implementation that compares the output against the training set and provides proper attribution in case of duplication, that seems clean to me at first thought, though not what these folks did.
There's a lot I don't know about licensing, I mean to further understand, not to antagonize your position.
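To make that comparison idea concrete, here's a minimal sketch of the kind of check I have in mind. Everything here is hypothetical: trainingIndex stands in for a real index over the training set, and a real system would need fuzzy matching rather than exact lines.

// Hypothetical duplication check: flag output lines that also occur in the
// training set so the caller can attach the right license notice.
data class Attribution(val line: String, val source: String, val license: String)

fun attributeDuplicates(
    output: String,
    trainingIndex: Map<String, Pair<String, String>> // line -> (source, license)
): List<Attribution> =
    output.lines()
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .mapNotNull { line ->
            trainingIndex[line]?.let { (source, license) ->
                Attribution(line, source, license)
            }
        }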
I think the legal issues have three parts (a small sketch after this list illustrates the last point):

1. Whether training on copyrighted data is allowed at all. Currently the big AI companies are arguing that it is fair use (see the NYT vs. OpenAI case). This particular model uses permissively licensed data whose terms do not specifically forbid AI training, so it is clean regardless of how the fair use question resolves.

2. Whether the resulting model is a derivative work of the original works. In NYT vs. OpenAI, OpenAI claims the resulting model is transformative and not a derivative work. My stance is that I don't believe that. For this particular model, they redistribute the entire training dataset ("the Common Pile") along with the copyright information, so that should suffice, as long as it is clear which exact version of the dataset was used.

3. Who owns the copyright to LLM output. Authors have been suing the big AI companies on this point: a model may generate paywalled articles, snippets from copyrighted books, source code without following its license, and so on. For this particular model, the paper says they don't know how to solve this today. As another comment mentions, one mitigation in use today is to search for the output and block it from the end user if it matches other works. In my view, if you take some source code, rename the variables, shuffle blocks of code around, and flip loop conditions, that is still plagiarism, yet code search might not be able to identify it. Kind of like making pop art.
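To illustrate the last point: even a fingerprint that is robust to identifier renaming (a minimal, hypothetical sketch below, nothing like a production detector) is still defeated by shuffled blocks or flipped loop conditions.

// Hypothetical sketch: a fingerprint that survives identifier renaming by
// collapsing every identifier to a placeholder before hashing.
val identifier = Regex("""\b[A-Za-z_][A-Za-z0-9_]*\b""")

fun fingerprint(code: String): Int =
    code.replace(identifier, "ID")          // erase all names
        .filterNot { it.isWhitespace() }    // and all formatting
        .hashCode()

fun main() {
    val original = "for (i in events) println(i.message)"
    val renamed = "for (x in items) println(x.message)"
    println(fingerprint(original) == fingerprint(renamed)) // true: rename caught
    // But reordering statements or rewriting the `for` as a `while` still
    // changes the fingerprint, so this alone can't catch the "pop art" cases.
}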
Thank you kindly for the breakdown! That makes sense. I like how you described the big-picture stuff folks are familiar with and then brought it back to their model.
I think the point is that this LLM will still not be abiding by many of these licenses unless it is able to properly attribute information it summarizes or otherwise draws from a given text, given that many of these open licenses still require attribution.
I'm not knowledgeable enough about licensing to know exactly what's necessary to meet that requirement for these common licenses, but even if it is lacking in this respect, from my perspective it's still a huge step in the right direction.
A summary or paraphrase isn’t a copyright violation. You can’t copyright ideas, just the particular way they were expressed. It might be plagiarism if not credited, though.
I never claimed that a summary or paraphrase was a copyright violation, nor said anything about plagiarism. I was simply attempting to summarize/re-word someone else's point about abiding by the attribution requirements of a license.
I'm not sure that it is strictly necessary to attribute the output of the model each time (I suspect attributing all the sources with relevant licenses on an about page somewhere would be sufficient to cover the terms of these licenses) but I don't know enough about these licenses to say that for sure, so I didn't want to make a strong claim on that front.
GitHub Copilot has a duplication detection filter that might help. I don't know how well it works, though.
Replying to my own comment here. I've been thinking of a way to solve this problem, and an idea just came to mind.
There's a mode in Aider called "Architect mode", in which two AI models pair-program with each other. The "architect" generates the solution, while the "editor" actually generates the code. They implemented this because o1 had excellent code reasoning ability but poor code generation. In a way I think this is like clean-room reverse engineering: you could use the tainted o3/o4 to generate a description of the code you want (but never the actual code), then use a permissively trained, instruction-tuned model to generate the actual code. Then in the project's license you point to the model's training data sources so that people can retrieve the list of potentially copyrighted code.
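Roughly, the flow I have in mind looks like this; LlmClient and complete() are stand-ins for whatever API the two models expose, not anything real:

// Hypothetical sketch of the two-stage "clean room" flow.
interface LlmClient {
    fun complete(prompt: String): String
}

fun cleanRoomGenerate(architect: LlmClient, editor: LlmClient, task: String): String {
    // Stage 1: the tainted model may only describe the code, never write it.
    val spec = architect.complete(
        "Describe, as numbered comments only and with no code, how to: $task"
    )
    // Stage 2: the permissively trained model turns the description into code.
    return editor.complete("Implement exactly this description:\n$spec")
}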
Update 1: I ran into several problems here.
First, Aider's prompts are very long, and o3-mini still generates some code. I tried hacking the config to add "only generate comments but never code", but that instruction seems to get lost in the elaborate prompts.
Second, I have a feeling that open models' 8k context limits will restrict their usefulness on real-world code (especially Java), even if the instructions are very precise. For comparison, many commercial models have 128k limits.
Lastly, these models are not instruction-tuned; i.e., they're not chat models, only autocomplete. There isn't enough public data to train that. From what I've seen around the Starcoder2 ecosystem, there are three attempts:
The first is StarChat by HuggingFace. They trained it on a database of ChatGPT 3.5 conversations (UltraChat). I don't think this is clean.
The second is Starcoder2's own instruction-tuned model. They use the model itself to generate a description of some sample code, then generate instructions from that description; this becomes the training data for the secondary fine-tuning.
Lastly, OctoCoder uses a database of filtered commit messages from permissively licensed repos, plus the OpenAssistant database, where volunteers role-play both the AI and the users.
Right now OctoCoder still requires a very specific prompt, as it is still not a conversational model. We'll see if I can ask o3 to generate compatible prompts.
Update 2: I did a PoC and I don't know what to make of it. I used JetBrains AI (since they bundled a license with my purchase) to write an entire block of code, but told it to only write comments.

I got this:
// 1. Launch another coroutine within the current scope whose purpose is to observe the
// `channel` declared above.
// 2. Inside that coroutine, iterate over the channel using `for (event in channel)` so the
// loop suspends until a new `ChatEvent` instance is received.
// 3. For every emitted `event`, apply a `when` expression to determine its concrete subtype.
// 4. When the subtype is `ChatEvent.ChatMessage`, extract `timestamp`, `author`, and `message`
// and compose a human-readable string, e.g. "`[HH:mm:ss] author: message`".
// 5. Print the constructed string via `println` so each chat message appears on the console
// as soon as it arrives.
// 6. Optionally add additional `when` branches for other `ChatEvent` subclasses—either
// printing them in a different format or simply ignoring them if not relevant.
I then fed it to Starcoder2 Instruct, which managed to complete the block very satisfyingly.
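For reference, a plausible completion of that comment block might look like the following. This is my own sketch, not StarCoder2's actual output, and the ChatEvent type and channel are shapes I'm assuming from the comments:

import java.time.LocalTime
import java.time.format.DateTimeFormatter
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Hypothetical shapes implied by the comments; the surrounding code
// from the PoC isn't shown, so these are assumptions.
sealed class ChatEvent {
    data class ChatMessage(val timestamp: LocalTime, val author: String, val message: String) : ChatEvent()
    data class UserJoined(val author: String) : ChatEvent()
}

fun main() = runBlocking {
    val channel = Channel<ChatEvent>()

    // Steps 1-2: launch a coroutine that iterates over the channel,
    // suspending until the next ChatEvent arrives.
    launch {
        for (event in channel) {
            // Steps 3-6: dispatch on the concrete subtype.
            when (event) {
                is ChatEvent.ChatMessage -> {
                    val time = event.timestamp.format(DateTimeFormatter.ofPattern("HH:mm:ss"))
                    println("[$time] ${event.author}: ${event.message}")
                }
                is ChatEvent.UserJoined -> println("* ${event.author} joined") // optional branch
            }
        }
    }

    channel.send(ChatEvent.ChatMessage(LocalTime.now(), "alice", "hello"))
    channel.close()
}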
At this point I feel like that comment block is almost a human-readable version of the code's AST, and Starcoder is only there to convert the AST back to code. I'm not sure about the legality of that block now: if I had written it myself, I'd know I had never seen other code like it, but since o3 wrote the description I can't say with 100% certainty that it isn't describing some other code block(s) it has seen.
In a previous iteration I didn't tell o3 to emit identifiers, which seems better, but since Starcoder only receives the current function it may then choose the wrong methods to call.
Here's the abstract:

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

It seems like a good start.
Assuming that 8TB is the raw, actual text and not metadata/compression and picking some numbers arbitrarily...
8 TB / (8 B/word) / (500 word/minute) / (60 minute/hour) / (16 hour/day) / (365.22 day / year) = 5,700 years to read all of that. Only 3,800 years if you skip sleeping!
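A quick sanity check of that arithmetic (same arbitrary numbers):

fun main() {
    val words = 8.0e12 / 8            // 8 TB of raw text at ~8 bytes per word
    val hours = words / 500 / 60      // 500 words per minute
    println(hours / 16 / 365.22)      // reading 16 h/day -> ~5704 years
    println(hours / 24 / 365.22)      // no sleep at all  -> ~3803 years
}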
I imagine some of those legal documents could get a bit dry but at least you can mix in those IRC chat logs to keep it exciting!
Perhaps an AI could read it and summarize it for us. :)

An Ouroboros summary if you will.

Now you really peaked my interest. What will be the one sentence that summarizes 8 TB of text?

"Mostly harmless?"

(off topic but) the word is "piqued".

Thanks!
It would be neat if there was a drop box where people could just add more text to the data set, like a perpetual stew of language. (I can see the obvious issues with this though)
You don’t even have to train a new model for this. I’m fairly certain that 19th century British fiction is part of the data set that the large LLMs were trained on.
There's a technique called LoRA (Low-Rank Adaptation) where you can take the weights of an existing model and add an extra low-rank layer that "enhances" the weights associated with 19th century British literature. It's how the Ghibli-style image generation was created from a generalized image generation model without having to retrain a new model from scratch.
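Concretely, LoRA keeps the pretrained weight matrix W frozen and learns a low-rank additive update; in the standard formulation, with rank r and scaling factor α:

W' = W + (\alpha / r) \, B A, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)

Only A and B are trained, which is why adapting a large model to a narrow style is so much cheaper than retraining from scratch.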