25 votes

The Common Pile v0.1: An 8TB dataset of public domain and openly licensed text

19 comments

  1. [9]
    whs
    Link
    • Exemplary

    I don't think this database clears my threshold for a clean LLM.

    The "openly licensed" section of the paper requires that the source's license must allow for use, retribution, modification, separation (the work is allowed to be used in smaller parts), compilation (The Pile is a compilation of openly licensed data) for any purpose without charges. Which they wrote that CC-BY, CC-BY-SA and the MIT license are part of it. However, those license requires that the the distribution of the work must retain the copyright & license statement, in which the paper wrote

    Finally, we note that while it is relatively straightforward to obey attribution requirements when redistributing data, attributing model predictions back to the training data points that influenced them remains an active area of research

    So, if you use the model and it happens to generate MIT-licensed code or CC-BY content, you're still required to work out which license notice to add.

    16 votes
    1. [6]
      Carrow
      (edited )
      Link Parent

      Do you mean that you don't think any LLM could be cleanly trained off this dataset, or that proper implementation is required to keep the LLM clean beyond simply training off this dataset, and that the paper authors have not created clean LLMs here?

      I can imagine an implementation that compares the output against the training set and provides proper attribution in case of duplication. That seems clean to me at first thought, though it's not what these folks did.
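
      Something like this minimal sketch is what I'm picturing (hypothetical names and a naive n-gram index, not anything these folks built):

      // Index the training set by token n-grams; any training document that
      // shares an n-gram with the model output is a candidate whose license
      // notice may need to be attached. TrainingDoc is a made-up type.
      data class TrainingDoc(val id: String, val text: String, val license: String)

      class AttributionIndex(docs: List<TrainingDoc>, private val n: Int = 8) {
          private val index = HashMap<List<String>, MutableList<TrainingDoc>>()

          init {
              for (doc in docs) {
                  val tokens = doc.text.split(Regex("\\s+"))
                  for (gram in tokens.windowed(n)) {
                      index.getOrPut(gram) { mutableListOf() }.add(doc)
                  }
              }
          }

          fun attribute(output: String): Set<TrainingDoc> =
              output.split(Regex("\\s+")).windowed(n)
                  .flatMap { gram -> index[gram].orEmpty() }
                  .toSet()
      }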

      There's a lot I don't know about licensing, I mean to further understand, not to antagonize your position.

      5 votes
      1. [2]
        whs
        Link Parent

        I think the legal issue has three parts:

        1. Whether training on copyrighted data is allowed at all. Currently the big AI companies are trying to say that it is fair use (see the NYT vs OpenAI case). This particular model uses permissively licensed data whose licenses do not specifically forbid AI training, so it is clean regardless of how the fair-use question resolves.
        2. Whether the resulting model is a derivative work of the original works. In NYT vs OpenAI, OpenAI claims that the resulting model is transformative and not a derivative work of the originals. My stance is that I don't believe that. In this particular case, they redistribute the entire training dataset ("The Common Pile") along with the copyright information, so that should suffice, as long as it is clear which exact version of the dataset was used.
        3. Who owns the copyright to LLM output. Authors have been suing big AI companies on this point: models may generate paywalled articles, snippets from copyrighted books, source code without its required license notice, etc. For this particular model, the paper writes that they don't know how to solve this today. As another comment mentions, one solution in use today is to search for the output and block it from the end user if it matches other works. In my view, if you take source code, rename the variables, shuffle blocks of code around, and flip loop conditions, that is still plagiarism, but code search might not be able to identify it (see the sketch after this list). Kinda like making pop art.
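
        To make that last point concrete, here's a toy sketch (my own hypothetical approach, nothing from the paper) of why plain text search fails: canonicalize identifiers before comparing, and two "renamed" snippets come out identical, which a text search would miss.

        // Toy near-duplicate check: map every identifier to a positional
        // placeholder so renaming variables alone can't hide a copy.
        // (Keywords get mapped too, which is fine for this demo.)
        fun canonicalize(code: String): String {
            val ident = Regex("[A-Za-z_][A-Za-z0-9_]*")
            val names = HashMap<String, String>()
            return ident.replace(code) { m ->
                names.getOrPut(m.value) { "id${names.size}" }
            }
        }

        fun main() {
            val original = "for (i in items) total += i.price"
            val renamed = "for (x in rows) sum += x.price"
            println(original == renamed)                              // false
            println(canonicalize(original) == canonicalize(renamed)) // true
        }
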
        8 votes
        1. Carrow
          Link Parent

          Thank you kindly for the breakdown! That makes sense, I like how you described big picture stuff folks are familiar with and then brought it back to their model.

      2. [3]
        sparksbet
        Link Parent

        I think the point is that this LLM will still not be abiding by many of these licenses unless it is able to properly attribute information it summarizes or otherwise draws from a given text, since many of these open licenses require attribution.

        I'm not knowledgeable enough about licensing to know exactly what's necessary to meet that requirement for these common licenses, but even if it is lacking in this respect, from my perspective it's still a huge step in the right direction.

        2 votes
        1. [2]
          skybrian
          Link Parent

          A summary or paraphrase isn’t a copyright violation. You can’t copyright ideas, just the particular way they were expressed. It might be plagiarism if not credited, though.

          4 votes
          1. sparksbet
            Link Parent

            I never claimed that a summary or paraphrase was a copyright violation, nor said anything about plagiarism. I was attempting to summarize/re-word someone else's point about abiding by the attribution requirements of a license.

            I'm not sure that it is strictly necessary to attribute the output of the model each time (I suspect attributing all the sources with relevant licenses on an about page somewhere would be sufficient to cover the terms of these licenses) but I don't know enough about these licenses to say that for sure, so I didn't want to make a strong claim on that front.

            2 votes
    2. skybrian
      Link Parent

      GitHub Copilot has a duplication detection filter that might help. I don't know how well it works, though.

      2 votes
    3. whs
      (edited )
      Link Parent
      • Exemplary

      Replying to my own comment here. I've been thinking of a way to solve this problem, and an idea just came to mind.

      There's a mode in Aider called "Architect mode", in which two AI models pair-program with each other. The "architect" generates the solution, while the "editor" actually generates the code. They implemented this because o1 had excellent code reasoning ability but poor code generation. In a way I think this is like clean-room reverse engineering: you could use the tainted o3/o4 to generate a description of the code you want (but never the actual code), then a permissively licensed, instruction-tuned model to generate the actual code. Then in the project's license you point to the model's training data source so that people can retrieve the list of potentially copyrighted code.
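
      As a rough sketch of that clean-room split (the llm() helper and both model names are stand-ins, not a real API):

      // Clean-room split: the "architect" may only emit prose/comments; the
      // "editor" sees only that description, never the tainted model's code.
      // llm() is a hypothetical call to whatever serving API you use.
      fun llm(model: String, system: String, user: String): String =
          TODO("call your model-serving API of choice")

      fun cleanRoomGenerate(task: String): String {
          val spec = llm(
              model = "tainted-architect",  // e.g. o3; its text never ships
              system = "Describe the code to write as numbered comments. Never emit actual code.",
              user = task,
          )
          // Only the natural-language spec crosses the boundary.
          return llm(
              model = "permissive-editor",  // e.g. a Starcoder2-class model
              system = "Implement exactly what the comments describe.",
              user = spec,
          )
      }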


      Update 1: I ran into several problems here.

      First, Aider's prompts are very long, and o3-mini still generates some code. I tried hacking the config to add "only generate comments but never code", but it seems that instruction got lost in the elaborate prompts.

      Second, I have a feeling that open models' 8k context limit will limit their usefulness on real-world code (esp. Java) even if the instruction is very precise. For comparison, many commercial models have 128k limits.

      Lastly, these models are not instruction tuned, i.e. they're not chat models, only autocomplete. There isn't enough public data to train that. From what I've seen around the Starcoder2 ecosystem, there are 3 attempts:

      The first is StarChat by HuggingFace. They train it on a database of ChatGPT 3.5 conversations ("UltraChat"). I don't think this is clean.

      The second is StarCoder2's own instruction-tuned model. They use the model itself to generate a code description of sample code, then generate instructions from it; that becomes training data for the secondary training (rough sketch below).

      Lastly, OctoCoder uses a database of filtered commit messages from permissively licensed repos, plus the Open Assistant database, where volunteers role-play both AI and users.

      Right now OctoCoder still requires a very specific prompt, as it is still not a conversational model. We'll see if I can get o3 to generate compatible prompts.
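
      That second, self-generated approach is roughly this loop (hypothetical helper names; the real pipeline has filtering and deduplication steps I'm glossing over):

      // Self-instruct style bootstrap: the base model describes permissively
      // licensed seed code, and the (description, code) pairs become the
      // instruction-tuning data. describe() is a hypothetical model call.
      data class InstructionPair(val instruction: String, val response: String)

      fun describe(model: String, code: String): String =
          TODO("prompt the base model: 'Describe what this code does'")

      fun buildSelfInstructData(model: String, seeds: List<String>): List<InstructionPair> =
          seeds.map { code ->
              // The description becomes the instruction; the clean, permissively
              // licensed code becomes the target answer.
              InstructionPair(instruction = describe(model, code), response = code)
          }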


      Update 2: I did a PoC and I didn't know what to make of it. I used Jetbrains AI (cause they bundled the license with my purchase) to write an entire block of code, but told it to only write comments.

      I got this:

      // 1. Launch another coroutine within the current scope whose purpose is to observe the
      //    `channel` declared above.
      // 2. Inside that coroutine, iterate over the channel using `for (event in channel)` so the
      //    loop suspends until a new `ChatEvent` instance is received.
      // 3. For every emitted `event`, apply a `when` expression to determine its concrete subtype.
      // 4. When the subtype is `ChatEvent.ChatMessage`, extract `timestamp`, `author`, and `message`
      //    and compose a human-readable string, e.g. "`[HH:mm:ss] author: message`".
      // 5. Print the constructed string via `println` so each chat message appears on the console
      //    as soon as it arrives.
      // 6. Optionally add additional `when` branches for other `ChatEvent` subclasses—either
      //    printing them in a different format or simply ignoring them if not relevant.
      

      I then fed it to Starcoder2 Instruct, which managed to complete the block very satisfyingly.
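
      The completion was roughly along these lines (an illustrative reconstruction, not the verbatim output; ChatEvent, channel, and the enclosing CoroutineScope are assumed from the comments above):

      // Hypothetical completion matching the numbered comments; formatting the
      // timestamp as HH:mm:ss depends on whatever type `timestamp` actually is.
      launch {
          for (event in channel) {
              when (event) {
                  is ChatEvent.ChatMessage ->
                      println("[${event.timestamp}] ${event.author}: ${event.message}")
                  else -> { /* ignore other ChatEvent subtypes */ }
              }
          }
      }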

      At this point I feel like that block is almost a human-readable version of the code's AST, and Starcoder is only there to convert the AST back to code. I'm not sure of the legality of that code block now: if I had written it myself I'd know I'd never seen other code like it, but since o3 wrote the description I can't say with 100% certainty that it isn't describing other code block(s) it has seen.

      In the previous iteration I didn't tell o3 to emit identifiers, which seems better, but since Starcoder only receives the current function it may choose the wrong methods to call.

      1 vote
  2. skybrian
    Link

    Here's the abstract:

    Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.

    It seems like a good start.

    20 votes
  3. [7]
    1338
    Link

    Assuming that 8TB is the raw, actual text and not metadata/compression and picking some numbers arbitrarily...

    8 TB / (8 B/word) / (500 word/minute) / (60 minute/hour) / (16 hour/day) / (365.22 day/year) = 5,700 years to read all of that. Only 3,800 years if you skip sleeping!

    I imagine some of those legal documents could get a bit dry but at least you can mix in those IRC chat logs to keep it exciting!

    15 votes
    1. [6]
      skybrian
      Link Parent

      Perhaps an AI could read it and summarize it for us. :)

      12 votes
      1. Englerdy
        Link Parent

        An Ouroboros summary if you will.

        4 votes
      2. [4]
        Deely
        Link Parent

        Now you really peaked my interest. What will be the one sentence that summarizes 8TB of text?

        2 votes
        1. [2]
          fidwell
          Link Parent

          (off topic but) the word is "piqued".

          6 votes
  4. chili-man
    Link

    It would be neat if there was a drop box where people could just add more text to the data set, like a perpetual stew of language. (I can see the obvious issues with this though)

    8 votes
  5. [2]
    Comment deleted by author
    Link
    1. ackables
      Link Parent

      You don’t even have to train a new model for this. I’m fairly certain that 19th century British fiction is part of the data set that the large LLMs were trained on.

      There’s a technique called LoRA (Low-Rank Adaptation) where you take the weights of an existing model and train a small low-rank update on top that “enhances” the weights associated with 19th century British literature. It’s how the Ghibli-style image generation was created from a generalized image generation model without having to retrain a new model from scratch.
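
      At its core, LoRA keeps the pretrained weight matrix W frozen and learns two small matrices A and B so the adapted layer computes y = Wx + (alpha/r) * B(Ax). A toy sketch of just that forward pass (illustrative names and dimensions, plain arrays, not any real training framework):

      // Toy LoRA forward pass: W (d x k) stays frozen; only A (r x k) and
      // B (d x r) would be trained, touching d*r + r*k parameters instead
      // of d*k. All names and dimensions here are illustrative assumptions.
      fun matVec(m: Array<DoubleArray>, v: DoubleArray): DoubleArray =
          DoubleArray(m.size) { i -> m[i].indices.sumOf { j -> m[i][j] * v[j] } }

      fun loraForward(
          w: Array<DoubleArray>,   // frozen pretrained weights, d x k
          a: Array<DoubleArray>,   // trainable low-rank factor, r x k
          b: Array<DoubleArray>,   // trainable low-rank factor, d x r
          x: DoubleArray,          // input activation, length k
          alpha: Double,           // scaling hyperparameter
      ): DoubleArray {
          val base = matVec(w, x)              // the original model's behavior
          val update = matVec(b, matVec(a, x)) // the learned "enhancement"
          val scale = alpha / a.size           // alpha / r
          return DoubleArray(base.size) { i -> base[i] + scale * update[i] }
      }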

      5 votes