8 votes

Applying Chinese Wall Reverse Engineering to LLM Code Editing

1 comment

  1. whs

    Hi Tildes!

    Last month skybrain posted The Common Pile - an LLM trained on openly licensed data.

    I posted that it might not pass my threshold, since it may still output copyrighted code without proper attribution, but at least it might be possible to attribute to the entire training dataset. However, as the training set is small, you can't expect much quality from the model - it's a PoC, not something you would actually use. I then realized that you could use another LLM to coach the weaker model.

    So, instead of writing a comment on that thread or a blog post, I set out to figure out how to write my first paper.

    The short summary: in the benchmark, I managed to improve Comma by ~20% by asking Gemini 2.5 Pro to write a detailed task description first.
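    For illustration, here is a minimal sketch of that two-stage pipeline, assuming both models sit behind OpenAI-compatible chat endpoints. The URLs, model names, and prompts are my own placeholders, not anything from the paper:

    ```python
    import requests

    # Hypothetical endpoints and model names - adjust to your own setup.
    SPEC_WRITER_URL = "https://strong-model.example/v1/chat/completions"
    IMPLEMENTER_URL = "http://localhost:11434/v1/chat/completions"

    def chat(url: str, model: str, prompt: str) -> str:
        """Single-turn request against an OpenAI-compatible chat endpoint."""
        resp = requests.post(
            url,
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def chinese_wall_edit(task: str, code: str) -> str:
        # Stage 1: the strong model acts as the "spec writer". It never emits
        # code, only a detailed natural-language description of the change.
        spec = chat(
            SPEC_WRITER_URL,
            "gemini-2.5-pro",
            "Without writing any code, describe in precise, step-by-step detail "
            f"how to accomplish this task:\n\n{task}\n\nCurrent code:\n{code}",
        )
        # Stage 2: the openly licensed model sees only the spec, so the code
        # it produces derives only from its openly licensed training data.
        return chat(
            IMPLEMENTER_URL,
            "comma-v0.1",
            "Implement exactly the following specification. Output only code.\n\n"
            f"{spec}\n\nCurrent code:\n{code}",
        )
    ```

    The point of the separation is the clean-room property: the strong model never shows the weak model any code, only a specification, mirroring Chinese wall reverse engineering.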

    Would I use this? I think there are two problems left to solve:

    1. I need a better model than Comma. Right now it can't even get the Fast Inverse Square Root algorithm right (the reference version is sketched after this list). Maybe the way Starcoder2:Instruct was made could be applied to make a Comma:Instruct?
    2. Which tool should I implement this in? Adding a new mode to an AI coding tool takes some work, so I need one that works well with self-hosted models and is open source. Maybe Aider or Roo Code? I tried Claude Code last night and it didn't write a single line of code with Gemma3:27b and a 72k context window.
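    For reference, this is the well-known Quake III fast inverse square root that Comma stumbles on, translated to Python by me (the bit reinterpretation via struct is my translation choice; the magic constant and the Newton step are the standard originals):

    ```python
    import struct

    def fast_inverse_sqrt(x: float) -> float:
        # Reinterpret the float32 bit pattern as an unsigned integer.
        i = struct.unpack("<I", struct.pack("<f", x))[0]
        # The famous magic-constant-and-shift initial guess.
        i = 0x5F3759DF - (i >> 1)
        # Reinterpret the integer bits back as a float32.
        y = struct.unpack("<f", struct.pack("<I", i))[0]
        # One Newton-Raphson refinement step.
        return y * (1.5 - 0.5 * x * y * y)
    ```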
    3 votes