I haven't gone super deep into this project specifically, and it looks like it's pretty early days in some ways, but the "crowdsource a meaningful, open source training dataset" that they're doing right now seems valuable regardless of what happens next. Even if they as an organisation don't manage to hit the end goal, the data's going to be valuable to the field as a whole.
More general thoughts:
They're right that the InstructGPT paper lays out a feasible (and proven) structure for how all this can work. ChatGPT isn't quite identical to that, but OpenAI themselves describe it as a "sibling model" - it's reasonable to think the open source community can match and exceed it if they can get the resources to do so.
Running large language models is expensive, training them even more so. Creating a model on the scale of GPT-3 is a $3-5m investment purely in compute time, even if all of the R&D, training data, code, etc. is donated for free by the community.
Luckily, there's at least one publicly funded model already available at that scale, along with a fine-tuned version designed for instructions rather than completions.
Runtime cost is still in the $10k/month range, which puts some pretty significant limits on how this can be used despite the model being freely available.
Open Assistant are targeting something that'll run in 24GB VRAM (i.e. a single consumer GPU, albeit one at the top end of the market). Doing that while maintaining quality of results is IMO by far the hardest thing on their roadmap, and is the one thing I see that is so far totally unproven.
There seems to be ongoing discussion about training their own smaller model from scratch vs modifying and fine-tuning existing large models. I haven't dug into that too much, but the conversations are happening and both options are being entertained, which seems sensible.
Externally, there are also projects like Petals designed for running large models in a P2P distributed way. I personally think this is a more feasible option for opening this tech up to home users, and it's also been discussed on the project issues, but again it's early days.
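To put rough numbers on the 24GB VRAM point, here's a back-of-envelope sketch of how much memory the weights alone need at different precisions. It assumes a GPT-3-scale model of 175 billion parameters and counts only the weights (no activations, no attention cache, no framework overhead), so real usage would be higher. The function names are just ones I made up for illustration.

```python
# Back-of-envelope estimate: memory needed to hold model weights only.
# Assumptions (not from any project's docs): 1 GB = 1e9 bytes, and nothing
# besides the weights is counted, so these are lower bounds.

GB = 1e9


def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Memory in GB needed to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8 / GB


def max_params_billions(vram_gb: float, bits_per_param: int) -> float:
    """Largest model (in billions of parameters) whose weights fit in vram_gb."""
    return vram_gb * GB * 8 / bits_per_param / 1e9


if __name__ == "__main__":
    gpt3_scale = 175e9  # parameter count of a GPT-3-scale model
    for bits in (16, 8, 4):
        print(
            f"{bits:>2}-bit: {weight_gb(gpt3_scale, bits):.1f} GB for 175B params; "
            f"24 GB holds about {max_params_billions(24, bits):.0f}B params"
        )
```

At 16 bits per weight a 175B-parameter model comes to about 350 GB, and even at 4 bits it's still well over 24 GB, which is why fitting on one consumer GPU means either a much smaller model or spreading the weights across machines.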
This is a cool idea. Anyone with machine learning expertise care to comment on whether it's likely to work?
Link to the video introducing the project.