Looking to Llama. Help?
Hi folks
I'm progressing a project but I could use some insights.
I need to teach an LLM (preferably one that is open source and locally hostable) information about TV shows. I plan on using the show name, title, running time, episode count per series/season, and full transcripts.
Where do I even start?
Pointers to sites to learn to do this would be much appreciated. If anyone can summarise how I need to prep the data then that would be a bonus too.
Bonus points for a Llama GUI that can be network hosted and allows different people to connect as individuals, a little like the ChatGPT interface.
Thank you in advance.
I remember someone asking a similar question on HN and I bookmarked some things.
There are probably a few more systems like that out there, but I never really got around to trying them out.
As a bit of a reality check, you phrased your question in such a way that it looks to me like you are starting from absolute scratch, possibly with some overly optimistic expectations. Most of what I have seen out there so far is still fairly rough around the edges and requires a lot of reading and a TON of trial and error. Not to mention that a lot of the guides out there are written by people so deeply into things that they fit the "draw the rest of the fucking owl" meme well.
The simpler systems are... well, simple. However, if this is purely for personal use, you just want to see what is possible, and you happen to have a modern Nvidia graphics card, you could try out the demo tool they brought out a few weeks ago: https://www.nvidia.com/en-gb/ai-on-rtx/chat-with-rtx-generative-ai/
Seconding the RAG approach.
Compared to fine-tuning, you’re much better off system-prompting a base Llama model: “You will be asked about TV shows. Structure a query using this provided format to fetch yourself information about the listed series. Once you have that, answer the end user’s question.”
If you want actual reliable output, you’ll also want to use GBNF constraints to force the model output to conform to your RAG query structure.
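Roughly like this with llama-cpp-python, as a sketch; the model path, grammar, and query fields here are just made-up illustrations, not a canned recipe:

```python
# Sketch: force Llama's output into a small JSON query shape with a GBNF grammar.
# Assumes llama-cpp-python; the model path, grammar, and query fields are made up.
from llama_cpp import Llama, LlamaGrammar

QUERY_GRAMMAR = r'''
root   ::= "{" ws "\"field\":" ws field "," ws "\"value\":" ws string ws "}"
field  ::= "\"title\"" | "\"runtime\"" | "\"topic\""
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="llama-2-13b-chat.Q4_K_M.gguf", n_ctx=4096)
grammar = LlamaGrammar.from_string(QUERY_GRAMMAR)

system = ("You will be asked about TV shows. Structure a query in the provided "
          "JSON format to fetch information about the listed series.")
user = "Do you have anything around 60 minutes long that talks about cars?"

out = llm(
    f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]",
    grammar=grammar,   # output is constrained to match the grammar
    max_tokens=128,
)
print(out["choices"][0]["text"])  # e.g. {"field": "topic", "value": "cars"}
```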
Get ollama at: https://ollama.com/
And Open WebUI: https://github.com/open-webui/open-webui
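Once ollama is running, you can also hit its local HTTP API directly; a rough sketch (the model name is whatever you have pulled):

```python
# Sketch: query a locally running ollama server (default port 11434).
# Assumes you've already pulled a model, e.g. `ollama pull llama2`.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Summarise this episode transcript in three sentences: ...",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])
```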
Can you provide some more details about what you want to actually do? Generally speaking, the thing you want to look into is called fine-tuning, and in the case of LLaMA you probably specifically want to look at LoRA fine-tuning because of the model size.
Here's a guide that is not too bad that will give you the general outline, but more specifics about your use case might lead to better answers.
https://deci.ai/blog/fine-tune-llama-2-with-lora-for-question-answering/
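If it helps, the core of a LoRA setup with the Hugging Face peft library looks roughly like this; the model name and hyperparameters are just illustrative, not taken from that guide:

```python
# Sketch: attach LoRA adapters to a Llama-2 checkpoint with peft.
# Hyperparameters and the dataset/training loop are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora_cfg = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trainable
# ...then train as usual, e.g. with transformers.Trainer or trl's SFTTrainer.
```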
Here's the most common web-accessible GUI:
https://github.com/oobabooga/text-generation-webui
If you want a more fully featured interface, it's possible to use custom models in this system
https://open-assistant.io/
Fine-tuning a legitimately useful Llama model is prohibitively expensive for anyone without multiple 4090s hanging around.
The tutorial you linked was burning 12GB of GPU memory for fine-tuning a heavily-quantized 7B model, and OP doesn’t seem to have a suitable training set for LoRA in the first place.
It's quite possible to get useful results using the 13B model without exotic hardware. A full 70B model can be fine-tuned with dual 3090s, but whether that is even necessary depends on the use case.
I don’t think that’s accurate. Here’s a good reference:
https://github.com/taprosoft/llm_finetuning/blob/main/benchmark/README.md
They went OOM attempting to train a full-size 13B on an A100 (40 GB memory).
Mistral and maybe Vicuna are the only small models I’ve gotten decent output from, and even then they’re quite spotty. I can’t imagine a heavily quantized 13B fine-tune working from a near one-shot corpus putting out anything besides hot garbage.
Any links to your 70b mention? If a strategy capable of that exists I’d love to know about it.
Well, I guess I disagree. I've gotten excellent results from quantized 13B models, which I have trained on a single 3090 with 24GB of memory. It's even possible to fine-tune the Llama 1 30B model on a 3090.
It's important to have a well-defined use case and good training data, but there are a lot of tricks for generating that training data, e.g. the way the Stanford Alpaca set was produced from GPT output.
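For reference, the sort of 4-bit load that makes a 13B fine-tune fit on a 3090 looks roughly like this; a sketch with bitsandbytes, where the model name and settings are just examples:

```python
# Sketch: load a 13B model in 4-bit so a LoRA fine-tune fits in 24 GB of VRAM.
# Settings are illustrative; pair this with a peft LoraConfig as usual.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_cfg,
    device_map="auto",
)
```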
Let me break it down, as this is just a concept right now, coming from an ask by the powers above.
Example: We have 5 TV shows, multiple episodes of each, sitting and waiting for resale.
A potential buyer may ask whether we have a show with a given name, or a show that is 60 minutes in length, or one that talks about cars.
Using the media information exported to JSON (which gives length, show name, and lots of other metadata) and the exported transcript of every single show, I want to feed all of this to an AI so that it can be queried from the buyer's perspective. It should then be able to answer the question.
I have all of the information, but it's in either JSON or VTT caption files. Formatting them to give to a model to ingest and spit back info from shouldn't be too difficult.
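Roughly the kind of stitching I have in mind (the file names and MediaInfo JSON keys here are guesses, so they'd need adjusting to the real output):

```python
# Sketch: merge MediaInfo JSON metadata with a Whisper VTT transcript
# into one plain-text document per show, ready to embed or index.
# File names and the JSON keys are placeholders.
import json
import webvtt  # pip install webvtt-py

def build_document(json_path: str, vtt_path: str) -> str:
    with open(json_path) as f:
        meta = json.load(f)
    general = meta["media"]["track"][0]  # assumed: MediaInfo's "General" track comes first
    transcript = " ".join(caption.text for caption in webvtt.read(vtt_path))
    return (
        f"Title: {general.get('Title', 'unknown')}\n"
        f"Duration: {general.get('Duration', 'unknown')}\n"
        f"Transcript: {transcript}\n"
    )

print(build_document("show1.json", "show1.vtt")[:500])
```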
That's really it. I'm doing this offline, I don't want to feed or pay for any third parties right now as it is all concept.
Fun fact: I know very little about AI models outside of Generative AI for images using Stable Diffusion.
Oh, that is why this topic smelled familiar. It is a continuation of the previous one, isn't it?
Why not include that as context, including what your research has turned up so far? The way you phrased your topic really makes it seem like you are starting out from scratch, but that clearly isn't the case :)
Including that sort of context helps people help you, as they have a much better frame of reference for answering your question.
It is.
So, my colleague has cracked using Tesseract to scan the videos and pull the clock cards. We have those. This bit is kind of niche, but if anyone wanted that code I'm sure he'd happily share.
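To give the general shape of it (not his actual code, just a sketch with OpenCV and pytesseract):

```python
# Sketch: OCR text (e.g. a clock card) from a frame of a video.
# Not the actual code mentioned above, just the general shape of it.
import cv2
import pytesseract

cap = cv2.VideoCapture("show1.mp4")
cap.set(cv2.CAP_PROP_POS_FRAMES, 100)  # jump to a frame where the card appears
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    print(pytesseract.image_to_string(gray))
cap.release()
```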
MediaInfo grabs all of the data one could desire about the videos themselves. That's the JSON text file.
Whisper is doing the caption VTT export for us, so that's done.
Essentially, we now have data, we just don't know how to feed it to an LLM. YouTube has been a bit hit or miss, and the budget hasn't appeared to hire a professional ML/AI data scientist/engineer. It's amazing the titles these people have when you go down that rabbit hole.
I'm going to be spinning up Oobabooga to play with, but that still leaves the training-vs-RAG dilemma and how to go about it.
I would probably be inclined to use an LLM to summarize the transcripts and then search the summaries. I'm not very sold on the idea of using RAG; I'd probably be using LLMs as part of a pipeline to ingest data into something like Solr or Elasticsearch, while possibly using an LLM to translate human-language requests into a formal search language.
The reason I'm not super sold on RAG is that it's great when a small subset of supplemental data will help give you a good result, but less so when the desired result is a list of all responsive supplemental data. Anyhow, just my two cents.
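As a sketch of what I mean (the index name, fields, and the summarise() helper are placeholders):

```python
# Sketch: index LLM-generated summaries into Elasticsearch, then search them.
# The summarise() helper and the index/field names are hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def summarise(transcript: str) -> str:
    # Placeholder: call your local LLM here to condense the transcript;
    # plain truncation stands in for a real summary in this sketch.
    return transcript[:1000]

def ingest(show_id: str, metadata: dict, transcript: str) -> None:
    es.index(index="shows", id=show_id, document={
        **metadata,  # runtime, title, episode count, ...
        "summary": summarise(transcript),
    })

def search(query: str) -> list:
    hits = es.search(index="shows", query={"match": {"summary": query}})
    return [h["_source"] for h in hits["hits"]["hits"]]
```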
Ollama is very nice for running inference quickly. There's also LM Studio, which is free but not open source.
Have you had any success with Llama so far? Recently I've played around with it briefly (Llama v2 7B and 13B), but it was failing at a simple task: fixing the capitalization of article titles to suit Tildes, i.e., lowercasing most words instead of Capitalizing Every Word. Maybe I'd have to train it, or get better at prompts.
I tried various prompts, but basically it will either lowercase everything or nothing.
When I ask it to point out the proper nouns and abbreviations, it succeeds. So it can do half of the task, but it doesn't seem to handle the two steps together.
Edit: Ha, I seem to have found a better prompt. Rubber ducking works! "Rewrite the title in sentence case"
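For anyone who wants to try it, roughly like this via llama-cpp-python (the model path and example title are placeholders):

```python
# Sketch: trying the prompt that finally worked against a local model
# via llama-cpp-python (the model path is a placeholder).
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.Q4_K_M.gguf", n_ctx=2048)
out = llm.create_chat_completion(messages=[
    {"role": "user",
     "content": "Rewrite the title in sentence case: "
                "Scientists Discover New Species Of Deep Sea Fish"},
])
print(out["choices"][0]["message"]["content"])
```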