I remembered someone asking a similar question on HN and bookmarked some things.
You likely don't need to train the model. What you need is a RAG system (“Retrieval-Augmented Generation”), which basically feeds your information to the LLM in a way it can easily crawl through and use without being specifically trained on it.
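To make that concrete, here's a minimal sketch of the idea in Python (the keyword-overlap retriever is deliberately naive, and `ask_llm` is just a placeholder for whatever local model you end up running):

```python
# Minimal RAG sketch: retrieve the most relevant snippets, then stuff them
# into the prompt so the model can use information it was never trained on.
# `ask_llm` is a placeholder for whatever local inference you use.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your local model here (ollama, llama.cpp, ...)")

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval; real systems use embeddings + a vector store."""
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer(question: str, documents: list[str]) -> str:
    context = "\n---\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return ask_llm(prompt)
```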
There are probably a few more systems like that out there, but I never really got around to trying any of them out.
As a bit of a reality check: the way you phrased your question makes it look to me like you are starting from absolute scratch, possibly with some overly optimistic expectations. Most of what I have seen out there so far is still fairly rough around the edges and requires a lot of reading and a TON of trial and error. Not to mention that a lot of the guides out there are written by people so deep into things that they fit the "draw the rest of the fucking owl" meme well.
The simpler systems are... well, simple. But if this is purely for personal use and you just want to see what is possible, and you happen to have a modern Nvidia graphics card, you could try out the demo tool they brought out a few weeks ago: https://www.nvidia.com/en-gb/ai-on-rtx/chat-with-rtx-generative-ai/
Seconding the RAG approach.
Compared to fine-tuning, you’re much better off system-prompting a base Llama model: “You will be asked about TV shows. Structure a query using this provided format to fetch yourself information about the listed series. Once you have that, answer the end user’s question.”
If you want actual reliable output, you’ll also want to use GBNF constraints to force the model output to conform to your RAG query structure.
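For illustration, a grammar-constrained call through the llama-cpp-python bindings might look something like this (the JSON query format, model path, and example question are all made up for the sketch):

```python
# Sketch of GBNF-constrained output with llama-cpp-python.
# The "query" format and model path below are illustrative placeholders.
from llama_cpp import Llama, LlamaGrammar

grammar = LlamaGrammar.from_string(r'''
root  ::= "{\"show\": \"" name "\", \"field\": \"" field "\"}"
name  ::= [A-Za-z0-9 ]+
field ::= "episodes" | "cast" | "synopsis"
''')

llm = Llama(model_path="path/to/llama-2-13b.Q4_K_M.gguf")  # placeholder path
out = llm(
    "User question: who stars in Severance?\nEmit a lookup query:",
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])  # guaranteed to match the grammar above
```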
Can you provide some more details about what you want to actually do? Generally speaking, the thing you want to look into is called fine-tuning, and in the case of LLaMA you probably specifically want to look at LoRA fine-tuning because of the model size.
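For a rough sense of what LoRA fine-tuning looks like in code, here's a minimal sketch using Hugging Face's transformers and peft libraries (the model name, target modules, and hyperparameters are illustrative assumptions, not a recommendation):

```python
# Minimal LoRA setup with Hugging Face peft/transformers.
# Model name and hyperparameters are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated repo; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of weights get trained
# ...then train with transformers.Trainer (or trl's SFTTrainer) on your dataset.
```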
Here's a guide that is not too bad and will give you the general outline, but more specifics about your use case might lead to better answers:
https://deci.ai/blog/fine-tune-llama-2-with-lora-for-question-answering/
Here's the most common web-accessible GUI interface:
https://github.com/oobabooga/text-generation-webui
If you want a more fully featured interface, it's possible to use custom models in this system:
https://open-assistant.io/
Fine-tuning a legitimately useful Llama model is prohibitively expensive for anyone without multiple 4090s hanging around.
The tutorial you linked was burning 12GB of GPU memory for fine-tuning a heavily-quantized 7B model, and OP doesn’t seem to have a suitable training set for LoRA in the first place.
It's quite possible to get useful results using the 13B model without significant hardware. A full 70B model can be fine tuned with dual 3090s, but whether that is even necessary depends on the use case.
I don’t think that’s accurate. Here’s a good reference:
https://github.com/taprosoft/llm_finetuning/blob/main/benchmark/README.md
They went OOM attempting to train a full-size 13B on an A100 (40 GB of memory).
Mistral and maybe Vicuna are the only small models I’ve gotten decent output from, and even then they’re quite spotty. I can’t imagine a heavily quantized 13B fine-tune working from a near one-shot corpus putting out anything besides hot garbage.
Any links to your 70B mention? If a strategy capable of that exists, I’d love to know about it.
Well, I guess I disagree. I've gotten excellent results from quantized 13B models, which I have trained on a single 3090 with 24GB memory. It's even possible to fine tune the Llama1 30B model on a 3090.
It's important to have a well-defined use case and good training data, but there are a lot of tricks you can do to get that training data, e.g. how the Stanford Alpaca set came from GPT.
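As a sketch of that sort of trick (Alpaca-style: have a stronger model write the instruction/response pairs for you; the model name and prompt here are placeholders, not the actual Alpaca pipeline):

```python
# Sketch of Alpaca-style training-data generation: ask a stronger model for
# instruction/response pairs, then fine-tune the small model on them.
# Model name and prompt are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_examples(topic: str, n: int = 5) -> list[dict]:
    prompt = (
        f"Write {n} instruction/response pairs about {topic} "
        'as a JSON list of objects with "instruction" and "response" keys.'
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice you'd validate/clean the JSON before trusting it.
    return json.loads(reply.choices[0].message.content)

with open("train.jsonl", "w") as f:
    for ex in generate_examples("summarizing TV episode transcripts"):
        f.write(json.dumps(ex) + "\n")
```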
Oh, that is why this topic smelled familiar. It is a continuation of the previous one, isn't it?
Why not include that as context, along with what your research has turned up so far? The way you phrased your topic now really makes it seem like you are starting out from scratch, but that clearly isn't the case :)
Including that sort of context helps people help you, as they have a much better frame of reference for answering your question.
I would probably be inclined to use an LLM to summarize the transcripts and then search the summaries. I'm not very sold on the idea of using RAG -- I'd probably use LLMs as part of a pipeline to ingest data into something like Solr or Elasticsearch, while possibly using an LLM to translate human-language requests into a formal search language.
The reason I'm not super sold on RAG is that RAG is great when you have a problem where a small subset of supplemental data will help give you a good result, but less so when the desired result is a list of all responsive supplemental data. Anyhow, just my two cents.
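A bare-bones version of that pipeline might look like this (Elasticsearch picked arbitrarily over Solr, and `summarize` is a placeholder for whatever local model you call):

```python
# Sketch of an ingest pipeline: summarize each transcript with an LLM,
# index the summary in Elasticsearch, then search over the summaries.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def summarize(text: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

def ingest(episode_id: str, transcript: str) -> None:
    es.index(
        index="episode-summaries",
        id=episode_id,
        document={"summary": summarize(transcript), "transcript": transcript},
    )

def search(query: str) -> list[str]:
    hits = es.search(index="episode-summaries", query={"match": {"summary": query}})
    return [hit["_id"] for hit in hits["hits"]["hits"]]
```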
Have you had any success with Llama so far? Recently I've played around with it briefly (Llama v2 7B and 13B), but it was failing at a simple task: fixing the capitalization of articles to suit Tildes. I.e., lowercase most words instead of Capitalizing Every Word. Maybe I'd have to train it. Or get better at prompts.
Yesterday I gave Llama this title: "Apple Releases macOS Sonoma 14.4.1 With Fix for USB Hub Bug"
And tried various prompts like:
>>> treat Apple as a proper noun and rewrite the title again, with common nouns and non-abbreviations in lowercase, and capitalizing proper
... nouns and abbreviations
Certainly! Here is the rewritten title, treating "Apple" as a proper noun and using lowercase for common nouns and non-abbreviations,
and capitalizing proper nouns and abbreviations:
Apple Releases macOS Sonoma 14.4.1 with Fix for USB Hub Bug
Basically it will either lowercase everything or nothing.
When I ask it to point out the proper nouns and abbreviations, it succeeds:
Capitalized Proper Nouns and Abbreviations:
* Apple (proper noun)
* MacOS (proper noun)
* Sonoma (proper noun)
* USB (abbreviation)
So it can do half of the task, but it doesn't seem to handle two steps.
Edit: Ha, I seem to have found a better prompt. Rubber ducking works! "Rewrite the title in sentence case"
Get ollama at: https://ollama.com/
And openwebui: https://github.com/open-webui/open-webui
Ollama is very nice for running inference quickly. There's also LMStudio, which is free but not open source.