Let's talk Local LLMs - So many questions
Hello there
(oh god, I am opening my first thread here - so exciting)
I'd love to ask the people here about local LLMs.
To be honest, I got interested in this topic, but I am leaving reddit, where the r/LocalLLaMA sub exists.
I don't want to interact with that site anymore, so I am taking this here.
My questions, to start us off:
- Models are available on huggingface (among other places), but where do I get the underlying software? I read about "oobabooga" somewhere, but honestly, I am lost.
- If I only want to USE a local model, what are the requirements, and how do I judge whether I can run something based on values like "4 bit / 8 bit" and "30B, 7B"?
- If I get crazy and want to TRAIN a LoRA ... what then?
- Good resources / wiki pages, tutorials, etc?
Perhaps it'd be good for you to state where you are right now, because you seem completely green. Nothing wrong with that, but if you know how to program, particularly python, that'd be useful.
For clarification: I've never heard "local LLM" as a technical term, so I assume you just want to run an LLM locally?
Last time I ran a not-so-large language model locally, it was as simple as following the model's guide on huggingface: install python, pip install this or that, then run the python snippet they give you and you have working code.
I've found this: https://huggingface.co/declare-lab/flan-alpaca-gpt4-xl
Seems to be somewhat popular. Certainly enough to get your feet wet. I haven't tried it, but I'd think it's as simple as installing python, pip-installing "transformers" and running the code you see. Probably wanna use a venv if you know what that is, but if not, whatever. Running the code can be as simple as pasting the snippet in "Usage" into a file.py, and then running "python file.py" from the terminal. Huggingface tends to be very out-of-the-box like that. If you're not getting any results at all, as in, the process just takes forever, try a smaller model. Might be that the RAM needs exceed what your system can reasonably handle, or that it's trying to run it on the CPU.
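To make that concrete, here is roughly what such a file.py could look like for the model above. This is only a minimal sketch, assuming you've pip-installed transformers plus torch; the prompt and the generation settings are placeholders I made up:

from transformers import pipeline

# Load the model linked above; the first run downloads a few GB of weights.
generator = pipeline(
    "text2text-generation",                  # flan-alpaca is a seq2seq (T5-style) model
    model="declare-lab/flan-alpaca-gpt4-xl",
)

prompt = "Explain in one sentence what a local LLM is."
result = generator(prompt, max_length=128, do_sample=True)
print(result[0]["generated_text"])

If that crawls or eats all your RAM, swap in a smaller variant of the model, as mentioned above.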
BTW, NVidia or other GPU?
Edit: As for the other questions: the whole thing about bits and bytes is mostly about memory requirements. You can run the model on the CPU if need be. It will be slower, but still fine. Once the model size outscales your available RAM or GPU memory, you'll start to get into big trouble. I'm not completely confident in how to compute the memory footprint, but as a ballpark figure, assume 4 bytes per parameter, and leave at least a bit of headroom. If the model explicitly states it uses 16-bit floats, then 2 bytes per parameter. Parameter counts are usually stated as 30M or 3B or so.
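If you want to turn that rule of thumb into numbers, a quick back-of-the-envelope helper looks something like this (ballpark only, weights alone, no headroom included):

def estimate_memory_gib(n_params: float, bytes_per_param: int = 4) -> float:
    # bytes per parameter times parameter count, converted to GiB
    return n_params * bytes_per_param / (1024 ** 3)

# a 7B model stored as 16-bit floats vs. a 30B model at full 32-bit precision
print(f"7B  @ 2 bytes/param: ~{estimate_memory_gib(7e9, 2):.0f} GiB")
print(f"30B @ 4 bytes/param: ~{estimate_memory_gib(30e9, 4):.0f} GiB")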
If you want to train one yourself, you better bring a lot of computational firepower. Like, preposterous amounts. Probably better to just rent it from Google or Amazon. But before that, you want to get your feet wet by actually training models locally on a much smaller scale.
Hope that helps, let me know how it goes.
Oh... I am completely green when it comes to using an LLM locally.
But.. reddit is now off limits for me, so thank you for helping.
I am not computer illiterate, some python is in my toolbox, and yes, I know what a venv is.
Thanks for the pointers, I will look at it.
NVidia
NVidia is good, as most of the tooling is based on CUDA. That means you can use your GPU, for example to train small models.
Any background in machine learning, or just a bit of programming?
See also my last post again, I added an edit.
Some background in machine learning, but it is decades old.
So basically, no.
I mentioned oobabooga, as it gives a UI-based, less code-oriented entry point.
Alright. I'm afraid I can't help you much with oobabooga. But it seems to have decent documentation, and a backend for transformers, which is the library used by the model I linked previously.
Also, to clarify my previous post: training is only really prohibitively expensive if you insist on doing it from scratch. Fine-tuning (e.g. LoRA adaptation) a pre-trained model is reasonably doable on a good GPU. But you really need to make sure your entire model and at least a batch of training data fit on the GPU. You probably want to have a clearer goal in mind here than just "train it", unless you're doing it for the learning experience. What do you want it to learn? Where do you get the data? Hasn't someone already done that? Which pre-trained model do you use as a starting point?
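For reference, the LoRA route doesn't need much code with the peft library. This is only a sketch under assumptions: the base model (facebook/opt-350m) and the target_modules below are placeholders you'd swap for whatever you actually fine-tune, and the right module names depend on the model's architecture.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder base model

lora_cfg = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the adapter
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the small adapter weights get trained

The point being: the frozen base model still has to fit in GPU memory; the adapter itself is tiny.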
No problem there, as this thread is not only for me.
I appreciate your help.
Yeah, I actually have no real goal for training in mind (yet), but I am curious - so yeah, learning experience.
Also - I did not want to train an LLM from scratch. I might be crazy, but not that crazy.
What is a "good" GPU in your mind?
I am running SD on an Nvidia GTX 1080ti, and with the current prices... that is my setup.
The very good news there is that you’ve got a respectable amount of VRAM. Memory optimisation is a key field in ML right now, but there are still limits: if your optimised model won’t fit in the available GPU memory, the best case is paging bits of it in and out, which can easily mean an order-of-magnitude slowdown; the worst case is it just won’t run at all.
A newer card with 8GB of VRAM might be, say, 40% faster (more CUDA cores, higher clock, etc.) if they’re both running a 6GB model, but I’d expect that to hit a brick wall when it comes to a 10GB model that the 1080ti could still potentially handle fine.
1080ti sounds decent. You won't be able to train everything you want on it, but it'll certainly do to get your feet wet.
If I were you, I'd explore the oobabooga docs for training. I found this: https://github.com/oobabooga/text-generation-webui/blob/main/docs/Training-LoRAs.md
My personal preference would be to just code the damn training thing myself, but I'm also "Not Invented Here Syndrome" personified. Actually that's not quite true. I'm perfectly willing to use what other people built, I just prefer fine-grained control and I like to understand what's going on under the hood, so I tend to from-scratch things sometimes just to learn.
I feel you, but... yeah, the thing is that I used to be a (mediocre) developer.
Been working as a project manager for over 15 years now, and my coding skills did not get an upgrade.
I've experimented with local LLMs a bit.
What worked well for me was to use koboldcpp, which bundles llama.cpp with a GUI you can use to interact with the model. It is mostly geared towards story-writing, but you can also use SillyTavern with it, if you prefer chat-based interaction.
As for getting actual LLM models for use with it, TheBloke has converted a ton of different models to the ggml format that llama.cpp uses. WizardLM and Guanaco, for example, are widely recognized as being very good; I like Guanaco a bit better than WizardLM. The big downside is that, with low VRAM, the models need to mostly run on the CPU, and hence the entire thing is slow.
The 65B Guanaco model is too slow for interactive use on my machine, even when loading several layers onto the GPU, but the smaller models are faster -- you'll have to experiment to find the sweet spot between responsiveness and quality.
There's also exllama for when you have enough VRAM to fit the entire model. My 8GB are enough to load quantized 7B parameter models, and generation speed is faster than I can read.
As a last note: if you're using the models for dialogue or as an assistant, you need to check the model cards for the correct prompt format, as different models need to be prompted in different ways for best results.
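To illustrate what I mean by prompt format, here are two common templates you'll see on model cards. The exact strings below are just illustrative assumptions; always copy the format the card actually specifies for your model.

# Alpaca-style instruction template
ALPACA_STYLE = "### Instruction:\n{instruction}\n\n### Response:\n"

# Vicuna-style chat template
VICUNA_STYLE = "USER: {instruction}\nASSISTANT: "

def build_prompt(template: str, instruction: str) -> str:
    # fill the model's expected template with your actual request
    return template.format(instruction=instruction)

print(build_prompt(VICUNA_STYLE, "Write a limerick about VRAM."))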
Thank you very much! I’ll take a look.
Last night I decided that I wanted to try PrivateGPT, and followed a youtube video (https://youtu.be/A3F5riM5BNE). That video tries to explain the process of installing PrivateGPT from https://github.com/imartinez/privateGPT, and uses the Vicuna LLM (https://huggingface.co/vicuna/ggml-vicuna-13b-1.1/blob/main/ggml-vic13b-q5_1.bin).
It took a lot of futzing around with python and visual studio community to try and get it all to build (that process wasn't covered by the video). I finally got it built and running, ingested the DnD 5e SRD, and asked my first test question - something like 'can you read this?'. It took 600 seconds to reply with something along the lines of 'sure I can'. That's when I realized that my user experience was going to be less than optimal. Anyway, I learnt a huge amount through the sorta-trial-and-error, and I recommend that you do the same :)
Interesting, I will look at that.
I am not a programmer, so I have no experience with Python or any other programming language, but I was able to follow this to get Wizard 30B up and running. It's from this reddit thread; full credit to the OP on Reddit, /u/YearZero. Just pasting it here since you're not using Reddit anymore.
Incredibly simple guide to run language models locally on your PC, in 5 simple steps for non-techies.
TL;DR - follow steps 1 through 5. The rest is optional. Read the intro paragraph tho.
ChatGPT is a language model. You run it over the cloud. It is censored in many ways. These language models run on your computer, and your conversation with them is totally private. And it's free forever. And many of them are completely uncensored and will talk about anything, no matter how dirty or socially unacceptable, etc. The point is - this is your own personal private ChatGPT (not quite as smart) that will never refuse to discuss ANY topic, and is completely private and local on your machine. And yes it will write code for you too.
This guide is for Windows (but you can run them on Macs and Linux too).
1. Create a new folder on your computer.

2. Go here and download the latest koboldcpp.exe: https://github.com/LostRuins/koboldcpp/releases (as of this writing, the latest version is 1.29). Stick that file into your new folder.

3. Download a model. Here's a leaderboard spreadsheet that I keep up to date with the latest models: https://docs.google.com/spreadsheets/d/1NgHDxbVWJFolq8bLvLkuPWKC7i_R6I6W/edit?usp=sharing&ouid=102314596465921370523&rtpof=true&sd=true
Allow me to recommend a good starting model - a 7B parameter model that almost everyone will have the RAM to run: guanaco-7B-GGML. Direct download link: https://huggingface.co/TheBloke/guanaco-7B-GGML/resolve/main/guanaco-7B.ggmlv3.q5_1.bin (needs 7 GB of RAM to run on your computer).
Here's a great 13 billion parameter model if you have the RAM: Nous-Hermes-13B-GGML. Direct download link: https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/resolve/main/nous-hermes-13b.ggmlv3.q5_1.bin (needs 12.26 GB of RAM to run on your computer).
Finally, the best (as of right now) 30 billion parameter model, if you have the RAM: WizardLM-30B-GGML. Direct download link: https://huggingface.co/TheBloke/WizardLM-30B-GGML/resolve/main/wizardlm-30b.ggmlv3.q5_1.bin (needs 27 GB of RAM to run on your computer).

4. Put whichever .bin file you downloaded into the same folder as koboldcpp.exe.

5. Technically that's it: just run koboldcpp.exe, and in Threads put how many cores your CPU has. Check "Streaming Mode" and "Use SmartContext" and click Launch. Point to the model .bin file you downloaded, and voila.
Once it opens your new web browser tab (this is all local, it doesn't go to the internet), click on "Scenarios", select "New Instruct", and click Confirm.
You're DONE!
Now just talk to the model like ChatGPT and have fun with it. You have your very own large language model running on your computer, not using the internet or some cloud service or anything else. It's yours forever, and it will do your bidding *evil laugh*. Try saying stuff that goes against ChatGPT's "community guidelines" or whatever. Oh yeah - try other models! Explore!
Now, the rest is for those who'd like to explore a little more.
For example, if you have an NVIDIA or AMD video card, you can offload some of the model to that video card and it will potentially run MUCH FASTER!
Here's a very simple way to do it. When you launch koboldcpp.exe, click on "Use OpenBLAS" and choose "Use CLBlast GPU #1". Here it will ask you how many layers you want to offload to the GPU. Try putting 10 for starters and see what happens. If you can still talk to your model, try doing it again and raising the number. Eventually it will fail, and complain about not having enough VRAM (in the black command prompt window that opens up). Great, you've found your maximum layers for that model that your video card can handle, so bring the number down by 1 or 2 again so it doesn't run out of VRAM, and this is your max - for that model size.
This is very individual because it depends on the size of the model (7b, 13b, or 30b parameters) and how much VRAM your video card has. The more the better. If you have an RTX 4090 or RTX 3090 for example, you have 24 GB vram and you can offload the entire model fully to the video card and have it run incredibly fast.
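If you'd rather start from an estimate than pure trial and error, a rough heuristic is to assume the layers are all about the same size, so the fraction of the model file that fits in spare VRAM is roughly the fraction of layers you can offload. The numbers below (file size, layer count, VRAM, headroom) are made-up illustrations, not measurements:

def starting_gpu_layers(model_file_gb, total_layers, free_vram_gb, headroom_gb=1.0):
    # fraction of the model file that fits in spare VRAM ~= fraction of layers to offload
    usable = max(free_vram_gb - headroom_gb, 0.0)   # keep some VRAM free for the context
    fraction = min(usable / model_file_gb, 1.0)
    return int(fraction * total_layers)

# e.g. a ~9 GB 13B q5_1 file with 40 layers on an 8 GB card
print(starting_gpu_layers(9.0, 40, 8.0))   # ~31 layers as a first guess

Then nudge the number up or down as described above until it stops running out of VRAM.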
The next part is for those who want to go a bit deeper still.
You can create a .bat file in the same folder for each model that you have. All those parameters that you pick when you run koboldcpp.exe can be put into the .bat file so you don't have to pick them every time. Each model can have its own .bat file with all the parameters that you like for that model and that work with your video card perfectly.
So you create a file, let's say something like "Kobold-wizardlm-30b.ggmlv3.q5_1.bat"
Here is what my file has inside:
title koboldcpp
:start
koboldcpp ^
--model wizardlm-30b.ggmlv3.q5_1.bin ^
--useclblast 0 0 ^
--gpulayers 14 ^
--threads 9 ^
--smartcontext ^
--usemirostat 2 0.1 0.1 ^
--stream ^
--launch
pause
goto start
Let me explain each line:
Oh by the way the ^ at the end of each line is just to allow multiple lines. All those lines are supposed to be one big line, but this allows you to split it into individual lines for readability. That's all that does.
"title" and "start" are not important lol
koboldcpp ^ - that's the .exe file you're launching.
--model wizardlm-30b.ggmlv3.q5_1.bin ^ - the name of the model file
--useclblast 0 0 ^ - enabling CLBlast mode. The two numbers point to your system and your video card (the OpenCL platform and device). Occasionally they will be different for some people, like 1 0.
--gpulayers 14 ^ - how many layers you're offloading to the video card
--threads 9 ^ - how many CPU threads you're giving this model. A good rule of thumb is to put how many physical cores your CPU has, but you can play around and see what works best.
--smartcontext ^ - an efficient/fast way to handle the context (the text you communicate to the model and its replies).
--usemirostat 2 0.1 0.1 ^ - don't ask, just put it in lol. It has to do with clever sampling of the tokens that the model chooses to respond to your inquiry. Each token is like a tiny piece of text, a bit less than a word, and the model chooses which token should go next like your iphone's text predictor. This is a clever algorithm to help it choose the good ones. Like I said, don't ask, just put it in! That's what she said.
--stream ^ - this is what allows the text your model responds with to start showing up as it is writing it, rather than waiting for its response to completely finish before it appears on your screen. This way it looks more like ChatGPT.
--launch - this makes the browser window/tab open automatically when you run the .bat file. Otherwise you'd have to open a tab in your browser yourself and type in "http://localhost:5001/?streaming=1#" as the destination yourself.
pause
goto start - don't worry about these, ask ChatGPT if you must, they're not important.
Ok now the next part is for those who want to go even deeper. You know you like it.
So when you go to one of the models, like here: https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML/tree/main
You see a shitload of .bin files. How come there are so many? What are all those q4_0's and q5_1's, etc? Think of those as .jpg, while the original model is a .png. It's a lossy compression method for large language models - otherwise known as "quantization". It's a way to compress the model so it runs on less RAM or VRAM. It takes the weights and quantizes them, so each number, which was originally FP16, is now 4-bit, 5-bit, or 6-bit. This makes the model slightly less accurate, but much smaller in size, so it can easily run on your local computer. Which one you pick isn't really vital; it has a bigger impact on your RAM usage and inference speed (how fast you can interact with the model) than on its accuracy.
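If you're curious what taking FP16 weights down to 4 bits actually means, here's a toy sketch of the idea. Real schemes like q4_0/q5_1 work on blocks of weights and keep extra correction data, so this is only the gist, with made-up random weights:

import numpy as np

weights = np.random.randn(8).astype(np.float32)      # pretend these are the original float weights
scale = np.abs(weights).max() / 7                    # map onto a signed 4-bit grid (-8..7)
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
restored = quantized.astype(np.float32) * scale      # what inference effectively sees

print("original:", np.round(weights, 3))
print("restored:", np.round(restored, 3))
print("max error:", float(np.abs(weights - restored).max()))

Each weight now takes a few bits plus a shared scale instead of 16 bits, which is why the .bin files shrink so much while the answers only get slightly worse.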
A good rule of thumb is to pick q5_1 for any model's .bin file. When koboldcpp version 1.30 drops, you should pick q5_K_M instead; it's the new quantization method. This is bleeding edge and stuff is being updated/changed all the time, so if you try this guide in a month... things might be different again. If you wanna know how the q_whatever variants compare, you can check the "Model Card" tab on huggingface, like here:
https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML
TheBloke is a user who converts the most models into GGML and he always explains what's going on in his model cards because he's great. Buy him a coffee (also in the model card). He needs caffeine to do what he does for free for everybody. ALL DAY EVERY DAY.
Oh yeah - GGML is just a way to allow the models to run on your CPU (and partly on GPU, optionally). Otherwise they HAVE to run on GPU (video card) only. So the models initially come out for GPU, then someone like TheBloke creates a GGML repo on huggingface (the links with all the .bin files), and this allows koboldcpp to run them (this is a client that runs GGML/CPU versions of models). It allows anyone to run the models regardless of whether they have a good GPU or not. This is how I run them, and it allows you to run REALLY GOOD big models, all you need is enough RAM. RAM is cheap. Video cards like RTX 4090 are stupid expensive right now.
Ok this is the gist.
As always check out /r/LocalLLaMA/ for a dedicated community who is quite frankly obsessed with local models and they help each other figure all this out and find different ways to run them, etc. You can go much deeper than the depths we have already plumbed in this guide. There's more to learn, and basically it involves better understanding what these models are, how they work, how to run them using other methods (besides koboldcpp), what kind of bleeding edge progress is being made for local large language models that run on your machine, etc. There's tons of cool research and studies being done. We need more open source stuff like this to compete with OpenAI, Microsoft, etc. There's a whole community working on it for all our benefit.
I hope you find this helpful - it really is very easy, no code required, don't even have to install anything. But if you are comfortable with google colab, with pip installs, know your way around github, and other python-based stuff for example, well those options are there for you as well, and they open up other possibilities - like having the models interact with your local files, or create agents with the models so they all talk to each other with their own goals and personalities, etc.
Very nice! Thank you!
One resource I mentioned, but did not link:
https://github.com/oobabooga/text-generation-webui
That looks like a UI for other models, right? So basically, once you have an LLM up and running, you won't need to bother with editing the python code anymore and can interact with it from the browser. Useful, but I'd argue running the LLM first is what you'd want to focus on. This is just cherry-on-top kind of stuff.
Exactly. My guess (I am at work right now, so no installing here), is that this is for LLMs what Automatic1111 is for stable diffusion.
Which, at the moment, would be my sweet spot.
I am using GPT4All and it is working well: https://gpt4all.io/index.html
It is CPU-bound, so relatively slow, but it lets you choose your model.
I'll check that out as well.
I'll hijack this thread to ask my own questions:
Just googling a bit here,
These are interesting projects. Seems the Local LLM crowd is quite a bit more sophisticated than I first thought. Thanks for the insights :)
I'll add that GGML is also a format for models that run in llama.cpp. Quite a few models are available today. The format is in rapid flux though, since LLMs are still so experimental. Alternative quantization methods to GPTQ are still being tested, for example.
llama.cpp is also not strictly CPU-focused anymore. It received GPU support a couple of weeks ago, and will likely receive enhanced CUDA support in the near future.
GGML and llama.cpp are under constant development, largely as an experimental platform for the latest techniques in language models. It's pretty cool!
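If you ever want to drive llama.cpp from Python instead of through koboldcpp, the llama-cpp-python bindings (pip install llama-cpp-python) load the same GGML .bin files. A minimal sketch; the file path, layer count, context size, and prompt here are just assumptions to adapt to your setup:

from llama_cpp import Llama

llm = Llama(
    model_path="./guanaco-7B.ggmlv3.q5_1.bin",  # any GGML model file you downloaded
    n_ctx=2048,                                 # context window size
    n_gpu_layers=14,                            # offload some layers if built with GPU support
)

out = llm("USER: What is GGML, in one sentence?\nASSISTANT: ", max_tokens=128)
print(out["choices"][0]["text"])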
I'm not usually one for video tutorials (just give me a list of steps to follow, please), but I was struggling to figure out how to run an LLM until I came across this channel. It's pretty straightforward and aimed at people who aren't very technical, and I've found the videos incredibly helpful.
I don't think there's anything wrong with using a web UI off the bat, and the Oobabooga project you mentioned works great. The video I linked literally walks you through every step of setting it up, including downloading a model, so it'll get you up and running in no time.
As far as requirements go, like another poster mentioned, VRAM is the biggest bottleneck. I’ve got an 8GB card and I’ve struggled to run any 13B models. I can get them loaded in, but they run out of memory as soon as a question is asked. I’ve had good experience with 7B models, though. You can always offload some layers to the CPU, but that will drastically increase the time it takes for responses to be generated.
I’d definitely recommend poking around that channel. I swing back to it every now and again to see if anything new and exciting has popped up, and there’s usually something fun to play with every time.
My card is an odd 13GB model, so I got that going for me.
I'm also not one for video tutorials, but I'll still take a look or two.
Thank you!