20 votes

Just got an Nvidia 4090 GPU, looking for local LLM + general generative AI software recommendations

I was fortunate enough to grab a discounted 4090 while on my travels and just got everything installed. Already having a lot of fun pumping all my games to max settings, but I'm also interested in running generative AI stuff locally to really take advantage of all that VRAM.

Do you have any newbie-friendly Windows 11 software to recommend for getting started? Thanks!

24 comments

  1. [16]
    bioemerl
    Link

    With a 4090 you're in a weird spot. You've got two choices.

    The first is a 13B model. Which one you want depends on what you want to do. Like half of local models are used for role play, and the "model of interest" there is something called MythoMax. Use a program called oobabooga ("text gen web ui") for that.

    The other option is the larger 70B models. I'd recommend "Airoboros" for general use and "Chronos" for creative use. However, to use these you'll need to run part of the model on your CPU because you don't have enough VRAM; a program called koboldcpp will do that (rough sketch of the split at the end of this comment).

    And if you want to code, CodeLlama 34B is what you want.
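
    To give a feel for what that GPU/CPU split looks like, here's a rough sketch using the llama-cpp-python bindings (same idea as koboldcpp's GPU offload; the filename and layer count are just placeholders, not specific recommendations):

      from llama_cpp import Llama

      # Placeholder filename - use whichever quantized 70B GGUF you actually downloaded.
      llm = Llama(
          model_path="airoboros-70b.Q4_K_M.gguf",
          n_gpu_layers=45,  # however many layers fit in 24 GB of VRAM; the rest run on the CPU
          n_ctx=4096,
      )

      out = llm("Write a limerick about VRAM.", max_tokens=64)
      print(out["choices"][0]["text"])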

    10 votes
    1. [14]
      Jordan117
      Link Parent

      > With a 4090 you're in a weird spot.

      What makes you say that? I thought the 4090 was the most powerful consumer GPU by far. (Or are you comparing with the A100 et al?)

      4 votes
      1. [12]
        bioemerl
        Link Parent

        24 gigs of VRAM is a weird amount that doesn't help for most models. Two 3090s would be better.

        4 votes
        1. [11]
          Jordan117
          Link Parent

          Interesting -- I'm in the early stages of researching a new PC build for this stuff, and hadn't considered going with two (relatively) lower-powered cards. Thanks.

          1 vote
          1. [9]
            pbmonster
            Link Parent

            It pretty much only makes sense if you want to work a lot with LLMs for text generation or something like stable diffusion for image generation.

            For those use cases, you benefit enormously from having more VRAM, and the actual speed of the GPU isn't even that important - because the amount of VRAM determines how large your model is, which is absolutely critical for quality. The speed of the GPU just gives you the results a bit faster, which often doesn't even matter that much.
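
            The back-of-the-envelope math (rough numbers, weights only, ignoring context/KV-cache overhead) is just parameter count times bytes per weight:

              def model_vram_gb(params_billion: float, bits_per_weight: float) -> float:
                  """Rough weight-only memory estimate; real usage adds context and overhead."""
                  return params_billion * 1e9 * bits_per_weight / 8 / 1e9

              print(model_vram_gb(13, 4))  # ~6.5 GB -> a 13B at 4-bit fits easily in 24 GB
              print(model_vram_gb(70, 4))  # ~35 GB  -> a 70B at 4-bit needs CPU offload or two cards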

            If you're primarily interested in performance gaming, get the best card you can afford. If you're only a little interested in generative AI, do your first experiments on rented cloud hardware. $10 will get you an A100 for an entire afternoon, enough time to properly play around with a large model.

            5 votes
            1. [7]
              Nazarie
              Link Parent

              > $10 will get you an A100 for an entire afternoon

              Where at?

              1. [5]
                Greg
                Link Parent

                Sounds about right for CoreWeave or Runpod's on-demand pricing. The big three cloud providers tend to be a bit more expensive, but you can often get a few hundred dollars in signup credits and if you're lucky you might even find capacity on spot pricing.

                1 vote
                1. [4]
                  Nazarie
                  Link Parent

                  I tried to get an account with CoreWeave and they had to manually approve it. That required explaining to the sales team what I planned to do and it had to be large enough to make them gobs of money (reading between the lines). So they can go pound sand for all I care; I'll never willingly use their services.

                  1. [3]
                    Greg
                    Link Parent

                    Entirely your call whose services you use, of course, but I didn’t really take any offence at that requirement - admittedly I am a business user here, but when I signed up to test them out a few months back I pretty much put “don’t know, just want to see if you’re any good” in all the boxes and still got up and running a few hours later. They’ve got limited resources to allocate and they’re in extremely high demand, so the idea that they’re at least eyeballing the signups beforehand to make sure it’s done efficiently doesn’t strike me as unreasonable.

                    1 vote
                    1. [2]
                      Nazarie
                      Link Parent

                      It was much more than that. If it had been that easy I'd likely be using them. No, a salesperson reached out and needed me to justify my usage levels and explain, in detail, what we were building. Ironically, they wouldn't even get on a call to discuss it unless my answers made it worthwhile to even engage with me. That was more than enough to send me someplace else. Instead we have a committed-use contract for 4 A100s through the end of the year with GCP and are trying to get 8+ H100s next year (assuming they finally let mere mortals access them).

                      I get that availability is limited, but the whole experience left such a sour taste in my mouth that they will never get my business. Contrast that with Lambda Labs, who were much more egalitarian and simply let us know that the minimum contract they could/would do was 8x H100s for 1-3 years. We weren't quite ready to take on that cost. I'll gladly use Lambda Labs when we hit that level, as their pricing on H100s seemed great and they didn't treat us like lesser people because we weren't asking for a whole rack.

                      As you can tell, I have a strong dislike for CoreWeave at this point.

                      1 vote
                      1. Greg
                        Link Parent

                        Oh that's fascinating - totally different to my experience, and yeah I'd be pissed in your position too in that case! I was assuming a roughly similar process, but it definitely sounds like they changed things up somewhere along the line.

              2. pbmonster
                Link Parent

                I had good experiences with runpod.io around the time LLaMA-1 was released. No idea how booked their hardware is right now; availability has sometimes been a problem.

            2. Greg
              Link Parent

              I can see the potential for that with LLMs that just flat out don't fit otherwise, but I haven't found stable diffusion models to be especially taxing on VRAM (relatively speaking, at least! We're still talking single vs dual 24GB cards here) - larger batch sizes are useful, but that'll scale performance in linear steps with how many instances fit in VRAM rather than the binary yes/no of whether there's space for an individual model.
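
              For context, a minimal sketch with the diffusers library (the model ID and batch size are just illustrative): bumping num_images_per_prompt is the "fill the spare VRAM with a bigger batch" knob, and throughput scales in those linear steps rather than unlocking a different model.

                import torch
                from diffusers import StableDiffusionPipeline

                # SD 1.5 weights are only a few GB in fp16, so a 24 GB card has plenty of headroom.
                pipe = StableDiffusionPipeline.from_pretrained(
                    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
                ).to("cuda")

                # Larger batches use the spare VRAM; each extra image in the batch is roughly free
                # until memory runs out.
                images = pipe("a watercolor fox", num_images_per_prompt=8).images
                for i, img in enumerate(images):
                    img.save(f"fox_{i}.png")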

          2. tauon
            Link Parent

            Even together with "relatively", the phrases 3090 and lower-powered feel like they shouldn't appear in the same sentence until at least like 2028… And I'm not even that old, just more used to laptops, lol.

            1 vote
      2. Greg
        Link Parent

        It’s kind of a middle ground between consumer and professional that’s not quite either one.

        Most things that are plausible for hobbyists will be aggressively quantised and optimised towards 12GB because that’ll run well on cards that are more broadly accessible to home users, and most things that aren’t “mass market” will expect to be running on a 40GB card at bare minimum, but often across multiple linked 80GB cards.

        Not to say the 4090 is a bad shout for ML work, at all: the chip itself and the memory bandwidth are both excellent, and if you are running something that’d otherwise fit on a 3060 there’s space in the VRAM to crank up the batch size as well and add a significant multiplier to those advantages.
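
        Just to put a number on the batch-size point, here's a throwaway PyTorch micro-benchmark (a toy MLP rather than a real model, purely to show throughput climbing with batch size until compute or VRAM saturates):

          import time
          import torch

          # Toy fp16 model on the GPU; bigger batches amortize per-launch overhead.
          model = torch.nn.Sequential(
              torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
          ).half().cuda()

          for batch in (1, 8, 64, 256):
              x = torch.randn(batch, 4096, device="cuda", dtype=torch.float16)
              torch.cuda.synchronize()
              t0 = time.time()
              with torch.no_grad():
                  for _ in range(100):
                      model(x)
              torch.cuda.synchronize()
              print(batch, f"{100 * batch / (time.time() - t0):.0f} samples/s")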

        If anything it has the potential to be too good. I wouldn’t be at all surprised if NVIDIA is keeping the VRAM “low” on the xx90 cards because they’d otherwise start cutting into the professional card sales: a 40GB 4090 would only cost ~$70 extra to make, and they’d have zero trouble selling them at a $500 markup to specialist users, but that’d risk cutting into the A6000’s market (which carries a $5,000 markup over the 4090) or even seeing HPC users hacking them together into clusters like you occasionally saw with PS3s back in the day, rather than buying A100s.

        3 votes
    2. Handshape
      Link Parent

      Updoot for airoboros and derivatives, but adding that quantized versions exist for a lot of pre-baked LLMs, and that optimized runtimes and SIMD trickery can make it fit comfortably in as little as 4.5GB of VRAM... or even plain old RAM with a modern CPU.

      Check out TheBloke on huggingface for pre-quantized models, and the llama.cpp inference runtime for speed. Set up cuBLAS on your machine, and then compile llama.cpp (or even easier, its Python bindings) to bind to cuBLAS.
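
      Something like this, if memory serves (the CMAKE_ARGS flag is the cuBLAS build switch the llama-cpp-python docs describe, and the model filename is just a placeholder for one of TheBloke's quantized files):

        # Build the Python bindings against cuBLAS (flag as documented for llama-cpp-python):
        #   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
        from llama_cpp import Llama

        # A 4-bit 13B GGUF from TheBloke is roughly 7 GB, so it fits entirely on the GPU.
        llm = Llama(model_path="airoboros-l2-13b.Q4_K_M.gguf", n_gpu_layers=-1)
        print(llm("Once upon a time", max_tokens=100)["choices"][0]["text"])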

      Once that's wired, airoboros goes brrrrt.

      I have to imagine that a 4090 could run a quantized 30b model at speed.

      2 votes
  2. [2]
    thefilmslayer
    Link

    Easy Diffusion can be run locally on your machine, and it's not too hard to figure out the basics. It allows for merging of models as well.

    6 votes
    1. Jasontherand
      Link Parent

      I also use easy diffusion. It's a great starting place and can do a lot.

      3 votes
  3. Halfdan
    Link

    AUTOMATIC1111 is quite nice for images, and plenty easy to get into. There's tons of settings, but you can safely ignore those for now and just type a prompt of whatever you want to see.

    Oobabooga is good for chatbots.

    2 votes
  4. hammurobbie
    Link

    For an easy way to get started, check out faraday.dev. If you're comfortable with git, oobabooga is the way to go.

    2 votes
  5. [3]
    manosinistra
    Link

    Sorry for the non-informative post, but some of the nomenclature used here makes me think it’s a gag thread. “Ogaboogaboinga”?

    It does help me appreciate, though, how out of the loop I am, and it conveys a sense of how nascent the whole field still is.

    A few months ago, wrapping my head around what AUTOMATIC1111 is led to a whole morning of rabbit holes.

    1 vote
    1. hammurobbie
      Link Parent

      Ha ha, yeah, they're the names of developers who maintain popular GitHub repositories, so they can be named pretty much anything.

      3 votes
    2. bioemerl
      Link Parent

      See, the problem is the guy named his program

      Text Generation web UI

      which is the most boring and dull name to ever exist, so basically everyone calls the program by the developer's username, which is oobabooga.

      2 votes
  6. flowerdance
    Link

    YouTube channel Jarods Journey is a good resource for AI as well. He's a really great guy whom I've been following for some time now.

    1 vote