20 votes

Just got an Nvidia 4090 GPU, looking for local LLM + general generative AI software recommendations

I was fortunate enough to grab a discounted 4090 while on my travels and just got everything installed. Already having a lot of fun pumping all my games to max settings, but I'm also interested in running generative AI stuff locally to really take advantage of all that VRAM.

Do you have any newbie-friendly Windows 11 software to recommend for getting started? Thanks!

24 comments

  1. [16]
    bioemerl
    Link

    With a 4090 you're in a weird spot. You've got two choices.

    The first is a 13B model. Which one you want depends on what you want to do. Like half of local models are used for role play, and the "model of interest" there is something called MythoMax. Use a program called oobabooga ("text gen web ui") for that.

    The other option is the larger 70B models. I'd recommend "Airoboros" for general use and "Chronos" for creative use. However, to use these you'll need to run part of the model on your CPU because you don't have enough VRAM; a program called koboldcpp will do that (rough sketch of the split at the end of this comment).

    And if you want to code, CodeLlama 34B is what you want.
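
    To give a feel for what that GPU/CPU split looks like, here's a rough sketch using the llama-cpp-python bindings (same idea as koboldcpp's GPU offload; the filename and layer count are just placeholders, not specific recommendations):

      from llama_cpp import Llama

      # Placeholder filename - use whichever quantized 70B GGUF you actually downloaded.
      llm = Llama(
          model_path="airoboros-70b.Q4_K_M.gguf",
          n_gpu_layers=45,  # however many layers fit in 24 GB of VRAM; the rest run on the CPU
          n_ctx=4096,
      )

      out = llm("Write a limerick about VRAM.", max_tokens=64)
      print(out["choices"][0]["text"])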

    10 votes
    1. [14]
      Jordan117
      Link Parent

      > With a 4090 you're in a weird spot.

      What makes you say that? I thought the 4090 was the most powerful consumer GPU by far. (Or are you comparing with the A100 et al?)

      4 votes
      1. [12]
        bioemerl
        Link Parent

        24 gigs of VRAM is a weird amount that doesn't help for most models. Two 3090s would be better.

        4 votes
        1. [11]
          Jordan117
          Link Parent

          Interesting -- I'm in the early stages of researching a new PC build for this stuff, and hadn't considered going with two (relatively) lower-powered cards. Thanks.

          1 vote
          1. [9]
            pbmonster
            Link Parent

            It pretty much only makes sense if you want to work a lot with LLMs for text generation or something like stable diffusion for image generation.

            For those use cases, you benefit enormously from having more VRAM, and the actual speed of the GPU isn't even that important - because the amount of VRAM determines how large your model is, which is absolutely critical for quality. The speed of the GPU just gives you the results a bit faster, which often doesn't even matter that much.
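
            The back-of-the-envelope math (rough numbers, weights only, ignoring context/KV-cache overhead) is just parameter count times bytes per weight:

              def model_vram_gb(params_billion: float, bits_per_weight: float) -> float:
                  """Rough weight-only memory estimate; real usage adds context and overhead."""
                  return params_billion * 1e9 * bits_per_weight / 8 / 1e9

              print(model_vram_gb(13, 4))  # ~6.5 GB -> a 13B at 4-bit fits easily in 24 GB
              print(model_vram_gb(70, 4))  # ~35 GB  -> a 70B at 4-bit needs CPU offload or two cards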

            If you're primarily interested in performance gaming, get the best card you can afford. If you're only a little interested in generative AI, do your first experiments on rented cloud hardware. $10 will get you an A100 for an entire afternoon, enough time to properly play around with a large model.

            5 votes
            1. [7]
              Nazarie
              Link Parent

              > $10 will get you an A100 for an entire afternoon

              Where at?

              1. [5]
                Greg
                Link Parent

                Sounds about right for CoreWeave or Runpod's on-demand pricing. The big three cloud providers tend to be a bit more expensive, but you can often get a few hundred dollars in signup credits and if you're lucky you might even find capacity on spot pricing.

                1 vote
                1. [4]
                  Nazarie
                  Link Parent

                  I tried to get an account with CoreWeave and they had to manually approve it. That required explaining to the sales team what I planned to do and it had to be large enough to make them gobs of money (reading between the lines). So they can go pound sand for all I care; I'll never willingly use their services.

                  1. [3]
                    Greg
                    Link Parent

                    Entirely your call whose services you use, of course, but I didn’t really take any offence at that requirement - admittedly I am a business user here, but when I signed up to test them out a few months back I pretty much put “don’t know, just want to see if you’re any good” in all the boxes and still got up and running a few hours later. They’ve got limited resources to allocate and they’re in extremely high demand, so the idea that they’re at least eyeballing the signups beforehand to make sure it’s done efficiently doesn’t strike me as unreasonable.

                    1 vote
                    1. [2]
                      Nazarie
                      Link Parent

                      It was much more than that. If it had been that easy I'd likely be using them. No, a salesperson reached out and needed me to justify my usage levels and explain, in detail, what we were building. Ironically, they wouldn't even get on a call to discuss it unless my answers made it worthwhile to even engage with me. That was more than enough to send me someplace else. Instead we have a committed-use contract for 4 A100s through the end of the year with GCP and are trying to get 8+ H100s next year (assuming they finally let mere mortals access them).

                      I get that availability is limited, but the whole experience left such a sour taste in my mouth that they will never get my business. Contrast that with Lambda Labs, who were much more egalitarian and simply let us know that the minimum contract they could/would do was 8x H100s for 1-3 years. We weren't quite ready to take on that cost. I'll gladly use Lambda Labs when we hit that level, as their pricing on H100s seemed great and they didn't treat us like lesser people because we weren't asking for a whole rack.

                      As you can tell, I have a strong dislike for CoreWeave at this point.

                      1 vote
                      1. Greg
                        Link Parent

                        Oh that's fascinating - totally different to my experience, and yeah I'd be pissed in your position too in that case! I was assuming a roughly similar process, but it definitely sounds like they changed things up somewhere along the line.

              2. pbmonster
                Link Parent

                I had good experiences with runpod.io around the time LLaMA-1 was released. No idea how booked their hardware is right now; availability has sometimes been a problem.

            2. Greg
              Link Parent

              I can see the potential for that with LLMs that just flat out don't fit otherwise, but I haven't found stable diffusion models to be especially taxing on VRAM (relatively speaking, at least! We're still talking single vs dual 24GB cards here) - larger batch sizes are useful, but that'll scale performance in linear steps with how many instances fit in VRAM rather than the binary yes/no of whether there's space for an individual model.
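
              For context, a minimal sketch with the diffusers library (the model ID and batch size are just illustrative): bumping num_images_per_prompt is the "fill the spare VRAM with a bigger batch" knob, and throughput scales in those linear steps rather than unlocking a different model.

                import torch
                from diffusers import StableDiffusionPipeline

                # SD 1.5 weights are only a few GB in fp16, so a 24 GB card has plenty of headroom.
                pipe = StableDiffusionPipeline.from_pretrained(
                    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
                ).to("cuda")

                # Larger batches use the spare VRAM; each extra image in the batch is roughly free
                # until memory runs out.
                images = pipe("a watercolor fox", num_images_per_prompt=8).images
                for i, img in enumerate(images):
                    img.save(f"fox_{i}.png")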

          2. tauon
            Link Parent

            Even together with "relatively", the phrases 3090 and lower-powered feel like they shouldn't appear in the same sentence until at least like 2028… And I'm not even that old, just more used to laptops, lol.

            1 vote
      2. Greg
        Link Parent

        It’s kind of a middle ground between consumer and professional that’s not quite either one.

        Most things that are plausible for hobbyists will be aggressively quantised and optimised towards 12GB because that’ll run well on cards that are more broadly accessible to home users, and most things that aren’t “mass market” will expect to be running on a 40GB card at bare minimum, but often across multiple linked 80GB cards.

        Not to say the 4090 is a bad shout for ML work, at all: the chip itself and the memory bandwidth are both excellent, and if you are running something that’d otherwise fit on a 3060 there’s space in the VRAM to crank up the batch size as well and add a significant multiplier to those advantages.
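
        Just to put a number on the batch-size point, here's a throwaway PyTorch micro-benchmark (a toy MLP rather than a real model, purely to show throughput climbing with batch size until compute or VRAM saturates):

          import time
          import torch

          # Toy fp16 model on the GPU; bigger batches amortize per-launch overhead.
          model = torch.nn.Sequential(
              torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
          ).half().cuda()

          for batch in (1, 8, 64, 256):
              x = torch.randn(batch, 4096, device="cuda", dtype=torch.float16)
              torch.cuda.synchronize()
              t0 = time.time()
              with torch.no_grad():
                  for _ in range(100):
                      model(x)
              torch.cuda.synchronize()
              print(batch, f"{100 * batch / (time.time() - t0):.0f} samples/s")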

        If anything it has the potential to be too good. I wouldn’t be at all surprised if NVIDIA is keeping the VRAM “low” on the xx90 cards because they’d otherwise start cutting into the professional card sales: a 40GB 4090 would only cost ~$70 extra to make, and they’d have zero trouble selling them at a $500 markup to specialist users, but that’d risk cutting into the A6000’s market (which carries a $5,000 markup over the 4090) or even seeing HPC users hacking them together into clusters like you occasionally saw with PS3s back in the day, rather than buying A100s.

        3 votes
    2. Handshape
      Link Parent

      Updoot for airoboros and derivatives, but adding that quantized versions exist for a lot of pre-baked LLMs, and that optimized runtimes and SIMD trickery can make it fit comfortably in as little as 4.5GB of VRAM... or even plain old RAM with a modern CPU.

      Check out TheBloke on huggingface for pre-quantized models, and the llama.cpp inference runtime for speed. Set up cuBLAS on your machine, and then compile llama.cpp (or even easier, its Python bindings) to bind to cuBLAS.
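
      Something like this, if memory serves (the CMAKE_ARGS flag is the cuBLAS build switch the llama-cpp-python docs describe, and the model filename is just a placeholder for one of TheBloke's quantized files):

        # Build the Python bindings against cuBLAS (flag as documented for llama-cpp-python):
        #   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
        from llama_cpp import Llama

        # A 4-bit 13B GGUF from TheBloke is roughly 7 GB, so it fits entirely on the GPU.
        llm = Llama(model_path="airoboros-l2-13b.Q4_K_M.gguf", n_gpu_layers=-1)
        print(llm("Once upon a time", max_tokens=100)["choices"][0]["text"])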

      Once that's wired, airoboros goes brrrrt.

      I have to imagine that a 4090 could run a quantized 30b model at speed.

      2 votes
  2. [2]
    thefilmslayer
    Link

    Easy Diffusion can be run locally on your machine, and it's not too hard to figure out the basics. It allows for merging of models as well.

    6 votes
    1. Jasontherand
      Link Parent

      I also use easy diffusion. It's a great starting place and can do a lot.

      3 votes
  3. Halfdan
    Link

    AUTOMATIC1111 is quite nice for images, and plenty easy to get into. There's tons of settings, but you can safely ignore those for now and just type a prompt of whatever you want to see.

    Oobabooga is good for chatbots.

    2 votes
  4. hammurobbie
    Link

    For an easy way to get started, check out faraday.dev. If you're comfortable with git, oobabooga is the way to go.

    2 votes
  5. [3]
    manosinistra
    Link

    Sorry for the non-informative post, but some of the nomenclature used here makes me think it’s a gag thread. “Ogaboogaboinga”?

    It does help me appreciate, though, how out of the loop I am, and it conveys a sense of how nascent the whole field still is.

    A few months ago, wrapping my head around what AUTOMATIC1111 is led to a whole morning of rabbit holes.

    1 vote
    1. hammurobbie
      Link Parent

      Ha ha, yeah, they're the names of developers who maintain popular GitHub repositories, so they can be named pretty much anything.

      3 votes
    2. bioemerl
      Link Parent

      See, the problem is the guy named his program

      Text Generation web UI

      which is the most boring and dull name to ever exist, so basically everyone calls the program by the developer's username, which is oobabooga.

      2 votes
  6. flowerdance
    Link

    YouTube channel Jarods Journey is a good resource for AI as well. He's a really great guy whom I've been following for some time now.

    1 vote