Google’s TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x

[6]

Greg

March 28 (edited March 28)

Link

Between this from Google (it’s a very significant optimisation to KV caching, for anyone following along), and DeepSeek’s “Engram” conditional static memory lookups, it’s looking like there’s...

Between this from Google (it’s a very significant optimisation to KV caching, for anyone following along), and DeepSeek’s “Engram” conditional static memory lookups, it’s looking like there’s another decent step up in model performance coming. I’m actually a little surprised to be saying that, I was starting to think we were plateauing a bit and potentially resorting to slightly more brute force approaches to making improvements, so it’ll be interesting to see how much more steam there is in this broad class of LLM architectures!

[Edit] DeepMind’s recent paper about doing actual, publishable mathematical research with an LLM is very significant too IMO. This isn’t some beam search brute force approach to constructing a technically novel but ultimately unreadable thousand page proof, it seems to be a research team genuinely creating new knowledge with the aid of an LLM, which strikes me as an important tipping point in capability.

23 votes

Wes
March 28
Link Parent
While I only follow along as a layman, I've seen some interesting discussion in the mathematics community about LLMs helping to find lemmas. Donald Knuth also recently wrote about Claude helping...

While I only follow along as a layman, I've seen some interesting discussion in the mathematics community about LLMs helping to find lemmas. Donald Knuth also recently wrote about Claude helping him solve a problem.

It does seem like there is some genuine value in frontier mathematics.

8 votes
post_below
March 28
Link Parent
Anthropic is claiming a step change in an upcoming, larger than Opus, model release. But they'll need both memory and inference optimizations, they're already pushing the limits of their available...

Anthropic is claiming a step change in an upcoming, larger than Opus, model release. But they'll need both memory and inference optimizations, they're already pushing the limits of their available compute due to skyrocketing demand. Maybe they've already integrated some version of similar optimizations. Google's is the latest headline but a lot of groups have been working on it.

5 votes
[3]
skybrian
March 28 (edited March 28)
Link Parent
Perhaps Google deployed TurboQuant already? They were pretty early with supporting long-context conversations. The Engram paper is pretty interesting too.

Perhaps Google deployed TurboQuant already? They were pretty early with supporting long-context conversations.

The Engram paper is pretty interesting too.

3 votes
1. [2]
  Greg
  March 28
  Link Parent
  Yeah, that absolutely seems plausible - will be interesting to see how (and if) models from the big players change as this newly published stuff cross-pollinates between them and mixes with...
  
  Yeah, that absolutely seems plausible - will be interesting to see how (and if) models from the big players change as this newly published stuff cross-pollinates between them and mixes with whatever tricks and techniques they each haven’t published!
  
  3 votes
  1. tauon
    March 28 (edited 6 days, 1 hour ago)
    Link Parent
    Between this *Google research now and the recent tech of fitting like 400B parameter models on an iPhone, it’s looking to be quite an interesting time for running local models! (I’m not finding it...
    
    Between this *Google research now and the recent tech of fitting like 400B parameter models on an iPhone, it’s looking to be quite an interesting time for running local models!
    
    (I’m not finding it right now, but I saw someone somewhere improving upon this repo, which was about running big models on laptop hardware, and that was already impressive enough in itself if you ask me: https://github.com/danveloper/flash-moe)
    
    Edit – found it! Saw it in this tw*tter post that was shared somewhere: https://xcancel.com/anemll/status/2035901335984611412?s=20. They’ve also forked the above repo at the same name as their twitter handle, and with tweaks for a fork of a fork at Anemll/Flash-iOS.)
    
    4 votes

hungariantoast (OP)

March 27

Link

Blog post from Google: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/ And the paper: https://arxiv.org/abs/2504.19874

Blog post from Google:

https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

And the paper:

https://arxiv.org/abs/2504.19874

10 votes

[12]

Narry

March 27

Link

RAM prices might be going down? Maybe? I hope? Narry needs a new Linux gaming box...

10 votes

[4]
slade
March 28
Link Parent
Or maybe we just move 6 times faster towards the bubble pop.

Or maybe we just move 6 times faster towards the bubble pop.

17 votes
1. [3]
  Greg
  March 28
  Link Parent
  In which case used server farm RAM prices really go down, it’s a win-win! Win-win not guaranteed, management takes no responsibility for job losses, riots, rising fascism, economic collapse,...
  
  In which case used server farm RAM prices really go down, it’s a win-win!
  
  Win-win not guaranteed, management takes no responsibility for job losses, riots, rising fascism, economic collapse, plagues of locusts, or indigestion.
  
  21 votes
  1. JCPhoenix
    March 28
    Link Parent
    Well at least the sticks of RAM will keep us warm and fed.
    
    Well at least the sticks of RAM will keep us warm and fed.
    
    6 votes
  2. creesch
    March 28
    Link Parent
    I am slightly optimistic for a long term recovery once the surviving companies with actual products realize they actually need proper developers, testers, etc to untangle the LLM induced mess....
    
    for job losses, riots, rising fascism, economic collapse, plagues of locusts, or indigestion.
    
    I am slightly optimistic for a long term recovery once the surviving companies with actual products realize they actually need proper developers, testers, etc to untangle the LLM induced mess. </wishful-thinking>
    
    4 votes
[7]
CptBluebear
March 28
Link Parent
Unlikely, if you can reduce RAM usage you can now cram more model into the same RAM. It's basically induced demand. More bandwidth available is more bandwidth used.

Unlikely, if you can reduce RAM usage you can now cram more model into the same RAM. It's basically induced demand. More bandwidth available is more bandwidth used.

12 votes
1. [2]
  tauon
  March 28
  Link Parent
  It’s just like the, ahem, beloved “only one more lane, bro” known from automotive highway planning!
  
  It’s just like the, ahem, beloved “only one more lane, bro” known from automotive highway planning!
  
  10 votes
  1. CptBluebear
    March 28
    Link Parent
    Yes, exactly that. There's a proposal in my municipality to force trucks onto the bus lane instead of the road which should, according to them, alleviate pressure on traffic. At the same time...
    
    Yes, exactly that.
    
    There's a proposal in my municipality to force trucks onto the bus lane instead of the road which should, according to them, alleviate pressure on traffic.
    At the same time they're ditching the idea to add another bus route to a nearby city.
    
    I don't have to tell you what will likely happen if this plan goes through.
    
    6 votes
2. [3]
  Narry
  March 28
  Link Parent
  A man can dream… a man can dream.
  
  A man can dream… a man can dream.
  
  5 votes
  1. [2]
    CptBluebear
    March 28
    Link Parent
    I commiserate friend. The only feasible upgrade route for my own PC is to get an AM5 processor and therefore DDR5 RAM. It's prohibitively expensive to the point I'm not sure when it will be...
    
    I commiserate friend. The only feasible upgrade route for my own PC is to get an AM5 processor and therefore DDR5 RAM. It's prohibitively expensive to the point I'm not sure when it will be possible.
    
    Here's to hoping none of my current parts break.
    
    8 votes
    
    Narry
    6 days, 20 hours ago
    Link Parent
    I’ve taken to leaving my ancient rig turned off until I really have the urge to play something, then update everything for an hour, forget what I wanted to play and shut down the computer again...
    
    I’ve taken to leaving my ancient rig turned off until I really have the urge to play something, then update everything for an hour, forget what I wanted to play and shut down the computer again for several weeks or months.
    
    3 votes
3. skybrian
  March 28
  Link Parent
  The AI labs do provide cheaper models, so this depends on customer behavior. Are they going to keep switching to the best model available or will they decide at some point to save money?...
  
  The AI labs do provide cheaper models, so this depends on customer behavior. Are they going to keep switching to the best model available or will they decide at some point to save money?
  
  Anecdotally, I use Sonnet rather than Opus for writing code most of the time to cut costs, because it seems good enough.
  
  2 votes

[9]

Nihilego

March 28

Link

I skimmed through this quickly but I assume this algorithm would be restricted to just Google or corporations? Can I benefit from this if I want to homelab? I’ve got a 1060 and while running...

I skimmed through this quickly but I assume this algorithm would be restricted to just Google or corporations?

Can I benefit from this if I want to homelab?

I’ve got a 1060 and while running ollama on it is kind of impressive the first few times, it is a 6GB VRAM card after all.

2 votes

[6]

hungariantoast (OP)

March 28 (edited March 28)

Link Parent

Anyone can implement TurboQuant. There's already work being done to add it to llama.cpp (what Ollama uses under the hood) and other inference software. It's also possible to use TurboQuant on...

Anyone can implement TurboQuant. There's already work being done to add it to llama.cpp (what Ollama uses under the hood) and other inference software. It's also possible to use TurboQuant on existing models.

Also, Ollama is kind of bad :( . I'd recommend just using llama.cpp directly because it gives you more control over how a model is ran, and you will get better performance.

For reference, here is the command I use to run a GGUF formatted Qwen3.5-4B model on my system:

llama-server --model ~/.local/models/hugging_face/unsloth/Qwen3.5-4B-GGUF/Qwen3.5-4B-UD-Q4_K_XL.gguf \
--ctx-size 262144 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--reasoning on --repeat-penalty 1.1

--ctx-size 262144: Sets the size of the "context window" (how many tokens the model can "remember" at once). I set it to 262,144, the maximum size supported by Qwen3.5 (without shenanigans). However, I am also using a GPU with 12GB VRAM, so you should probably drop this number down to about 128,000 (at least).

--temp 0.6: Sets the "termperature" of the model's output to 0.6 on a scale of 0 to 1. This controls how deterministic or random a model's responses will be. A lower temperature is more deterministic and "focused", while a higher temperature is more random and "creative". What type of work you want the model to do will determine the temperature you want to run it at. Keep in mind that even if you set the temperature to 1, a model's output is never truly, completely deterministic.

--top-p 0.95: Sets the "nucleus sampling probability" to 0.95. I don't understand this flag very well, but I think the way it works is that the model forms a list of tokens for the next prediction (such as the next word in a sentence). These tokens are sorted with their individual probabilities, in descending order (t1=0.5, t2=0.2, t3=0.1, t4=0.09, etc). The model then collects the first X number of tokens until the cumulative probability of all collected tokens is 0.95.

--top-k 20: Does kind of the same thing as top-p, but limits the amount of tokens for the next prediction to 20.

--mini-p 0.00: Filters out tokens whose relative probability to the highest probability token is below the threshold. A value of 0.00 disables the feature and allows for broader sampling for the next prediction.

--reasoning on: Enables the model's "reasoning" mode where it "thinks" about the response it will give before actually responding. This is disabled by-default for the 9B and smaller Qwen3.5 models. Whatever model you are using might not even support reasoning, and whether you want to enable it or not will depend on the model and the work it will do.

--repeat-penalty 1.1: Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully disc^C

14 votes

Omnicrola
March 28
Link Parent
Well done xD

Applies a slight penalty to repeated tokens to hopefully discourage idiot doom spirals. Applies a slight penalty to repeated tokens to hopefully disc^C

Well done xD

7 votes

[4]

chili-man

March 28

Link Parent

What's your experience with the performance diff on Ollama vs. llama.cpp? I like that it's easy to use and was assuming it didn't have much penalty, but if I'm leaving a lot on the table I'd be...

What's your experience with the performance diff on Ollama vs. llama.cpp? I like that it's easy to use and was assuming it didn't have much penalty, but if I'm leaving a lot on the table I'd be curious to know.

2 votes

[3]

hungariantoast (OP)

March 28

Link Parent

I did a quick test with Qwen3.5-4B. Specifically, I used the UD-Q4_K_XL GGUF from Unsloth for llama.cpp, and the default Q4_K_M quant selected by Ollama. Here is the command for llama.cpp:...

I did a quick test with Qwen3.5-4B. Specifically, I used the UD-Q4_K_XL GGUF from Unsloth for llama.cpp, and the default Q4_K_M quant selected by Ollama.

Here is the command for llama.cpp:

llama-server --model ~/.local/models/hugging_face/unsloth/Qwen3.5-4B-UD-Q4_K_XL.gguf \
--ctx-size 262144 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
--repeat-penalty 1.1 --reasoning on --verbosity 3

The command for Ollama:

ollama --verbose run qwen3.5:4b

For both Ollama and llama.cpp I used Vulkan for GPU acceleration.

The prompt was: What does the word "erudite" mean?

The results were:

	Ollama	llama.cpp
Prompt size	21 tokens	21 tokens
Prompt processing duration	89.01ms	83.57ms
Prompt processing speed	235.92 tokens/s	251.29 tokens/s
Generation size	794 tokens	742 tokens
Generation time	13.80s	9.59s
Generation speed	57.53 tokens/s	77.44 tokens/s

There was a noticeable bump in generation speeds for llama.cpp. This was not a perfectly fair comparison though. I did not use the exact same model file for each test, but that is because Ollama hashes the model files it downloads and I don't know how to run them in llama.cpp. I also don't know where Ollam downloads its model files from, and I am too lazy to find out.

However, llama.cpp should actually be at a disadvantage, because I used the maximum context window size and a much larger model file (5.6 GiB) for its test, than I did with Ollama (3.2 GiB).

Of course, Ollama uses (a fork of) llama.cpp under the hood, so I'm sure there are ways to tweak it to perform similarly. If you're going to do that though, you might as well use llama.cpp directly. It has a web UI, a routing mode, automatic model offloading, and recently added support for MCP servers. I'm not sure if Ollama offers anything llama.cpp does not.

Recently I have spent more time than I care to admit experimenting with llama.cpp, OpenCode, and generally just finding ways to make local models useful. If you have any more questions, feel free to ask.

4 votes

[2]
chili-man
March 28
Link Parent
Wow, thanks for testing it! That seems significant enough that I'll look into setting it up instead. I picked up ollama 2 years ago when it was pretty new, and was the easiest way (especially for...

Wow, thanks for testing it! That seems significant enough that I'll look into setting it up instead. I picked up ollama 2 years ago when it was pretty new, and was the easiest way (especially for AMD cards, at least that I found) and never really considered anything else.

2 votes
1. hungariantoast (OP)
  March 28
  Link Parent
  No problem! Like I said, I've spent way too much time lately messing with this stuff. I'm happy to have an excuse to write about it One quick note about AMD though: I generally get much faster...
  
  No problem! Like I said, I've spent way too much time lately messing with this stuff. I'm happy to have an excuse to write about it
  
  One quick note about AMD though: I generally get much faster processing and generation speeds when I run llama.cpp with ROCm than I do with Vulkan.
  
  However, getting ROCm installed can be a huge pain, and whether it supports your card (and what capabilities it supports on your card) is difficult to figure out.
  
  On top of that, when I run llama.cpp with ROCm and have a model loaded (not even doing anything, just loaded into VRAM), my computer becomes almost unusable. I can't even switch focus to another window without gnarly stuttering. ROCm seems much more aggressive with how it allocates VRAM. I think if I were running llama.cpp/ROCm "headless" on this computer, and doing all my other work on another device, it would work great, but I haven't got around to trying that yet.
  
  3 votes

Greg
March 28
Link Parent
It’s a published paper, that means others can and will reimplement it - and likely pretty quickly, given the speed of OSS development in the field. You might be interested in the links that @tauon...

It’s a published paper, that means others can and will reimplement it - and likely pretty quickly, given the speed of OSS development in the field. You might be interested in the links that @tauon mentioned further up, too? Those are available here and now, and the results look seriously impressive!

1 vote
tauon
March 28
Link Parent
I think a good proxy for whether it’s restricted to their own offerings or not is public knowledge: If they wanted Gemini to be the only model/model family that has the tech available, we wouldn’t...

I think a good proxy for whether it’s restricted to their own offerings or not is public knowledge: If they wanted Gemini to be the only model/model family that has the tech available, we wouldn’t even know about this research (or it’d just get published very silently without an accompanying research.google blog post).

1 vote

Link information

28 comments