I thought this was an excellent breakdown of the new multi-token prediction (MTP) feature in local LLMs. This write-up focuses on Gemma 4 and was written by a member of Google DeepMind.
MTP is used in speculative decoding. It couples a larger dense model with a smaller "draft" model. In the case of Gemma 4, the draft model only has 76M parameters.
Typically with LLMs, calculating each token carries the same cost, even if the remaining tokens seem obvious or trivial. MTP works by letting the draft model take a crack at generating a handful of tokens, then passing those back to the larger model to verify. Because the two models share a lot of their cache data, this process is fast and even the misses have minimal cost.
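To make the draft-then-verify loop concrete, here is a minimal sketch of one round with greedy decoding. Everything in it is an assumption for illustration: the "models" are toy functions that map a token list to the next token, the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and nothing here reflects how Gemma 4 or llama.cpp actually implement it. In particular, a real implementation verifies all drafted positions in a single batched forward pass of the large model, which the loop below only imitates.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # toy greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens kept after one draft-then-verify round (greedy decoding)."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position in order.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if expected != tok:
            # First disagreement: keep the target's own token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Generate n_new tokens by repeating speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


def plain_generate(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Ordinary one-token-at-a-time decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


# Toy models: the target counts upward modulo 10; the draft agrees with it
# except right after a 5, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10
```

In this sketch, `generate(toy_target, toy_draft, [0], 12)` returns the same sequence as `plain_generate(toy_target, [0], 12)`, which is the "no accuracy is lost" property: a bad guess from the draft only costs the tokens drafted after it, never the correctness of the output.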
This approach trades a little VRAM (to load the draft model) for a substantial speedup in token generation. No accuracy is lost, as the larger model will take over in the case of any disagreement. In some of the early tests, I'm seeing numbers like a 1.85x speedup for an 11% VRAM increase.
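For a rough feel of where a speedup like that comes from, here is a back-of-the-envelope sketch using the simplified cost model common in the speculative decoding literature. It assumes each drafted token is accepted independently with the same probability, and that one draft step costs a fixed fraction of one large-model pass. The function names and the example inputs are hypothetical, not measurements from Gemma 4 or any benchmark.

```python
def expected_tokens_per_round(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per verification pass when k tokens are drafted.

    Assumes every drafted token is accepted independently with probability
    accept_rate, which is a simplification of real model behaviour.
    """
    if accept_rate >= 1.0:
        return float(k + 1)  # every draft accepted, plus the target's extra token
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the cost of one draft step relative to one target pass
    (an assumed input; it depends on the models and hardware).
    """
    round_cost = 1.0 + k * draft_cost  # one target pass plus k draft steps
    return expected_tokens_per_round(accept_rate, k) / round_cost


# Illustrative inputs only: 70% acceptance, 4 drafted tokens, and a draft
# step that costs 5% of a target pass.
print(round(estimated_speedup(0.7, 4, 0.05), 2))
```

The takeaway from the model is that the gain depends on how often the draft guesses right and how cheap it is to run, so real numbers will vary with the prompt and the hardware.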
Support for MTP is coming soon to llama.cpp, and should then work its way out to derivatives like LM Studio. That support will also cover other MTP-enabled models like Qwen 3.6.
Hey @hungariantoast, just to let you know, I was intentionally linking to the MTP section of this article. There's a lot of buzz about it right now and this was the best explanation I could find.
The rest of the article is good, too, and probably worth reading, but I thought that was the most topical bit. If the URL is changed, though, the title should probably be updated to match.
Reverted it back to your link.
Thanks. Sorry it took me a few days to notice.
I'll add that some other resources have since come out on the topic. Google now has a blog post on MTP with a less-technical breakdown, which links to a research article (Nov 2022) with a more technical breakdown.
My bad. I either didn't realize the link to that specific section was intentional or, because of the formatting, didn't realize it was a link to a specific section at all, and I've slept since then, so I don't remember which it was ;)
Thanks for letting me know, and thanks @mycketforvirrad for fixing it.
No worries. The URL formatting is a little confusing because it encodes the section symbol (§) into this mess (%C2%A7).