7 votes

Multi-Token Prediction (MTP) with Gemma 4

1 comment

  1. Wes

    I thought this was an excellent breakdown of the new MTP feature in local LLMs. This write-up focuses on Gemma 4 and was written by a member of Google DeepMind.

    MTP is used for speculative decoding, which couples a larger dense model with a smaller "draft" model. In Gemma 4's case, the draft model has only 76M parameters.

    Typically with LLMs, every token costs the same to generate (one full forward pass through the model), even when the remaining tokens are obvious or trivial. MTP works by letting the draft model take a crack at a handful of tokens, then passing those back to the larger model, which verifies all of them in a single batched pass. Because the two models share a lot of their cache data, this process is fast, and even the misses carry minimal cost.
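
    To make that concrete, here's a toy sketch of the draft-and-verify loop in Python. The model interfaces are invented for illustration (this is not llama.cpp's or DeepMind's API); the point is just how the draft proposes k tokens and the target keeps the longest agreeing prefix:

    ```python
    # Toy sketch of speculative decoding's draft-and-verify loop.
    # The model interfaces here are made up for illustration; real
    # implementations work on logits and share KV-cache state.

    def speculative_step(target_model, draft_model, context, k=4):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, draft_ctx = [], list(context)
        for _ in range(k):
            tok = draft_model(draft_ctx)
            proposal.append(tok)
            draft_ctx.append(tok)

        # 2. Target model checks all k positions in ONE forward pass;
        #    this batching is where the speedup comes from.
        verified = target_model(context, proposal)

        # 3. Keep the longest prefix both models agree on; on the first
        #    miss, the target's token wins and the rest are discarded.
        accepted = []
        for drafted, correct in zip(proposal, verified):
            if drafted != correct:
                accepted.append(correct)
                break
            accepted.append(drafted)
        return context + accepted

    # Tiny driver with stand-in "models" so the sketch actually runs:
    # the target knows the true sequence, the draft is wrong at every
    # 5th position.
    if __name__ == "__main__":
        pattern = [1, 2, 3, 4, 5, 6, 7, 8]

        def target_model(ctx, proposal):
            return pattern[len(ctx):len(ctx) + len(proposal)]

        def draft_model(ctx):
            i = len(ctx)
            return pattern[i] if i < len(pattern) and i % 5 else -1

        seq = []
        while len(seq) < len(pattern):
            seq = speculative_step(target_model, draft_model, seq)
        print(seq)  # [1, 2, 3, 4, 5, 6, 7, 8]
    ```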

    This approach trades a little VRAM (to load the draft model) for a substantial speedup in token generation. No accuracy is lost, since the larger model takes over on any disagreement. From some of the early tests, I'm seeing numbers like a 1.85x speedup for an 11% VRAM increase.
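
    For intuition on where a number like 1.85x can come from, here's a back-of-envelope model. The acceptance rate and draft cost below are my own guesses picked to land near the quoted figure, not measurements: each round costs one target forward pass plus k cheap draft passes, and yields one guaranteed token plus however many draft tokens survive verification.

    ```python
    # Rough expected-speedup model for speculative decoding. The
    # accept_rate and draft_cost values are assumptions, not data
    # from the tests mentioned above.

    def expected_speedup(k=4, accept_rate=0.5, draft_cost=0.01):
        # A draft token at position i is only kept if all earlier
        # ones were too, so it survives with probability accept_rate**i.
        accepted = sum(accept_rate ** i for i in range(1, k + 1))
        accepted += 1  # the target's own pass always yields one token
        round_cost = 1 + k * draft_cost  # in units of one target pass
        return accepted / round_cost

    print(f"{expected_speedup():.2f}x")  # ~1.86x with these assumptions
    ```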

    Support for MTP is coming soon to llama.cpp, and should then work its way out to derivatives like LM Studio. The implementation will also cover other MTP-enabled models like Qwen 3.6.

    3 votes