Google's Gemma 4 Just Got 3x Faster - MTP Drafters Land for the Open Models
Summary Report
Google has released MTP drafters for Gemma 4 that deliver up to 3x faster inference with no quality loss. They're on Hugging Face today under Apache 2.0.
- 01. Google released Multi-Token Prediction (MTP) drafters for the Gemma 4 family on May 5, 2026.
- 02. The drafters use speculative decoding to deliver up to a 3x inference speedup with no quality degradation.
- 03. A small drafter predicts multiple tokens at once, which the main Gemma 4 model verifies in parallel.
- 04. MTP applies to both the 26B MoE and 31B Dense Gemma 4 models, and runs on consumer GPUs and edge devices.
- 05. Drafters are available on Hugging Face and Kaggle under Apache 2.0, with support across transformers, MLX, vLLM, SGLang and Ollama.
Google has released MTP (multi-token prediction) drafters that accelerate Gemma 4 inference by up to three times without modifying the core models. The system uses speculative decoding, where smaller drafter models predict upcoming tokens whilst the main Gemma 4 model verifies these predictions in parallel.
The approach maintains Gemma 4's reasoning capabilities whilst reducing latency through parallelisation rather than sequential processing. When the drafter's predictions align with what Gemma 4 would generate, the entire sequence is accepted in a single pass, significantly speeding up inference.
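To make the draft-and-verify loop concrete, here is a minimal greedy speculative decoding sketch. It is an illustration of the general technique only, not Google's MTP implementation: `target_model` and `draft_model` are placeholder callables assumed to follow the Hugging Face convention of returning an object with a `.logits` attribute, and a batch size of one is assumed throughout.

```python
import torch

def speculative_decode_step(target_model, draft_model, input_ids, k=4):
    """One round of greedy speculative decoding (illustrative sketch).

    The drafter proposes k tokens autoregressively; the target model then
    scores the whole proposed sequence in a single forward pass and keeps
    the longest prefix that matches its own greedy choices.
    Assumes batch size 1 and HF-style model outputs with a .logits field.
    """
    # 1. Drafter proposes k tokens one at a time (it is small, so this is cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2. Target model verifies all k proposals in ONE parallel forward pass.
    target_logits = target_model(draft_ids).logits
    start = input_ids.shape[1] - 1  # logits at position i predict token i+1
    target_choices = target_logits[:, start:-1, :].argmax(dim=-1)
    proposed = draft_ids[:, input_ids.shape[1]:]

    # 3. Accept the longest matching prefix, then append the target's own
    #    token at the first mismatch, so at least one token is always produced.
    matches = (target_choices == proposed).long().cumprod(dim=-1)
    n_accepted = int(matches.sum())  # batch size 1 assumed
    accepted = proposed[:, :n_accepted]
    correction = target_logits[:, start + n_accepted, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, correction], dim=-1)
```

When the drafter agrees with the target on all k positions, one target forward pass yields k+1 tokens instead of one, which is where the speedup comes from; when it disagrees, the cost is bounded by the drafter's (small) overhead.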
The performance gains are particularly notable for the 26B MoE and 31B Dense Gemma 4 variants running on consumer hardware. For edge deployment, the faster inference translates to improved battery life on mobile devices, addressing a key constraint for local AI applications.
The MTP drafters are available under the same Apache 2.0 licence as the core Gemma models, with immediate support across major inference frameworks including Hugging Face Transformers, MLX, vLLM, SGLang, and Ollama. This marks another step in making speculative decoding standard practice for optimising large language model inference without retraining.
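In Hugging Face Transformers, speculative decoding is already exposed through the `assistant_model` argument of `generate` (assisted generation), so wiring a drafter in looks roughly like the sketch below. The checkpoint IDs are placeholders, since the article does not name the actual repositories.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint IDs; substitute the real Gemma 4 and MTP drafter
# repositories published on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31b")
target = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b")
drafter = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b-mtp-drafter")

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")
# Passing assistant_model enables Transformers' built-in assisted generation,
# which drives the draft-and-verify loop described above.
output = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the drafter sits entirely outside the target model, this pattern needs no retraining or changes to the Gemma 4 weights themselves.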