DiffusionGemma is 4x faster but hallucinates 6x more facts
A benchmark on a single H100 (FP8) found that DiffusionGemma 26B A4B generates 763 tok/s against Gemma4 26B A4B's 218 tok/s (about 3.5x the throughput, and roughly 4x faster end to end at 3.7 s versus 15.1 s), but produced 28 factual errors versus Gemma4's 5, roughly 6x as many mistakes.
Score breakdown
DiffusionGemma's parallel token-generation architecture produces fluent but factually unreliable text, with error rates that grow as topics become more obscure — a concrete limitation that distinguishes it from its autoregressive counterpart for any fact-sensitive use case.
u/gladkos ran a head-to-head benchmark between DiffusionGemma 26B A4B and Gemma4 26B A4B on a single H100 in FP8 precision. Both models were given three identical writing tasks — a Steve Jobs biography, the history of Tetris, and the story of BeOS — chosen deliberately in order of decreasing topic popularity. Every factual claim in every response was then manually verified.
Gemma4 produced 45 correct facts and only 5 errors. DiffusionGemma produced 33 correct facts and 28 errors, with mistakes rising once the topics became more obscure: 4 on Jobs, then 12 each on Tetris and BeOS. Specific hallucinations included naming "Clara Clley" as Steve Jobs' mother, inventing a Tetris colleague for Pajitnov named "Geri Gulovik," and pricing the BeBox at $9,999 when the real price was $1,600. On throughput, DiffusionGemma ran at 763 tok/s and completed its outputs in 3.7 seconds total, compared to Gemma4's 218 tok/s and 15.1 seconds.
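The headline ratios can be recomputed from the reported figures. The short Python sketch below does only that arithmetic; the dictionary layout and the `error_share` helper are illustrative names, not anything from the original post.

```python
# Figures as reported in the benchmark summary; nothing here is newly measured.
results = {
    "Gemma4": {"tok_s": 218, "seconds": 15.1, "correct": 45, "mistakes": 5},
    "DiffusionGemma": {"tok_s": 763, "seconds": 3.7, "correct": 33, "mistakes": 28},
}

# Per-topic DiffusionGemma mistakes, in order of decreasing topic popularity.
diffusion_mistakes_by_topic = {"Jobs": 4, "Tetris": 12, "BeOS": 12}


def error_share(r):
    """Fraction of manually checked claims that turned out to be wrong."""
    return r["mistakes"] / (r["correct"] + r["mistakes"])


g, d = results["Gemma4"], results["DiffusionGemma"]

throughput_ratio = d["tok_s"] / g["tok_s"]       # tokens per second
wall_clock_ratio = g["seconds"] / d["seconds"]   # total generation time
mistake_ratio = d["mistakes"] / g["mistakes"]    # raw mistake counts

print(f"throughput ratio:  {throughput_ratio:.1f}x")
print(f"wall-clock ratio:  {wall_clock_ratio:.1f}x")
print(f"mistake ratio:     {mistake_ratio:.1f}x")
print(f"error share:       Gemma4 {error_share(g):.0%}, DiffusionGemma {error_share(d):.0%}")
```

On these numbers the throughput ratio is 3.5x, the wall-clock ratio is about 4.1x, and DiffusionGemma made 5.6x as many mistakes. Measured as a share of checked claims, the gap is 10% for Gemma4 against about 46% for DiffusionGemma, so the "4x" and "6x" in the headline are rounded figures.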
The post explains the accuracy gap through the models' differing generation strategies: DiffusionGemma generates 256 tokens simultaneously and iteratively polishes them for fluency, so a fabricated name or number that sounds plausible can survive the refinement passes unchanged. Autoregressive Gemma4, by contrast, generates one token at a time and conditions each new token on all prior context. The post also notes that Google's own launch post acknowledges the quality trade-off and recommends using regular Gemma 4 when factual accuracy matters.
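As a rough illustration of why block-parallel generation can be faster, the toy Python sketch below counts sequential model passes under each strategy. It is not DiffusionGemma's actual decoding loop: the 256-token block size comes from the post, but `refine_steps=16` and both function names are assumptions chosen only for the example, and the sketch says nothing about per-pass cost or accuracy.

```python
import math


def autoregressive_passes(n_tokens: int) -> int:
    """One forward pass per generated token, each conditioned on all prior tokens."""
    return n_tokens


def block_refinement_passes(n_tokens: int, block_size: int = 256, refine_steps: int = 16) -> int:
    """Passes needed when each block is drafted at once and refined a fixed number of times.

    block_size=256 follows the post; refine_steps=16 is a hypothetical value.
    """
    blocks = math.ceil(n_tokens / block_size)
    return blocks * refine_steps


n = 1024
print("autoregressive passes:  ", autoregressive_passes(n))
print("block-refinement passes:", block_refinement_passes(n))
```

Under these assumed settings, 1,024 tokens take 1,024 sequential passes autoregressively but 64 with block refinement. Each refinement pass is likely more expensive than a single-token step, so the pass count is only an upper-level intuition and does not predict the measured 763 versus 218 tok/s.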
Key facts
- Benchmark run on a single H100 in FP8 precision, comparing DiffusionGemma 26B A4B vs. Gemma4 26B A4B
- Gemma4: 218 tok/s, 15.1s total, 45 facts correct, 5 mistakes
- DiffusionGemma: 763 tok/s, 3.7s total, 33 facts correct, 28 mistakes
- DiffusionGemma errors increased with topic obscurity: 4 mistakes on Jobs, 12 on Tetris, 12 on BeOS
- Hallucinations included naming 'Clara Clley' as Jobs' mother, inventing a colleague 'Geri Gulovik' for Pajitnov, and pricing the BeBox at $9,999 (real price: $1,600)
- DiffusionGemma generates 256 tokens simultaneously and refines for fluency, not factual accuracy
- Google's own launch post acknowledges the quality trade-off and recommends regular Gemma 4 when facts matter
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 14, 2026 · 09:08 UTC.