DiffusionGemma: open-weight text diffusion

Every large language model you’ve used works the same way underneath. It predicts the next token, appends it, then predicts the next one, over and over, left to right. The whole field is built on that single autoregressive loop. It’s also the reason generation feels sequential — each token has to wait for the one before it.

On June 10, Google DeepMind released DiffusionGemma, an open-weight model that throws that loop out. Instead of writing one token at a time, it starts with a block of 256 masked placeholder tokens and refines the whole thing in parallel, the way an image diffusion model turns static into a photo. It’s the first major open text diffusion model, it ships under Apache 2.0, and on an H100 it reportedly clears 1,000 tokens per second. This is worth understanding even if you never run it.

What DiffusionGemma actually is

DiffusionGemma is a mixture-of-experts model built on Google’s Gemma 4 architecture, with the autoregressive decoding swapped for a diffusion head. The marketing name — “26B-A4B” — encodes the shape: roughly 26 billion total parameters, about 3.8 billion active on any given step.

Spec	DiffusionGemma 26B-A4B
Total parameters	~26B
Active parameters	~3.8B
Experts	128 total, 8 active + 1 shared
Canvas (block) length	256 tokens
Context length	up to 256K
Vocabulary	~262K
License	Apache 2.0

The MoE structure is the same trick everyone’s using now: a large pool of knowledge, only a thin slice of it active for any given token, so per-token compute is closer to a 4B model than a 26B one, even though all ~26B parameters still have to sit in memory. Google quotes a memory footprint of roughly 18GB when quantised, which lands it on a single high-end consumer GPU. That part isn’t novel. The diffusion head is.
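To put the memory figure in context, here is a back-of-envelope sketch of what the weights alone would occupy at a few bit widths. The parameter counts come from the spec table; the bit widths are illustrative assumptions, since the quoted ~18GB figure doesn’t come with a breakdown of the quantisation scheme or runtime overhead.

```python
def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9    # "~26B" from the spec table
ACTIVE_PARAMS = 3.8e9  # "~3.8B" from the spec table

# Bit widths below are assumptions for illustration, not the released formats.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_footprint_gb(TOTAL_PARAMS, bits):5.1f} GB")

# Share of the parameter pool that is active at once.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction: {active_fraction:.1%}")
```

The point of the arithmetic is the asymmetry: compute scales with the active slice, memory scales with the whole pool.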

How diffusion text generation works

Autoregressive models generate causally. Token five can see tokens one through four and nothing ahead of it. That constraint is what forces the left-to-right march.

A diffusion model drops that mask inside a block of tokens. Here’s the loop DiffusionGemma runs instead:

1. Lay down a “canvas” of 256 placeholder tokens — masked, essentially noise.
2. Run a forward pass where every position attends to every other position, in both directions, and predict the whole block at once.
3. Keep the predictions the model is most confident about, re-mask the rest, and run it again.
4. Repeat for a handful of denoising steps until the block resolves into coherent text.

Each pass sharpens the canvas. The confident tokens anchor the uncertain ones around them — once the model is sure position 40 is the word “function”, that constrains what positions 39 and 41 can plausibly be. Because every position is visible to every other, the model can fix a word it laid down two steps earlier. Autoregressive models can’t do that; once a token is emitted, it’s committed. DiffusionGemma self-corrects as it goes.
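The keep-and-remask loop is easier to see in a toy. This is a minimal sketch, not DiffusionGemma’s actual sampler: `toy_predict` is a hypothetical stand-in for the forward pass that simply reads the answer from a known target and attaches a random confidence, and the even per-step quota is an assumed schedule. It also only commits tokens; it doesn’t model the revision of earlier tokens described above.

```python
import random

MASK = None  # marks a position that has not been committed yet

def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for the model's forward pass.

    Returns a (token, confidence) guess for every masked position. A real
    model would attend over the whole canvas; this toy reads the answer
    from `target` and attaches a random confidence.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(canvas) if tok is MASK}

def denoise_block(target, steps, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    for step in range(steps):
        guesses = toy_predict(canvas, target, rng)
        if not guesses:
            break
        # Assumed schedule: commit an even share of what is left each step,
        # and everything that remains on the final step.
        remaining_steps = steps - step
        quota = -(-len(guesses) // remaining_steps)  # ceiling division
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:quota]:
            canvas[i] = guesses[i][0]  # keep the most confident predictions
        # Everything else stays masked and is predicted again next step.
    return canvas

words = "the quick brown fox jumps over the lazy dog".split()
print(denoise_block(words, steps=3))
```

Nine positions resolve in three passes rather than nine, and the order in which they fill in follows confidence, not position.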

It’s worth being precise about the “parallel” claim, because it’s not a free-for-all. DiffusionGemma is block-autoregressive: it denoises one 256-token canvas to completion, encodes it into the KV cache, then starts the next canvas with the finished text behind it as context. So the long-range structure of a document is still built left to right, block by block. The parallelism — and the bidirectional attention — lives inside each block. That hybrid is the design: autoregressive where order matters across the whole output, diffusion where it can buy speed within a window.
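The outer loop can be sketched in a few lines. This is an illustration of the block-by-block structure only: `generate` and the `denoise_block` callback are hypothetical names, and the toy denoiser just emits labels derived from the context length so you can see that each block is conditioned on everything before it.

```python
BLOCK = 256  # canvas length from the spec table

def generate(prompt, denoise_block, n_blocks):
    """Produce blocks in order; each finished block becomes fixed context."""
    context = list(prompt)
    for _ in range(n_blocks):
        # All of the parallel, bidirectional refinement happens inside this call.
        block = denoise_block(context, BLOCK)
        assert len(block) == BLOCK
        context.extend(block)  # analogous to encoding the block into the KV cache
    return context[len(prompt):]

def toy_denoiser(context, size):
    """Hypothetical stand-in: labels each token with its absolute position."""
    return [f"tok{len(context) + i}" for i in range(size)]

out = generate(["hi"], toy_denoiser, n_blocks=3)
print(len(out), out[0], out[256])
```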

The number of denoising steps is a dial. Fewer steps means faster output and rougher text; more steps means slower and cleaner. Reported defaults sit well under the 256 you’d need if you were generating token by token, which is the whole point — you’re trading sequential steps for parallel compute.

The speed story

This is the headline, and it holds up. The vLLM team measured the FP8 checkpoint at around 1,008 tokens per second on a single H100; DiffusionGemma is also the first diffusion LLM with native vLLM support. Other reports put an RTX 5090 north of 700 tokens per second. Google’s own framing is “up to 4x” the throughput of a comparable autoregressive baseline.

The reason the gain shows up most at low batch sizes is worth sitting with, because it explains who actually benefits. When you’re serving one request, an autoregressive model is bottlenecked on memory bandwidth — it streams the entire weight set through the chip to produce a single token, then does it again. The expensive compute units sit mostly idle. DiffusionGemma takes that idle compute and spends it: it processes a whole 256-token block per pass, so the same memory read does far more work. You’re converting spare FLOPs into finished tokens.
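A rough way to see the trade is to treat each forward pass as limited purely by weight reads and count how many tokens each pass finishes. Everything numeric below is an illustrative assumption: the bandwidth, the bytes per pass, and the step counts are made up, not measured. The sketch also assumes a block pass reads the same bytes as a single-token pass, which may be optimistic for an MoE, since 256 tokens can route to more experts than one token does.

```python
def tokens_per_second(bandwidth_bytes_per_s, bytes_read_per_pass, tokens_per_pass):
    """Throughput ceiling when every forward pass is bound by weight reads."""
    passes_per_second = bandwidth_bytes_per_s / bytes_read_per_pass
    return passes_per_second * tokens_per_pass

def diffusion_tokens_per_pass(block_len, denoise_steps):
    """Average tokens finished per forward pass for one block."""
    return block_len / denoise_steps

# Illustrative inputs only, not measurements.
BANDWIDTH = 1.0e12      # 1 TB/s, hypothetical accelerator
BYTES_PER_PASS = 4.0e9  # hypothetical weight bytes streamed per pass

ar = tokens_per_second(BANDWIDTH, BYTES_PER_PASS, tokens_per_pass=1)
for steps in (64, 32, 16):
    per_pass = diffusion_tokens_per_pass(256, steps)
    diff = tokens_per_second(BANDWIDTH, BYTES_PER_PASS, per_pass)
    print(f"{steps:>2} steps: {per_pass:4.0f} tokens/pass, {diff / ar:4.0f}x the AR ceiling")
```

These are ceilings, not predictions. Real speedups will be smaller once compute limits and attention costs enter, but the shape is the one described above: fewer denoising steps means more finished tokens per memory read.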

That trade flips at high batch sizes, where autoregressive serving already keeps the hardware busy by batching many users together. So DiffusionGemma’s advantage is sharpest exactly where local and single-user inference lives — your machine, one prompt at a time. That’s the bet Google is making with an open release.

The quality tradeoff

Speed isn’t free, and Google says so plainly. The reported benchmark numbers put DiffusionGemma below the autoregressive Gemma 4 model it’s built on, fairly consistently:

Benchmark	DiffusionGemma	Gemma 4 26B-A4B
MMLU-Pro	77.6%	82.6%
GPQA Diamond	73.2%	82.3%
AIME 2026	69.1%	88.3%
LiveCodeBench v6	69.1%	77.1%
Codeforces (ELO)	1429	1718

Treat these as the reported figures rather than gospel; they come from early coverage of the model card. The shape is what matters: five to nine points down on the knowledge and code benchmarks, and about 19 points down on hard math (AIME). The gap widens exactly where dependencies between tokens are tightest, which is what you’d predict from a model that commits tokens in parallel.

Google’s own framing in the announcement is refreshingly blunt: because the model “prioritizes speed and parallel layout generation,” its output quality is lower than standard Gemma 4, and for applications that demand maximum quality they recommend deploying standard Gemma 4 instead. This is an experimental branch, not a flagship. If you want the best answer and you’re willing to wait, autoregressive models still win. If you want a good answer right now, the calculus changes.

There’s also a structural limitation baked into the canvas approach. The model denoises a fixed-size block, so generation isn’t open-ended in the way next-token prediction is. Producing long outputs means stitching blocks together, and the architecture is happier with bounded, structured generation than with rambling.

Where it’s genuinely better

The fixed-block, bidirectional design isn’t only a constraint. It’s a structural advantage for a specific class of task: infilling.

Autoregressive models are bad at filling a gap in the middle of existing text, because they only know how to append. You have to play tricks with prompting to get them to respect what comes after the hole. DiffusionGemma treats the gap as just more canvas to denoise, with the surrounding text acting as fixed context on both sides. Code completion in the middle of a function, filling a template, any task where constraints come from both directions — this is where bidirectional attention pays off.
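Here is a minimal sketch of how an infilling canvas could be laid out, assuming a hypothetical `<mask>` placeholder and hard-coded predictions in place of a real model. The helper names are invented for illustration. The part that matters is that only the gap is editable, while the text on both sides stays fixed.

```python
MASK = "<mask>"  # hypothetical placeholder token

def build_infill_canvas(prefix, suffix, gap_len):
    """Return the canvas plus the set of positions the model may rewrite."""
    canvas = list(prefix) + [MASK] * gap_len + list(suffix)
    editable = set(range(len(prefix), len(prefix) + gap_len))
    return canvas, editable

def apply_predictions(canvas, editable, predictions):
    """Write predictions into the gap only; prefix and suffix stay fixed."""
    return [predictions.get(i, tok) if i in editable else tok
            for i, tok in enumerate(canvas)]

prefix = ["def", "area", "(", "r", ")", ":"]
suffix = ["return", "result"]
canvas, editable = build_infill_canvas(prefix, suffix, gap_len=3)

# A real model would predict these from both sides; here they are hard-coded.
# The prediction for position 0 is ignored because it lies outside the gap.
predictions = {6: "result", 7: "=", 8: "3.14*r*r", 0: "class"}
filled = apply_predictions(canvas, editable, predictions)
print(filled)
```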

So the sweet spot is latency-sensitive, structured generation: autocomplete, code infilling, form filling, fast drafting in an interactive loop. Not long-form reasoning where you’d reach for a slower, sharper model anyway.

Running it yourself

DiffusionGemma landed with day-one support across the usual stack — Hugging Face Transformers, vLLM, SGLang, MLX, and Unsloth — plus community quantisations for llama.cpp, Ollama, and LM Studio. NVIDIA also published an NVFP4 build for its hardware.

The Transformers path uses a dedicated model class rather than the usual AutoModelForCausalLM, because the generation loop is fundamentally different:

python

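from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor

MODEL_ID = "google/diffusiongemma-26B-A4B-it"

# device_map="auto" spreads the weights across whatever devices are available.
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    MODEL_ID,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{"role": "user", "content": "Write a haiku about parallelism."}]
# tokenize=True and return_dict=True make the processor return a dict of
# tensors rather than a formatted string, so the result can be moved to
# the device and unpacked into generate().
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# 256 new tokens is the length of one canvas.
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))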

One gotcha worth flagging from the docs: DiffusionGemma doesn’t accept use_cache — it always uses a KV cache internally, and you set cache_implementation="static" if you want to torch.compile the model. The KV-cache story is different from autoregressive serving, which is part of why it needed dedicated support in vLLM rather than dropping into the existing path.

For a server, vLLM exposes it through the same OpenAI-compatible endpoint as everything else:

bash

vllm serve google/diffusiongemma-26B-A4B-it

Then point any OpenAI client at it. The diffusion mechanics stay under the hood; from the API’s perspective you send a prompt and get tokens back. If you want to watch the canvas resolve, Transformers ships a streamer that surfaces the intermediate drafts as the model denoises — a genuinely strange thing to watch, since the text doesn’t grow left to right, it sharpens into focus all at once.
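If you’d rather not install a client library, the same endpoint can be reached with the standard library. This is a sketch under a few assumptions: the server is running locally on vLLM’s default port 8000, the model id matches the one passed to `vllm serve`, and the helper names `build_chat_request` and `send` are invented here for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"       # assumes vLLM's default host and port
MODEL = "google/diffusiongemma-26B-A4B-it"  # same id passed to `vllm serve`

def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(send(build_chat_request("Write a haiku about parallelism.")))` should return the completion. Nothing in the request reveals that the model behind it is denoising blocks rather than appending tokens.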

What the community thinks

The reception on Hacker News and r/LocalLLaMA has been cautiously positive, with the emphasis on cautious.

The excitement is about the direction more than this specific checkpoint. An open, downloadable diffusion LLM with real tooling behind it is something the community has wanted since Inception Labs’ Mercury and the LLaDA papers made the case that text diffusion could be fast. Google shipping one under Apache 2.0, with vLLM support on day one, moves it from research curiosity to something you can actually build on.

The skepticism is practical, and most of it is about tooling rather than the model. The diffusion decode loop doesn’t map onto the runtimes people actually use locally — early on, mlx-lm returned a flat “Model type diffusion_gemma not supported”, and LM Studio couldn’t load it either. The vLLM and Transformers paths worked on day one; the cosy local stack did not. There was also a round of confusion when NVIDIA’s NIM endpoint shipped with an 8,192-token context default, which several people mistook for an architecture limit rather than a config knob on a model that handles 256K.

Underneath the tooling gripes sits the real reservation: quality trails Gemma 4, and the fixed-length canvas rubs against expectations set by a decade of autoregressive models. The recurring take is that DiffusionGemma is a strong proof of concept and a genuine speed win for the right workload, not a Gemma 4 replacement. There’s also a research-side rebuttal worth knowing about — benchmarks like ParallelBench argue that parallel decoding quietly degrades on tasks with tight token dependencies, the kind standard evals don’t probe. The speed is real; whether it survives contact with your specific workload is the open question.

Where this leaves you

DiffusionGemma is the open counterpart to Google’s earlier, closed Gemini Diffusion work — the same idea, now with weights you can download and inspect.

If you’re building something latency-sensitive — autocomplete, code infilling, fast interactive drafting — it’s worth benchmarking against your current model. The throughput holds up and the infilling behaviour is a genuine architectural edge, not a marketing line. If you want maximum quality on hard reasoning or long-form generation, stick with autoregressive Gemma 4 or a frontier model; that’s not what this is for.

The longer-term point is the one to hold onto. For the entire modern era of language models, “generate text” has meant “predict the next token”. DiffusionGemma is the first open model serious enough to make you question whether that has to be true. Even if this particular checkpoint stays a niche tool, the loop it breaks is the interesting part.