google/diffusiongemma-26B-A4B-it

Google's DiffusionGemma — a block-diffusion language model built on Gemma 4's MoE backbone (26B total / 4B active). Generates tokens via iterative denoising over a fixed-length canvas rather than left-to-right autoregressive decoding, enabling higher throughput with parallel block generation.

Block-diffusion MoE — 26B total / 4B active, canvas-based parallel generation with ~1.9x throughput vs autoregressive baseline


moe · 26B / 4B · 262,144 ctx · vLLM 0.24.0+ · multimodal · text

Guide

Overview

DiffusionGemma 26B-A4B is a block-diffusion language model built on Gemma 4's MoE backbone. Instead of generating tokens one at a time (autoregressive), it generates blocks of 256 tokens simultaneously via iterative denoising over a fixed canvas — trading higher time-to-first-token for significantly higher per-request generation throughput.

Key Architecture Features

Block Diffusion: Generates tokens in 256-token canvas blocks via iterative denoising (up to 48 steps per block).
MoE Backbone: Same Gemma 4 architecture — 128 fine-grained experts with top-8 routing, 26B total / 4B active parameters.
Entropy-Bound Sampling: Uses an entropy-based sampler (diffusion_sampler: entropy_bound, diffusion_entropy_bound: 0.1) for the denoising process.
Multimodal: Supports text + images via the Gemma 4 vision encoder.
Thinking Mode: Supports structured reasoning via <|channel>thought\n...<channel|> delimiters.
Function Calling: Supports Gemma 4's tool-call protocol (works best in thinking mode).
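
The sampler internals are not described in this guide, so the following is only a toy illustration of what one entropy-bound denoising loop could look like. It assumes a masked-token formulation in which each step commits the most confident positions first and stops once their summed entropy would exceed the bound. The names `select_positions`, `denoise_block`, and the `predict` callback are hypothetical and are not vLLM or DiffusionGemma APIs.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(entropies, bound=0.1):
    """Choose which still-masked positions to commit in this step.

    Positions are visited from lowest to highest entropy. The first one is
    always taken so every step makes progress; further ones are added only
    while the running entropy total stays within `bound`.
    """
    chosen, total = [], 0.0
    for pos in sorted(entropies, key=entropies.get):
        if chosen and total + entropies[pos] > bound:
            break
        chosen.append(pos)
        total += entropies[pos]
    return chosen


def denoise_block(predict, canvas_length=256, bound=0.1, max_steps=48):
    """Fill a canvas of `canvas_length` slots in at most `max_steps` steps.

    `predict(canvas)` is a hypothetical stand-in for the model: it returns
    {position: list of probabilities} for every slot that is still None.
    """
    canvas = [None] * canvas_length
    for step in range(max_steps):
        dists = predict(canvas)
        if not dists:
            break  # every slot has been committed
        entropies = {pos: entropy(p) for pos, p in dists.items()}
        # On the final allowed step, commit everything that is left.
        last_step = step == max_steps - 1
        picked = list(dists) if last_step else select_positions(entropies, bound)
        for pos in picked:
            probs = dists[pos]
            canvas[pos] = max(range(len(probs)), key=probs.__getitem__)
    return canvas
```

Under these assumptions, confident positions are committed many at a time while uncertain ones are resolved one per step, which is where the parallel speedup would come from.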

Important Deployment Notes

DiffusionGemma requires several specific flags that differ from standard Gemma 4:

Flag	Why
`--max-num-seqs 4`	The diffusion state buffers (`self_conditioning_probs`) pre-allocate `max_seqs × canvas_length × vocab_size` tensors. With Gemma's 262K vocab and canvas_length=256, higher values cause OOM.
`--generation-config vllm`	The checkpoint's `generation_config.json` sets `max_tokens: 256`, which would otherwise cap output length regardless of per-request limits. This flag makes vLLM ignore that file and use its own defaults.
`--gpu-memory-utilization 0.85`	Leaves headroom for activation memory during denoising.
`--hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}'`	Configures the entropy-bound denoising sampler.
`--diffusion-config '{"canvas_length": 256}'`	Sets the canvas block size for generation.
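
To make the `--max-num-seqs` constraint concrete, here is a back-of-the-envelope estimate of one `max_seqs × canvas_length × vocab_size` buffer. The exact vocabulary size (262,144) and the 4-byte element width are assumptions for illustration; the real buffer dtype and the number of such buffers are not stated in this guide.

```python
def self_conditioning_bytes(max_seqs, canvas_length=256,
                            vocab_size=262_144, bytes_per_element=4):
    """Size in bytes of one max_seqs x canvas_length x vocab_size buffer.

    vocab_size and bytes_per_element are assumed values, not read from the
    checkpoint.
    """
    return max_seqs * canvas_length * vocab_size * bytes_per_element


for max_seqs in (4, 64, 256):
    gib = self_conditioning_bytes(max_seqs) / 2**30
    print(f"max_num_seqs={max_seqs:>3}: {gib:.0f} GiB per buffer")
# max_num_seqs=  4: 1 GiB per buffer
# max_num_seqs= 64: 16 GiB per buffer
# max_num_seqs=256: 64 GiB per buffer
```

Under these assumptions the buffer grows linearly with `--max-num-seqs`, so a value in the hundreds would claim a large share of an 80 GB GPU before weights and KV cache are counted.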

Prerequisites

DiffusionGemma requires a vLLM build with diffusion model support, available in the Gemma Docker image:

docker pull vllm/vllm-openai:gemma

Deployment

Single GPU (H100/H200, BF16)

docker run -itd --name diffusiongemma \
    --ipc=host --network host --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:gemma \
        --model google/diffusiongemma-26B-A4B-it \
        --max-model-len 262144 \
        --max-num-seqs 4 \
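        --gpu-memory-utilization 0.85 \
        --generation-config vllm \
        --hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}' \
        --diffusion-config '{"canvas_length": 256}' \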
        --host 0.0.0.0 --port 8000

Full-Featured Server (text + image + thinking + tools)

docker run -itd --name diffusiongemma \
    --ipc=host --network host --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:gemma \
        --model google/diffusiongemma-26B-A4B-it \
        --max-model-len 262144 \
        --max-num-seqs 4 \
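        --gpu-memory-utilization 0.85 \
        --generation-config vllm \
        --hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}' \
        --diffusion-config '{"canvas_length": 256}' \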
        --mm-processor-kwargs '{"max_soft_tokens": 1120}' \
        --limit-mm-per-prompt '{"image": 7}' \
        --enable-auto-tool-choice \
        --tool-call-parser gemma4 \
        --reasoning-parser gemma4 \
        --host 0.0.0.0 --port 8000

Client Usage

Text Generation

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
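
Image Input

When the server is launched with the multimodal flags from the full-featured deployment, the same chat endpoint should accept images as OpenAI-style `image_url` content parts. The sketch below is a minimal example under that assumption; `build_image_messages` and `describe_image` are hypothetical helpers, the example URL is a placeholder, and the 7-image cap simply mirrors `--limit-mm-per-prompt '{"image": 7}'`.

```python
def build_image_messages(question, image_urls, max_images=7):
    """Build one OpenAI-style user message with image parts followed by text."""
    if len(image_urls) > max_images:
        raise ValueError(f"at most {max_images} images per prompt")
    content = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


def describe_image(client, image_url):
    """Send one image to the server; `client` is an openai.OpenAI instance."""
    response = client.chat.completions.create(
        model="google/diffusiongemma-26B-A4B-it",
        messages=build_image_messages("Describe this image.", [image_url]),
        max_tokens=512,
    )
    return response.choices[0].message.content


# Example (needs a running server and a reachable image URL):
# print(describe_image(client, "https://example.com/cat.jpg"))
```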

Offline Inference

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "google/diffusiongemma-26B-A4B-it"
tokenizer = AutoTokenizer.from_pretrained(model_path)

llm = LLM(
    model=model_path,
    max_num_seqs=4,
    gpu_memory_utilization=0.85,
    generation_config="vllm",  # ignore the checkpoint's max_tokens: 256 default
    hf_overrides={
        "diffusion_sampler": "entropy_bound",
        "diffusion_entropy_bound": 0.1,
    },
    diffusion_config={"canvas_length": 256},
)

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompt, SamplingParams(temperature=0.0, max_tokens=1024))

print(outputs[0].outputs[0].text)

Thinking Mode

Launch the server with --reasoning-parser gemma4, then enable thinking per request:

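# Reuses the client created in the Text Generation example.
response = client.chat.completions.create(
    model="google/diffusiongemma-26B-A4B-it",
    messages=[{"role": "user", "content": "What is the derivative of x^3 * ln(x)?"}],
    max_tokens=32768,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)  # final answer; the server separates out the reasoning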
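
Without the reasoning parser (for example in offline inference), the reasoning may appear inline in the generated text, wrapped in the `<|channel>thought\n...<channel|>` delimiters listed under Key Architecture Features. A minimal sketch of separating it by hand, assuming at most one thought block per response and that the delimiters survive decoding; `split_thinking` is a hypothetical helper, not part of vLLM.

```python
import re

# Matches the thought block delimiters described in this guide.
THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)


def split_thinking(raw):
    """Return (reasoning, answer) from raw generated text."""
    match = THOUGHT_RE.search(raw)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


raw = "<|channel>thought\nUse the product rule.<channel|>x^2 * (3 ln x + 1)"
print(split_thinking(raw))
# ('Use the product rule.', 'x^2 * (3 ln x + 1)')
```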

Performance Characteristics

Compared to the autoregressive Gemma 4 26B-A4B baseline (single H100, concurrency=1, SPEED-Bench):

Metric	Gemma 4 26B-A4B (baseline)	DiffusionGemma 26B-A4B
Output TPS	199 tok/s	375 tok/s (1.9×)
E2E Request Time (mean)	2.87s	0.88s (3.3× faster)
TTFT (mean)	53ms	489ms (higher due to canvas denoising setup)
Per-request Gen TPS (mean)	205 tok/s	1,282 tok/s (6.2×)

The diffusion model trades higher time-to-first-token for significantly higher generation throughput. In this benchmark it also produced fewer tokens per prompt on average, so part of the 3.3× end-to-end speedup comes from shorter outputs rather than faster generation alone.
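
To check the TTFT and end-to-end figures on your own hardware, one option is to time a streamed response. The helper below is a hypothetical sketch that works with any callable returning an iterable of text chunks; the commented usage assumes the OpenAI client from Client Usage with `stream=True`. It measures client-side wall time only, so results will not match a dedicated benchmark harness exactly.

```python
import time


def measure_stream(start_stream, clock=time.perf_counter):
    """Time a streamed response.

    `start_stream` is a zero-argument callable that issues the request and
    returns an iterable of text chunks. The clock starts before the request
    is issued so that request latency counts toward time-to-first-token.
    Returns (ttft_seconds or None, total_seconds, non_empty_chunk_count).
    """
    start = clock()
    first = None
    n_chunks = 0
    for text in start_stream():
        if not text:
            continue  # skip role-only or empty deltas
        if first is None:
            first = clock()
        n_chunks += 1
    end = clock()
    ttft = None if first is None else first - start
    return ttft, end - start, n_chunks


# Example (needs a running server):
# def start_stream():
#     stream = client.chat.completions.create(
#         model="google/diffusiongemma-26B-A4B-it",
#         messages=[{"role": "user", "content": "Write a poem about the ocean."}],
#         max_tokens=512,
#         stream=True,
#     )
#     return (c.choices[0].delta.content or "" for c in stream if c.choices)
#
# ttft, total, n_chunks = measure_stream(start_stream)
# print(f"TTFT {ttft * 1000:.0f} ms, total {total:.2f} s, {n_chunks} chunks")
```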

Known Limitations

TTFT: About 9× higher than the autoregressive baseline (489 ms vs 53 ms in the benchmark above), because the model must denoise an entire canvas block before emitting the first token.
--max-num-seqs: Must be kept low (≤4) due to the large diffusion state tensors. Higher values cause CUDA OOM.
Audio: Not supported — the diffusion checkpoints do not include an audio encoder.
Function Calling: Works best in thinking mode. Non-thinking mode may answer directly instead of emitting tool calls.
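
Given the function-calling note above, a tool-call request would typically enable thinking. The sketch below assumes the full-featured server (with `--enable-auto-tool-choice` and `--tool-call-parser gemma4`) and the standard OpenAI `tools` request format; the `get_weather` tool and both helper functions are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(question, thinking=True):
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "google/diffusiongemma-26B-A4B-it",
        "messages": [{"role": "user", "content": question}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
        "max_tokens": 4096,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


def extract_tool_calls(message):
    """Return [(name, arguments_dict)] from a chat completion message."""
    calls = getattr(message, "tool_calls", None) or []
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]


# Example (needs the full-featured server):
# response = client.chat.completions.create(**build_tool_request("Weather in Paris?"))
# print(extract_tool_calls(response.choices[0].message))
```

If `extract_tool_calls` returns an empty list, the model answered directly, which the note above suggests is more likely with thinking disabled.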