Google/gemma-4-12B-it

Google's encoder-free unified Gemma 4 dense model (12B) with native text, image, and audio input, plus structured thinking and function calling. Runs on a single 40 GB+ GPU.

dense | 12B | 131,072 ctx | vLLM nightly+ | multimodal | text

Overview

Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. The 12B Unified model is encoder-free: instead of dedicated vision/audio encoders, it projects raw image patches and audio waveforms directly into the decoder's embedding space through lightweight linear layers, so every modality flows into a single decoder-only transformer. Gemma 4 models support structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution — all available through vLLM's OpenAI-compatible API.

Key Features

  • Encoder-free multimodality: Text, image, and audio inputs handled by a single decoder-only transformer — no separate encoders to load, reducing multimodal latency.
  • Dual Attention: Alternating sliding-window (local, 1024 tokens) and global attention; the final layer is always global, with unified Keys/Values and Proportional RoPE on global layers for long-context efficiency.
  • Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
  • Function Calling: Custom tool-call protocol with dedicated special tokens.
  • Native System Prompt Support: First-class system role.

Supported Variants

Dense:

  • google/gemma-4-E2B-it (effective 2B)
  • google/gemma-4-E4B-it (effective 4B)
  • google/gemma-4-12B-it (11.95B, encoder-free)
  • google/gemma-4-31B-it (31B)

MoE:

  • google/gemma-4-26B-A4B-it (26B total / 4B active)

Context length: this recipe pins context_length to the value in config.json (max_position_embeddings = 131072, i.e. 128K). The Gemma 4 model card markets the 12B at up to 256K — raise --max-model-len only if your vLLM build accepts it.
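As a minimal sketch of the check described above, the helper below reads max_position_embeddings from a parsed config.json and clamps a requested --max-model-len to it. The function names are hypothetical, and the fallback to a nested text_config block is an assumption about how multimodal configs are often laid out, not something this recipe confirms.

```python
import json


def max_context_from_config(config: dict) -> int:
    """Return max_position_embeddings from a parsed config.json.

    Checks the top level first, then a nested "text_config" block
    (assumed layout for multimodal checkpoints).
    """
    if "max_position_embeddings" in config:
        return int(config["max_position_embeddings"])
    return int(config.get("text_config", {})["max_position_embeddings"])


def clamp_max_model_len(requested: int, config: dict) -> int:
    """Never ask for more context than the config advertises."""
    return min(requested, max_context_from_config(config))


# Stand-in for the real file; only the value 131072 comes from this recipe.
sample_config = json.loads('{"text_config": {"max_position_embeddings": 131072}}')
print(clamp_max_model_len(262144, sample_config))
```

The clamped value can then be passed as --max-model-len; going above it is only worth trying if your vLLM build accepts the larger setting.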

TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.

Nightly required: support for the encoder-free 12B Unified model landed in vllm-project/vllm#44429 and has not yet shipped in a stable release. Use the pinned Docker image (recommended) or a nightly pip wheel.

Prerequisites

Docker (pinned image, recommended)

docker pull vllm/vllm-openai:gemma4-unified            # NVIDIA (CUDA 13; append -cu129 for CUDA 12.9 hosts)

TPU images are published separately by vllm-project/tpu-inference; see the Trillium / Ironwood tpu-recipes below for the pinned tag.

pip (nightly, NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

Deployment Configurations

Quick Start (Single GPU, BF16)

In BF16 the weights take roughly 24 GB (11.95B parameters at 2 bytes each), so the 12B fits on a single 40 GB+ GPU with room left for KV cache.

vllm serve google/gemma-4-12B-it \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Full-featured server: the following command enables text, image, audio, thinking, and tool calling.

vllm serve google/gemma-4-12B-it \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
  --async-scheduling \
  --host 0.0.0.0 \
  --port 8000
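Once a server is running with --enable-auto-tool-choice and --tool-call-parser gemma4, a tool-calling request could look roughly like the sketch below. It uses the standard OpenAI tools schema; the get_weather tool, the registry, and the helper names are hypothetical and exist only for illustration. The client argument is assumed to be an openai.OpenAI instance pointed at the server.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool used only for this example."""
    return f"Sunny in {city}"


# Tool description in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; arguments arrive as a JSON-encoded string."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name](**json.loads(arguments_json))


def ask_with_tools(client, prompt: str) -> list:
    """Send a prompt with tools attached and execute any returned tool calls."""
    response = client.chat.completions.create(
        model="google/gemma-4-12B-it",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    calls = response.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

In a full loop the tool results would be appended as role "tool" messages and sent back so the model can produce a final answer.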

Docker (NVIDIA)

docker run -itd --name gemma4 \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4-unified \
    --model google/gemma-4-12B-it \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --host 0.0.0.0 --port 8000

On CUDA 12.9 hosts, use the vllm/vllm-openai:gemma4-unified-cu129 tag instead.

Docker (Cloud TPU — Trillium / Ironwood)

TPU uses the separate vllm/vllm-tpu image; no pip wheel is available. Pull the tag specified by the upstream Trillium or Ironwood recipe, then run the command below, substituting that tag for :latest:

docker run -itd --name gemma4-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
    --model google/gemma-4-12B-it \
    --max-model-len 16384 \
    --disable-chunked-mm-input \
    --host 0.0.0.0 --port 8000

Client Usage

Text Generation

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/gemma-4-12B-it",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512, temperature=0.7,
)
print(response.choices[0].message.content)

Image Understanding

response = client.chat.completions.create(
    model="google/gemma-4-12B-it",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=1024,
)
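The request above fetches the image from a public URL. For local files, one common approach is to embed the bytes as a base64 data URL in the same image_url field. The sketch below assumes that approach; the helper names are hypothetical, and the MIME type must match the actual file.

```python
import base64


def image_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(image_bytes: bytes, question: str, mime: str = "image/jpeg") -> dict:
    """Build a user message with one inline image followed by a text question."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_data_url(image_bytes, mime)}},
            {"type": "text", "text": question},
        ],
    }
```

A message built this way can be passed in messages=[...] exactly like the URL-based example, for instance image_message(open("photo.jpg", "rb").read(), "Describe this image in detail.").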

Audio

Audio input requires the audio extras: uv pip install "vllm[audio]". Then start the server with an audio limit per prompt:

vllm serve google/gemma-4-12B-it \
  --max-model-len 8192 \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}'
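On the client side, an audio request might be built as in the sketch below. It assumes the server accepts the OpenAI-style input_audio content part with base64-encoded data; the helper name and the default "wav" format are illustrative choices, not values taken from this recipe.

```python
import base64


def audio_message(audio_bytes: bytes, question: str, fmt: str = "wav") -> dict:
    """Build a user message with one inline audio clip and a text instruction."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # OpenAI-style audio part: base64 payload plus container format.
            {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}},
            {"type": "text", "text": question},
        ],
    }
```

The resulting dict goes into messages=[...] of client.chat.completions.create, in the same way as the text and image examples. Keep to one clip per request to stay within the audio limit of 1 set by --limit-mm-per-prompt.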

Thinking Mode

vllm serve google/gemma-4-12B-it \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja

Enable thinking per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}, or default-on with --default-chat-template-kwargs '{"enable_thinking": true}'.
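A per-request toggle could be wired up as in the following sketch. The helper names are hypothetical. The attribute that carries the parsed reasoning has differed between vLLM versions, so the code checks both reasoning and reasoning_content; treat that as an assumption to verify against your build.

```python
def thinking_kwargs(prompt: str, enable_thinking: bool = True) -> dict:
    """Keyword arguments for client.chat.completions.create with thinking toggled."""
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 2048,
        # extra_body is forwarded verbatim to the server by the OpenAI client.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


def split_reasoning(message) -> tuple:
    """Return (reasoning, answer) from a response message.

    The reasoning attribute name is an assumption; either may be absent.
    """
    reasoning = getattr(message, "reasoning", None) or getattr(message, "reasoning_content", None)
    return reasoning, message.content
```

Typical use, given a client created as in the Text Generation example: response = client.chat.completions.create(**thinking_kwargs("What is 17 * 23?")), followed by split_reasoning(response.choices[0].message).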

Structured Outputs

vLLM guided decoding constrains output to a JSON schema. Include semantic instructions in the system prompt — the model does not see schema descriptions.
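The sketch below illustrates this, assuming the server accepts the OpenAI-style response_format with type json_schema. The schema, the city_info name, and the helper functions are made up for the example. Note that the meaning of each field is spelled out in the system prompt, since the schema alone only constrains the shape of the output.

```python
import json

# Illustrative schema; field names are hypothetical.
SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
    "additionalProperties": False,
}

SYSTEM_PROMPT = (
    "Answer with a JSON object. 'city' is the English name of the city; "
    "'population' is the most recent census count as an integer."
)


def structured_kwargs(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create with a JSON schema."""
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "city_info", "schema": SCHEMA},
        },
    }


def parse_city(raw: str) -> dict:
    """Parse the model output and check that required keys are present."""
    data = json.loads(raw)
    missing = [key for key in SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

The response text from response.choices[0].message.content can be passed straight to parse_city; the extra key check is a cheap guard in case the output was truncated by max_tokens.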

Configuration Tips

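  • Set --max-model-len to the longest prompt plus output your workload needs; vLLM must fit at least one sequence of that length in the KV cache.
  • --gpu-memory-utilization 0.90–0.95 leaves the most memory for KV cache.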
  • Image-only workloads: pass --limit-mm-per-prompt.audio 0.
  • Text-only workloads: pass --limit-mm-per-prompt '{"image": 0, "audio": 0}' to skip MM profiling.
  • --async-scheduling overlaps scheduling with GPU execution, which improves throughput.
  • FP8 KV cache (--kv-cache-dtype fp8) saves ~50% of KV memory relative to BF16.
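The ~50% figure for FP8 follows from the element size alone, as the back-of-the-envelope calculation below shows. The layer count, KV head count, and head dimension are placeholders, not the real values for this model, and the formula ignores the sliding-window layers, which cap their cache at the window size.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token for full-attention layers.

    The factor 2 accounts for one key and one value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Placeholder dimensions for illustration only; read the real ones from config.json.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

bf16_bytes = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, dtype_bytes=2)
fp8_bytes = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, dtype_bytes=1)

print(f"BF16: {bf16_bytes} bytes/token, FP8: {fp8_bytes} bytes/token")
print(f"FP8 saving: {1 - fp8_bytes / bf16_bytes:.0%}")
```

Multiplying the per-token figure by --max-model-len and the expected number of concurrent sequences gives a rough upper bound on the KV cache your workload needs.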

Speculative Decoding (MTP)

Add --speculative-config to enable MTP drafting with the assistant model. The recommended num_speculative_tokens for this model is 4–8.

Note: MTP speculative decoding for Gemma 4 requires the nightly build — the pinned vllm/vllm-openai:gemma4-unified image or a nightly pip wheel. The standard :latest stable tag does not include this feature.

References