google/gemma-4-12B-it
Google's encoder-free unified Gemma 4 dense model (12B) with native text, image, and audio, plus thinking mode and tool-use protocol.
Encoder-free unified multimodal model with audio, structured thinking, and function calling — runs on a single 40 GB+ GPU
Guide
Overview
Gemma 4 is Google's most capable open model family, featuring a unified multimodal architecture that natively processes text, images, and audio. The 12B Unified model is encoder-free: instead of dedicated vision/audio encoders, it projects raw image patches and audio waveforms directly into the decoder's embedding space through lightweight linear layers, so every modality flows into a single decoder-only transformer. Gemma 4 models support structured thinking/reasoning, function calling with a custom tool-use protocol, and dynamic vision resolution — all available through vLLM's OpenAI-compatible API.
Key Features
- Encoder-free multimodality: Text, image, and audio inputs handled by a single decoder-only transformer — no separate encoders to load, reducing multimodal latency.
- Dual Attention: Alternating sliding-window (local, 1024 tokens) and global attention; the final layer is always global, with unified Keys/Values and Proportional RoPE on global layers for long-context efficiency.
- Thinking Mode: Structured reasoning via `<|channel>thought\n...<channel|>` delimiters.
- Function Calling: Custom tool-call protocol with dedicated special tokens.
- Native System Prompt Support: First-class `system` role.
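The two attention types named in the Dual Attention bullet differ only in which earlier tokens a position may see. The sketch below builds both masks in plain Python. The window convention (a token sees itself plus the preceding `window - 1` tokens) is an assumption, and the order in which local and global layers alternate is not modeled because it is not specified here.

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Global attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Local attention: token i may attend to the `window` most recent
    tokens, itself included (assumed convention)."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


# Tiny demonstration: window=3 stands in for the 1024-token window.
local = sliding_window_mask(6, 3)
glob = causal_mask(6)

# Number of positions the last token can see under each mask.
print(sum(local[5]), sum(glob[5]))
```

For the last of six tokens, the local mask admits three positions while the global mask admits all six, which is why local layers keep per-token attention cost bounded as the context grows.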
Supported Variants
Dense:
- google/gemma-4-E2B-it (effective 2B)
- google/gemma-4-E4B-it (effective 4B)
- google/gemma-4-12B-it (11.95B, encoder-free)
- google/gemma-4-31B-it (31B)
MoE:
- google/gemma-4-26B-A4B-it (26B total / 4B active)
Context length: this recipe pins `context_length` to the value in `config.json` (`max_position_embeddings = 131072`, i.e. 128K). The Gemma 4 model card markets the 12B at up to 256K; raise `--max-model-len` only if your vLLM build accepts it.
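One conservative way to pick a value for `--max-model-len` is to clamp the requested length to what `config.json` declares. The helper below is a hypothetical sketch, not part of vLLM. It assumes `max_position_embeddings` is either a top-level key or nested under `text_config`.

```python
def resolve_max_model_len(config: dict, requested: int) -> int:
    """Clamp a requested context length to the limit declared in config.json.

    `config` is the parsed config.json. The nested `text_config` lookup is
    an assumption about how multimodal configs may be laid out.
    """
    text_cfg = config.get("text_config", {})
    limit = config.get("max_position_embeddings",
                       text_cfg.get("max_position_embeddings"))
    if limit is None:
        raise KeyError("max_position_embeddings not found in config")
    return min(requested, int(limit))


config = {"max_position_embeddings": 131072}
print(resolve_max_model_len(config, 32768))    # request fits, kept as-is
print(resolve_max_model_len(config, 262144))   # request exceeds the pin, clamped
```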
TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.
Nightly required: support for the encoder-free 12B Unified model landed in vllm-project/vllm#44429 and has not yet shipped in a stable release. Use the pinned Docker image (recommended) or a nightly pip wheel.
Prerequisites
Docker (recommended)
docker pull vllm/vllm-openai:gemma4-unified # NVIDIA (CUDA 13; append -cu129 for CUDA 12.9 hosts)
TPU images are published separately by vllm-project/tpu-inference; see the Trillium / Ironwood tpu-recipes below for the pinned tag.
pip (nightly, NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu129 \
--extra-index-url https://download.pytorch.org/whl/cu129 \
--index-strategy unsafe-best-match
Deployment Configurations
Quick Start (Single GPU, BF16)
The 12B fits comfortably on a single 40 GB+ GPU.
vllm serve google/gemma-4-12B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
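A rough back-of-the-envelope check of the 40 GB claim, using the 11.95B parameter count listed under Supported Variants and 2 bytes per parameter for BF16. This is only a sketch: it ignores activations, CUDA graphs, and multimodal profiling overhead, so the real headroom for KV cache is smaller.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in decimal GB."""
    return num_params * bytes_per_param / 1e9


def remaining_budget_gb(gpu_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left inside vLLM's budget after loading the weights."""
    return gpu_gb * utilization - weights_gb


weights = weight_memory_gb(11.95e9)               # BF16 = 2 bytes per parameter
budget = remaining_budget_gb(40, 0.90, weights)   # --gpu-memory-utilization 0.90
print(f"weights: {weights:.1f} GB, left for KV cache and overhead: {budget:.1f} GB")
```

On a 40 GB card this leaves roughly 12 GB inside the 0.90 budget, which is why lowering `--max-model-len` is the first lever to pull if the server fails to allocate KV cache.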
Full-Featured Server Launch
Enables text, image, audio, thinking, and tool calling:
vllm serve google/gemma-4-12B-it \
--max-model-len 16384 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt '{"image": 4, "audio": 1}' \
--async-scheduling \
--host 0.0.0.0 \
--port 8000
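With the server above running, a tool-calling round trip might look like the following sketch. The `get_weather` tool and its canned reply are hypothetical. The request uses the standard OpenAI `tools` format, which the `--tool-call-parser gemma4` flag is expected to map onto the model's own tool-call tokens.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical local tool; returns canned data for illustration."""
    return f"Sunny in {city}"


TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def build_tool_request(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [{"role": "user", "content": question}],
        "tools": TOOLS,
        "tool_choice": "auto",
        "max_tokens": 512,
    }


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the local function matching a tool call returned by the model."""
    registry = {"get_weather": get_weather}
    return registry[name](**json.loads(arguments_json))


request = build_tool_request("What is the weather in Paris?")
# Simulated tool call, shaped like the name/arguments pair the server returns.
print(dispatch_tool_call("get_weather", '{"city": "Paris"}'))
```

To send it, pass the dictionary to an OpenAI client pointed at the server: `client.chat.completions.create(**request)`. Each entry in `response.choices[0].message.tool_calls` carries `function.name` and `function.arguments` (a JSON string), which can be handed to `dispatch_tool_call`.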
Docker (NVIDIA)
docker run -itd --name gemma4 \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:gemma4-unified \
--model google/gemma-4-12B-it \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 --port 8000
On CUDA 12.9 hosts, use the vllm/vllm-openai:gemma4-unified-cu129 tag instead.
Docker (Cloud TPU — Trillium / Ironwood)
TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:
docker run -itd --name gemma4-tpu \
--privileged --network host --shm-size 16G \
-v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
vllm/vllm-tpu:latest \
--model google/gemma-4-12B-it \
--max-model-len 16384 \
--disable_chunked_mm_input \
--host 0.0.0.0 --port 8000
Client Usage
Text Generation
from openai import OpenAI

# vLLM does not check the API key by default, but the client requires a value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-12B-it",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
Image Understanding
# Reuses the client created in Text Generation.
response = client.chat.completions.create(
    model="google/gemma-4-12B-it",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
Audio
Requires uv pip install "vllm[audio]".
vllm serve google/gemma-4-12B-it \
--max-model-len 8192 \
--limit-mm-per-prompt '{"image": 4, "audio": 1}'
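A client request with audio can be sketched as follows, assuming the server accepts the OpenAI-style `input_audio` content part with base64-encoded WAV data. The silent clip generated here is only a stand-in for a real recording, and the 16 kHz mono format is an assumption rather than a documented requirement.

```python
import base64
import io
import wave


def make_silent_wav(seconds: float = 0.1, sample_rate: int = 16000) -> bytes:
    """Build a short mono 16-bit PCM WAV in memory (placeholder audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def build_audio_request(wav_bytes: bytes, prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [{"role": "user", "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": prompt},
        ]}],
        "max_tokens": 512,
    }


clip = make_silent_wav()
request = build_audio_request(clip, "Transcribe this audio.")
print(len(clip), "bytes of WAV encoded into the request")
```

Send it with `client.chat.completions.create(**request)` using the client from Text Generation. For real use, replace `make_silent_wav()` with the bytes of an actual recording.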
Thinking Mode
vllm serve google/gemma-4-12B-it \
--max-model-len 16384 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template examples/tool_chat_template_gemma4.jinja
Enable thinking per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}, or default-on with --default-chat-template-kwargs '{"enable_thinking": true}'.
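The per-request switch can be sketched as below. With `--reasoning-parser gemma4` the server separates the reasoning itself and returns it in its own message field (the field name depends on the vLLM version). The `split_thinking` helper is a hypothetical fallback for raw text that still contains the `<|channel>thought\n...<channel|>` delimiters, for example when no reasoning parser is configured.

```python
import re

# Delimiters as described under Key Features.
THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)


def build_thinking_request(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 2048,
        "extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
    }


def split_thinking(raw_text: str) -> tuple[str, str]:
    """Split raw model text into (thought, answer).

    Returns an empty thought if no delimited block is present.
    """
    match = THOUGHT_RE.search(raw_text)
    if match is None:
        return "", raw_text.strip()
    thought = match.group(1).strip()
    answer = (raw_text[:match.start()] + raw_text[match.end():]).strip()
    return thought, answer


request = build_thinking_request("Is 17 prime?")
raw = "<|channel>thought\n17 is only divisible by 1 and 17.<channel|>Yes, 17 is prime."
thought, answer = split_thinking(raw)
print(thought)
print(answer)
```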
Structured Outputs
vLLM guided decoding constrains output to a JSON schema. Include semantic instructions in the system prompt — the model does not see schema descriptions.
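A minimal sketch of a schema-constrained request, assuming the server accepts the OpenAI-style `response_format` with `type: "json_schema"`. The review schema and field names are hypothetical. Note that the meaning of each field is spelled out in the system prompt, since the model does not see schema descriptions.

```python
import json

# Hypothetical schema for illustration.
REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["sentiment", "score"],
    "additionalProperties": False,
}


def build_structured_request(review: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    system = (
        "Classify the review. 'sentiment' is the overall tone; "
        "'score' is a 1-5 star rating inferred from the text."
    )
    return {
        "model": "google/gemma-4-12B-it",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": review},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "review", "schema": REVIEW_SCHEMA},
        },
        "max_tokens": 128,
    }


def parse_review(content: str) -> dict:
    """Parse the model's JSON reply and check the required keys are present."""
    data = json.loads(content)
    missing = [key for key in REVIEW_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


request = build_structured_request("Great battery life, mediocre screen.")
# Simulated reply, shaped like response.choices[0].message.content.
print(parse_review('{"sentiment": "positive", "score": 4}'))
```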
Configuration Tips
- Set `--max-model-len` to match your workload.
- `--gpu-memory-utilization 0.90–0.95` maximizes KV cache.
- Image-only workloads: pass `--limit-mm-per-prompt.audio 0`.
- Text-only workloads: pass `--limit-mm-per-prompt '{"image": 0, "audio": 0}'` to skip MM profiling.
- `--async-scheduling` improves throughput.
- FP8 KV cache (`--kv-cache-dtype fp8`) saves ~50% KV memory.
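The ~50% figure for FP8 KV cache follows from element size alone: each cached key or value drops from 2 bytes (BF16) to 1 byte. The sketch below uses the generic per-token formula with hypothetical layer and head counts (not the real 12B configuration), and it ignores the fact that sliding-window layers only cache up to their window.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Generic KV cache size per token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# HYPOTHETICAL dimensions, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
print(f"BF16: {bf16} B/token, FP8: {fp8} B/token, ratio: {fp8 / bf16}")
```

Whatever the true dimensions are, the ratio stays at one half, so the same memory budget holds about twice as many cached tokens.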
Speculative Decoding (MTP)
Add `--speculative-config` to the launch command to use MTP drafting with the assistant model. Recommended `num_speculative_tokens`: 4–8 for this model.
Note: MTP speculative decoding for Gemma 4 requires the nightly build, that is, the pinned `vllm/vllm-openai:gemma4-unified` image or a nightly pip wheel. The standard `:latest` stable tag does not include this feature.
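To get a feel for how `num_speculative_tokens` trades off against draft quality, the sketch below uses a common simplified model from the speculative decoding literature: if each drafted token is accepted independently with probability `alpha`, one verification pass yields `(1 - alpha**(k + 1)) / (1 - alpha)` tokens on average for `k` drafted tokens. The acceptance rate of 0.8 is a hypothetical input, not a measurement for this model.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)   # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Hypothetical acceptance rate; compare the ends of the recommended 4-8 range.
for k in (4, 8):
    print(k, round(expected_tokens_per_step(0.8, k), 3))
```

Under this model the gain flattens as `k` grows, so larger values mainly help when the acceptance rate is high. Measuring throughput on your own workload is the only reliable way to choose within the range.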