vLLM/Recipes

google/gemma-4-E2B-it

Google's compact Gemma 4 multimodal model (effective 2B) with native text, image, and audio understanding, plus a thinking mode and a tool-calling protocol.

dense · 5B · 131,072 ctx · vLLM 0.19.1+ · multimodal · text

Overview

Gemma 4 E2B is the smallest member of Google's Gemma 4 family — an effective-2B unified multimodal model that natively processes text, images, and audio, with structured thinking/reasoning, function calling, and dynamic vision resolution. It runs comfortably on a single 24 GB+ GPU.
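The single-GPU claim is easy to sanity-check with quick arithmetic (a sketch; the 5B dense parameter count comes from the model card above):

```python
def weight_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (bf16/fp16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# ~9.3 GiB of bf16 weights for a 5B-parameter model, leaving the rest of a
# 24 GB card for KV cache, activations, and CUDA graphs.
print(f"{weight_gib(5):.1f} GiB")  # prints "9.3 GiB"
```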

Key Features

  • Multimodal: Text + images + audio natively (video via a custom frame-extraction pipeline).
  • Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
  • Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
  • Function Calling: Custom tool-call protocol with dedicated special tokens.
  • Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
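The per-request vision budget can be validated client-side before sending. A sketch: the allowed budgets come from the list above, but the mm_processor_kwargs field and the vision_token_budget kwarg name are assumptions about the request shape, not confirmed vLLM parameters.

```python
# Allowed per-request vision token budgets, per the feature list above.
ALLOWED_VISION_BUDGETS = (70, 140, 280, 560, 1120)

def with_vision_budget(request: dict, budget: int) -> dict:
    """Attach a validated vision token budget to an OpenAI-style request dict."""
    if budget not in ALLOWED_VISION_BUDGETS:
        raise ValueError(f"budget must be one of {ALLOWED_VISION_BUDGETS}, got {budget}")
    out = dict(request)
    # Hypothetical extra_body plumbing -- field names are assumptions.
    out["extra_body"] = {"mm_processor_kwargs": {"vision_token_budget": budget}}
    return out

req = with_vision_budget({"model": "google/gemma-4-E2B-it"}, 280)
```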

TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.

Prerequisites

pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)

Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).

uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
  --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade

Docker

docker pull vllm/vllm-openai:gemma4        # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130  # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4   # AMD

Deployment Configurations

Quick Start (Single GPU)

vllm serve google/gemma-4-E2B-it \
  --max-model-len 32768

With Audio Support

vllm serve google/gemma-4-E2B-it \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4,audio=1

Full Configuration

Enables text, image, audio, thinking mode, and tool calling:

vllm serve google/gemma-4-E2B-it \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt image=4,audio=1 \
  --async-scheduling \
  --host 0.0.0.0 \
  --port 8000

Docker (NVIDIA)

docker run -itd --name gemma4-e2b \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4 \
    --model google/gemma-4-E2B-it \
    --max-model-len 32768 \
    --host 0.0.0.0 --port 8000

Docker (AMD MI300X/MI325X/MI350X/MI355X)

docker run -itd --name gemma4-rocm \
  --ipc=host --network=host --privileged \
  --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
  --group-add=video --cap-add=SYS_PTRACE \
  --security-opt=seccomp=unconfined --shm-size 16G \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:gemma4 \
    --model google/gemma-4-E2B-it \
    --host 0.0.0.0 --port 8000

Docker (Cloud TPU — Trillium / Ironwood)

TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:

docker run -itd --name gemma4-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
    --model google/gemma-4-E2B-it \
    --max-model-len 16384 \
    --disable_chunked_mm_input \
    --host 0.0.0.0 --port 8000

Client Usage

Audio Transcription

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/2/22/Beatbox_by_Wikipedia_user_Wikipedia_Brown.ogg"}},
        {"type": "text", "text": "Provide a verbatim, word-for-word transcription of the audio."},
    ]}],
    max_tokens=512,
)
print(response.choices[0].message.content)

Image Understanding

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=1024,
)
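The examples above fetch public URLs; for local files, the OpenAI-compatible API also accepts base64 data URLs in image_url. A minimal stdlib helper:

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local file as a data: URL usable in an image_url content part."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Usage: {"type": "image_url", "image_url": {"url": to_data_url("cat.jpg")}}
```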

Thinking Mode

Launch with the reasoning parser enabled, then opt in per request:

vllm serve google/gemma-4-E2B-it \
  --max-model-len 16384 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --chat-template examples/tool_chat_template_gemma4.jinja

Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
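The per-request toggle can be wrapped in a small helper. A sketch: enable_thinking is the kwarg named in this guide, and the rest of the request shape follows the OpenAI-compatible API.

```python
def build_thinking_request(prompt: str, thinking: bool) -> dict:
    """Assemble kwargs for client.chat.completions.create(**req)."""
    return {
        "model": "google/gemma-4-E2B-it",
        "messages": [{"role": "user", "content": prompt}],
        # chat_template_kwargs is forwarded to the chat template by the server.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }

req = build_thinking_request("Why is the sky blue?", thinking=True)
```

Pass it through as client.chat.completions.create(**req).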

Configuration Tips

  • Set --max-model-len to match your workload (max 131072).
  • Image-only workloads: --limit-mm-per-prompt audio=0.
  • Text-only workloads: --limit-mm-per-prompt image=0,audio=0 to skip MM profiling.
  • --async-scheduling improves throughput by overlapping scheduling work with model execution.
  • FP8 KV cache (--kv-cache-dtype fp8) roughly halves KV-cache memory.
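The ~50% KV-cache saving in the last tip follows directly from dtype width. A sketch with placeholder architecture numbers (not Gemma 4's actual layer/head counts):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, dtype_bytes: int) -> int:
    """Per-sequence KV-cache size: a key and a value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

# Placeholder shape, for illustration only (not the real Gemma 4 config).
fp16 = kv_cache_bytes(layers=30, kv_heads=8, head_dim=128, tokens=32_768, dtype_bytes=2)
fp8  = kv_cache_bytes(layers=30, kv_heads=8, head_dim=128, tokens=32_768, dtype_bytes=1)
print(fp16 // 2**20, fp8 // 2**20)  # prints "3840 1920" -- fp8 uses exactly half
```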

References