vLLM/Recipes

google/gemma-4-E2B-it

Google's compact Gemma 4 multimodal model (effective 2B) with native text, image, and audio understanding, plus a thinking mode and a tool-calling protocol.

dense · 5B · 131,072 ctx · vLLM 0.19.1+ · multimodal · text

Overview

Gemma 4 E2B is the smallest member of Google's Gemma 4 family — an effective-2B unified multimodal model that natively processes text, images, and audio, with structured thinking/reasoning, function calling, and dynamic vision resolution. It runs comfortably on a single 24 GB+ GPU.
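The single-GPU claim is easy to sanity-check with quick arithmetic (a sketch; the 5B dense parameter count comes from the model card above):

```python
def weight_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (bf16/fp16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# ~9.3 GiB of bf16 weights for a 5B-parameter model, leaving the rest of a
# 24 GB card for KV cache, activations, and CUDA graphs.
print(f"{weight_gib(5):.1f} GiB")  # prints "9.3 GiB"
```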

Key Features

  • Multimodal: Text + images + audio natively (video via a custom frame-extraction pipeline).
  • Dual Attention: Alternating sliding-window (local) and global attention with different head dimensions.
  • Thinking Mode: Structured reasoning via <|channel>thought\n...<channel|> delimiters.
  • Function Calling: Custom tool-call protocol with dedicated special tokens.
  • Dynamic Vision Resolution: Per-request configurable vision token budget (70, 140, 280, 560, 1120 tokens).
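The per-request vision budget can be validated client-side before sending. A sketch: the allowed budgets come from the list above, but the mm_processor_kwargs field and the vision_token_budget kwarg name are assumptions about the request shape, not confirmed vLLM parameters.

```python
# Allowed per-request vision token budgets, per the feature list above.
ALLOWED_VISION_BUDGETS = (70, 140, 280, 560, 1120)

def with_vision_budget(request: dict, budget: int) -> dict:
    """Attach a validated vision token budget to an OpenAI-style request dict."""
    if budget not in ALLOWED_VISION_BUDGETS:
        raise ValueError(f"budget must be one of {ALLOWED_VISION_BUDGETS}, got {budget}")
    out = dict(request)
    # Hypothetical extra_body plumbing -- field names are assumptions.
    out["extra_body"] = {"mm_processor_kwargs": {"vision_token_budget": budget}}
    return out

req = with_vision_budget({"model": "google/gemma-4-E2B-it"}, 280)
```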

TPU support is provided through vLLM TPU with recipes for Trillium and Ironwood.

Prerequisites

pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

pip (AMD ROCm: MI300X, MI325X, MI350X, MI355X)

Requires Python 3.12, ROCm 7.2.1, glibc >= 2.35 (Ubuntu 22.04+).

uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --pre \
  --extra-index-url https://wheels.vllm.ai/rocm/nightly/rocm721 --upgrade

Docker

docker pull vllm/vllm-openai:gemma4        # CUDA 12.9
docker pull vllm/vllm-openai:gemma4-cu130  # CUDA 13.0
docker pull vllm/vllm-openai-rocm:gemma4   # AMD

Deployment Configurations

Quick Start (Single GPU)

vllm serve google/gemma-4-E2B-it \
  --max-model-len 32768

With Audio Support

vllm serve google/gemma-4-E2B-it \
  --max-model-len 8192 \
  --limit-mm-per-prompt image=4,audio=1

Full Configuration

Enables text, image, audio, thinking mode, and tool calling:

vllm serve google/gemma-4-E2B-it \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt image=4,audio=1 \
  --async-scheduling \
  --host 0.0.0.0 \
  --port 8000

Docker (NVIDIA)

docker run -itd --name gemma4-e2b \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:gemma4 \
    --model google/gemma-4-E2B-it \
    --max-model-len 32768 \
    --host 0.0.0.0 --port 8000

Docker (AMD MI300X/MI325X/MI350X/MI355X)

docker run -itd --name gemma4-rocm \
  --ipc=host --network=host --privileged \
  --cap-add=CAP_SYS_ADMIN --device=/dev/kfd --device=/dev/dri \
  --group-add=video --cap-add=SYS_PTRACE \
  --security-opt=seccomp=unconfined --shm-size 16G \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:gemma4 \
    --model google/gemma-4-E2B-it \
    --host 0.0.0.0 --port 8000

Docker (Cloud TPU — Trillium / Ironwood)

TPU uses the separate vllm/vllm-tpu image (no pip wheel). Pull the tag specified by the upstream Trillium or Ironwood recipe, then run:

docker run -itd --name gemma4-tpu \
  --privileged --network host --shm-size 16G \
  -v /dev/shm:/dev/shm -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-tpu:latest \
    --model google/gemma-4-E2B-it \
    --max-model-len 16384 \
    --disable_chunked_mm_input \
    --host 0.0.0.0 --port 8000

Client Usage

Audio Transcription

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/2/22/Beatbox_by_Wikipedia_user_Wikipedia_Brown.ogg"}},
        {"type": "text", "text": "Provide a verbatim, word-for-word transcription of the audio."},
    ]}],
    max_tokens=512,
)
print(response.choices[0].message.content)

Image Understanding

response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=1024,
)
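The examples above fetch public URLs; for local files, the OpenAI-compatible API also accepts base64 data URLs in image_url. A minimal stdlib helper:

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Encode a local file as a data: URL usable in an image_url content part."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Usage: {"type": "image_url", "image_url": {"url": to_data_url("cat.jpg")}}
```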

Thinking Mode

Launch with the reasoning parser enabled, then opt in per request:

vllm serve google/gemma-4-E2B-it \
  --max-model-len 16384 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --chat-template examples/tool_chat_template_gemma4.jinja

Enable per-request via extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
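The per-request toggle can be wrapped in a small helper. A sketch: enable_thinking is the kwarg named in this guide, and the rest of the request shape follows the OpenAI-compatible API.

```python
def build_thinking_request(prompt: str, thinking: bool) -> dict:
    """Assemble kwargs for client.chat.completions.create(**req)."""
    return {
        "model": "google/gemma-4-E2B-it",
        "messages": [{"role": "user", "content": prompt}],
        # chat_template_kwargs is forwarded to the chat template by the server.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }

req = build_thinking_request("Why is the sky blue?", thinking=True)
```

Pass it through as client.chat.completions.create(**req).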

Configuration Tips

  • Set --max-model-len to match your workload (max 131072).
  • Image-only workloads: --limit-mm-per-prompt audio=0.
  • Text-only workloads: --limit-mm-per-prompt image=0,audio=0 to skip MM profiling.
  • --async-scheduling improves throughput by overlapping scheduling work with model execution.
  • FP8 KV cache (--kv-cache-dtype fp8) roughly halves KV-cache memory.
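The ~50% KV-cache saving in the last tip follows directly from dtype width. A sketch with placeholder architecture numbers (not Gemma 4's actual layer/head counts):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, dtype_bytes: int) -> int:
    """Per-sequence KV-cache size: a key and a value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

# Placeholder shape, for illustration only (not the real Gemma 4 config).
fp16 = kv_cache_bytes(layers=30, kv_heads=8, head_dim=128, tokens=32_768, dtype_bytes=2)
fp8  = kv_cache_bytes(layers=30, kv_heads=8, head_dim=128, tokens=32_768, dtype_bytes=1)
print(fp16 // 2**20, fp8 // 2**20)  # prints "3840 1920" -- fp8 uses exactly half
```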

References