Qwen/Qwen3-4B
Qwen3 4B dense model with hybrid thinking/non-thinking modes — fits on a single TPU v6e chip or one GPU.
Verified on TPU v6e (Trillium) with BF16 on a single chip
Guide
Overview
Qwen3-4B is the small dense model in the Qwen3 series, with hybrid thinking / non-thinking modes. At 4B it fits on a single accelerator — one TPU v6e (Trillium) chip or a single 16GB+ GPU — making it the cheapest entry point in the family. BF16 is the training precision and gives the best accuracy.
Toggle thinking mode per request with enable_thinking in chat_template_kwargs, or set the Reasoning feature to add the Qwen3 reasoning parser so <think> content is split into a separate field.
Prerequisites
NVIDIA
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
TPU (Trillium / v6e)
Use the official vLLM TPU image. The 4B model fits on a single v6e chip (v6e-1, --topology 1x1).
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
Deployment Configurations
TPU v6e (Trillium, single chip, TP=1)
Verified end-to-end on a single v6e chip.
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
export TP=1
export MAX_MODEL_LEN=4096
vllm serve Qwen/Qwen3-4B \
--seed 42 \
--disable-log-requests \
--gpu-memory-utilization 0.98 \
--max-num-batched-tokens 1024 \
--max-num-seqs 128 \
--tensor-parallel-size $TP \
--max-model-len $MAX_MODEL_LEN
Single GPU (BF16)
vllm serve Qwen/Qwen3-4B \
--host 0.0.0.0 \
--port 8000
Configuration Tips
--max-model-len 4096is the TPU-recipe default; native context is 40K — raise it if you have headroom.- vLLM uses 90% of device memory by default; the TPU recipe pushes
--gpu-memory-utilization 0.98to maximize KV cache. - For thinking mode Qwen recommends
temperature=0.6, top_p=0.95, top_k=20; for non-thinking,temperature=0.7, top_p=0.8.
Benchmarking
Launch the server with --no-enable-prefix-caching for consistent measurements.
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen3-4B \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 128