Qwen/Qwen3-4B

Qwen3 4B dense model with hybrid thinking/non-thinking modes — fits on a single TPU v6e chip, one GPU or one Xeon 6 NUMA node.

Verified on TPU v6e (Trillium) with BF16 on a single chip

View on HuggingFace

dense4B40,960 ctxvLLM 0.8.5+text

Guide

Overview

Qwen3-4B is the small dense model in the Qwen3 series, with hybrid thinking / non-thinking modes. At 4B it fits on a single accelerator — one TPU v6e (Trillium) chip, a single 16GB+ GPU or one Xeon 6 NUMA node — making it the cheapest entry point in the family. BF16 is the training precision and gives the best accuracy.

Toggle thinking mode per request with enable_thinking in chat_template_kwargs, or set the Reasoning feature to add the Qwen3 reasoning parser so <think> content is split into a separate field.

Prerequisites

NVIDIA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

pip (Intel Xeon 6 CPUs)

For Intel and AMD x86 CPUs, follow the CPU pre-built wheels installation instructions.

TPU (Trillium / v6e)

Use the official vLLM TPU image. The 4B model fits on a single v6e chip (v6e-1, --topology 1x1).

export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>

Deployment Configurations

TPU v6e (Trillium, single chip, TP=1)

Verified end-to-end on a single v6e chip.

export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
export TP=1
export MAX_MODEL_LEN=4096
vllm serve Qwen/Qwen3-4B \
  --seed 42 \
  --disable-log-requests \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 128 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN

Single GPU (BF16)

vllm serve Qwen/Qwen3-4B \
  --host 0.0.0.0 \
  --port 8000

Intel Xeon 6 Deployment via Docker

Launch the x86 CPU vLLM Docker container for Qwen/Qwen3-4B:

docker run -itd --name qwen4b-cpu \
  --network host \
  --shm-size 16g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-cpu:latest-x86_64 \
    --model Qwen/Qwen3-4B \
    --host 0.0.0.0 \
    --port 8000

Configuration Tips

--max-model-len 4096 is the TPU-recipe default; native context is 40K — raise it if you have headroom.
vLLM uses 90% of device memory by default; the TPU recipe pushes --gpu-memory-utilization 0.98 to maximize KV cache.
For thinking mode Qwen recommends temperature=0.6, top_p=0.95, top_k=20; for non-thinking, temperature=0.7, top_p=0.8.

Benchmarking

Launch the server with --no-enable-prefix-caching for consistent measurements.

vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen3-4B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 128