vLLM/Recipes
Qwen

Qwen/Qwen3-32B

Qwen3 32B dense model with hybrid thinking/non-thinking modes — verified on TPU v6e (Trillium).

Verified on TPU v6e (Trillium) and v7 (Ironwood) with BF16

dense32B40,960 ctxvLLM 0.8.5+text
Guide

Overview

Qwen3-32B is the flagship dense model in the Qwen3 series, with hybrid thinking / non-thinking modes. BF16 is the training precision and gives the best accuracy.

Toggle thinking mode per request with enable_thinking in chat_template_kwargs, or set the Reasoning feature to add the Qwen3 reasoning parser so <think> content is split into a separate field.

Prerequisites

NVIDIA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

TPU (Trillium / v6e)

Use the official vLLM TPU image. The 32B model is served on a v6e-4 slice (--topology 2x2, 4 chips, TP=4).

export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>

Deployment Configurations

TPU v6e (Trillium, 2x2 slice, TP=4)

Verified end-to-end on a 4-chip v6e slice.

export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
export TP=4
export MAX_MODEL_LEN=4096
vllm serve Qwen/Qwen3-32B \
  --seed 42 \
  --disable-log-requests \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 256 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN

TPU v7 (Ironwood, single host, TP=2)

Verified end-to-end on a single Ironwood host (tpu7x-standard-1t, topology 1x1x1). Served via the OpenAI API server entrypoint.

export HF_HOME=/data
export HUGGING_FACE_HUB_TOKEN=<your HF token>
export LAYOUT_Q_PROJ_AS_NDH=1
export USE_BATCHED_RPA_KERNEL=1
vllm serve Qwen/Qwen3-32B \
  --host 0.0.0.0 \
  --port 8000 \
  --seed 42 \
  --tensor-parallel-size 2 \
  --data-parallel-size 1 \
  --max-model-len 9216 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 135 \
  --gpu-memory-utilization 0.98 \
  --block-size 256 \
  --kv-cache-dtype fp8 \
  --no-enable-prefix-caching \
  --async-scheduling

Ironwood support ships in the nightly TPU image (e.g. vllm/vllm-tpu:nightly-20260330-2f76400-8c0b626).

Single node (BF16, TP=2)

vllm serve Qwen/Qwen3-32B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

Configuration Tips

  • --max-model-len 4096 keeps memory in check on a 4-chip v6e slice; native context is 40K — raise it if you have headroom.
  • vLLM uses 90% of device memory by default; the TPU recipe pushes --gpu-memory-utilization 0.98 to maximize KV cache.
  • --max-num-batched-tokens and --max-num-seqs tune batch throughput vs latency.
  • For thinking mode Qwen recommends temperature=0.6, top_p=0.95, top_k=20; for non-thinking, temperature=0.7, top_p=0.8.

Benchmarking

Launch the server with --no-enable-prefix-caching for consistent measurements.

vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen3-32B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 128

References