Qwen/Qwen2.5-32B

Qwen2.5 32B dense base (pretrained) language model for text completion.

Verified on TPU v6e (Trillium) with BF16, TP=4 on a 2x2 slice

dense | 32B | 131,072 ctx | vLLM 0.6.2+ | text

Guide

Overview

Qwen2.5-32B is the 32B dense base (pretrained, non-instruct) model in the Qwen2.5 series. It is a text-completion model: use the /v1/completions endpoint rather than the chat endpoint, or pick Qwen/Qwen2.5-32B-Instruct if you need a chat-tuned variant. BF16 is the training precision, so serving in BF16 avoids the accuracy loss that quantization can introduce.
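
As a rough illustration of the completion-style usage described above, the sketch below sends a plain prompt to /v1/completions. It assumes a server from this guide is already listening on localhost:8000; the helper names completion_payload and complete are made up for this example, and the payload builder assumes the prompt contains no characters that need JSON escaping.

```shell
# Assumption: the server is reachable here; override BASE_URL if not.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Build a /v1/completions request body (hypothetical helper).
# The prompt is inserted verbatim, so it must not contain quotes,
# backslashes, or newlines.
completion_payload() {
  printf '{"model": "Qwen/Qwen2.5-32B", "prompt": "%s", "max_tokens": %d, "temperature": 0}' "$1" "$2"
}

# Send the request and print the raw JSON response (hypothetical helper).
# $1 = prompt, $2 = max tokens to generate (default 32).
complete() {
  curl -sS "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1" "${2:-32}")"
}

# Example call, once the server is up:
#   complete "The capital of France is" 16
```

Because this is a base model, the prompt is continued as-is; there is no chat template, so phrase the prompt as the beginning of the text you want completed.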

Prerequisites

NVIDIA

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

TPU (Trillium / v6e)

Use the official vLLM TPU image. The 32B model is served on a v6e-4 slice (--topology 2x2, 4 chips, TP=4).

export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>

Deployment Configurations

TPU v6e (Trillium, 2x2 slice, TP=4)

Verified end-to-end on a 4-chip v6e slice.

export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
export TP=4
export MAX_MODEL_LEN=4096
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-32B \
  --seed 42 \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 128 \
  --tensor-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN
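
Loading and compiling a 32B model can take a while before the server accepts requests. A minimal sketch of a readiness check follows, assuming the server uses the default port 8000 and exposes the /health route; wait_for_server is a hypothetical helper name, and the 600-second default timeout is an arbitrary choice.

```shell
# Poll the server's /health route until it answers or the timeout expires.
# $1 = base URL (default http://localhost:8000)
# $2 = timeout in seconds (default 600, an arbitrary choice)
wait_for_server() {
  url="${1:-http://localhost:8000}"
  deadline=$(( $(date +%s) + ${2:-600} ))
  # curl -f makes HTTP errors count as failures; -s silences progress output.
  until curl -sf -o /dev/null "$url/health"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "server at $url not ready" >&2
      return 1
    fi
    sleep 2
  done
  echo "server at $url is ready"
}

# Example call, from a second shell after starting vllm serve:
#   wait_for_server http://localhost:8000 900
```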

Single node, 2 GPUs (BF16, TP=2)

vllm serve Qwen/Qwen2.5-32B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2

Configuration Tips

--max-model-len 4096 keeps memory in check on a 4-chip v6e slice. The native context is 128K (131,072 tokens); raise the limit if you have headroom.
VLLM_USE_V1=1 selects the V1 engine (default on recent vLLM; explicit here to match the upstream TPU recipe).
vLLM uses 90% of device memory by default; the TPU recipe pushes --gpu-memory-utilization 0.98 to maximize KV cache.
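--max-num-batched-tokens caps the tokens processed per scheduler step and --max-num-seqs caps the number of concurrent sequences; larger values favor throughput, smaller values favor latency.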
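
To see why a short --max-model-len helps, the back-of-the-envelope sketch below estimates KV cache size per sequence. The attention shape (64 layers, 8 KV heads, head dimension 128) is an assumption taken from the published model configuration rather than from this guide, and the function names are made up for the example.

```shell
# Assumed Qwen2.5-32B attention shape (from the published model config,
# not from this guide).
LAYERS=64
KV_HEADS=8
HEAD_DIM=128
BYTES_PER_VALUE=2   # BF16

# Bytes of KV cache one token occupies across all layers (keys + values).
kv_bytes_per_token() {
  echo $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE ))
}

# MiB of KV cache for one sequence of $1 tokens, summed over all TP ranks.
kv_mib_per_sequence() {
  echo $(( $1 * $(kv_bytes_per_token) / 1024 / 1024 ))
}

echo "per token:        $(kv_bytes_per_token) bytes"
echo "4096-token seq:   $(kv_mib_per_sequence 4096) MiB"
echo "131072-token seq: $(kv_mib_per_sequence 131072) MiB"
```

Under these assumptions a full 4096-token sequence needs about 1 GiB of KV cache, while a single sequence at the native 131,072-token context needs about 32 GiB, which is why raising the limit only makes sense when memory headroom is available.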

Benchmarking

Launch the server with --no-enable-prefix-caching for consistent measurements.

vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-32B \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 128
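
One way to extend the single run above is a small sweep over input lengths. The sketch below reuses only the flags already shown and prints the commands by default (DRY_RUN=1) so they can be reviewed first. The helper names and the chosen lengths are illustrative; each input length plus the 1024-token output stays within a 4096-token --max-model-len.

```shell
# Print the commands (DRY_RUN=1, the default) or execute them (DRY_RUN=0).
DRY_RUN="${DRY_RUN:-1}"

# Build one benchmark command (hypothetical helper).
# $1 = random input length, $2 = random output length
bench_cmd() {
  printf 'vllm bench serve --host 0.0.0.0 --port 8000 --model Qwen/Qwen2.5-32B --dataset-name random --random-input-len %d --random-output-len %d --num-prompts 128' "$1" "$2"
}

# Run (or print) one benchmark per input length.
run_sweep() {
  for in_len in 256 1024 2048; do
    cmd="$(bench_cmd "$in_len" 1024)"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd"
    else
      eval "$cmd"
    fi
  done
}

run_sweep
```

Set DRY_RUN=0 to execute the runs against a live server instead of printing them.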