Qwen/Qwen2.5-VL-72B-Instruct
Qwen2.5-VL dense vision-language model (72B) for high-quality image and video understanding.
Overview
This guide describes how to run the Qwen2.5-VL series with vLLM on the supported accelerator stacks. Since BF16 is the precision commonly used to train Qwen2.5-VL, serving in BF16 preserves the best accuracy at inference time.
Prerequisites
NVIDIA
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
AMD ROCm (MI300X, MI325X, MI355X)
Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. Use the Docker flow if your environment is incompatible.
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Deployment Configurations
4xA100 (BF16, TP=4)
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--limit-mm-per-prompt '{"image":2,"video":0}'
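Once the server is up, you can sanity-check it with a single multimodal request against the OpenAI-compatible endpoint. This is a minimal sketch: the host/port match the serve command above, while the image URL and prompt are placeholders you should replace.

```shell
# Smoke test: send one image + text prompt to the running vLLM server.
# Assumes the server from the command above is listening on localhost:8000.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }],
    "max_tokens": 128
  }'
```

Note that a request with more than two images (or any video) will be rejected, since --limit-mm-per-prompt above caps requests at two images and zero videos.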
4xMI300X/MI325X/MI355X (BF16, TP=4)
export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_ROCM_USE_AITER=1
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--limit-mm-per-prompt '{"image":2,"video":0}'
Qwen2.5-VL-7B-Instruct (DP=4)
For a medium-sized model like the 7B variant, data parallelism works better than tensor parallelism.
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 4 \
--limit-mm-per-prompt '{"image":2,"video":0}'
Configuration Tips
- --max-model-len 65536 is usually good for most scenarios (the native context window is 128K).
- For A100-80GB devices, TP must be >= 2 to avoid OOM.
- --limit-mm-per-prompt caps the number of multimodal items per incoming request.
- --mm-encoder-tp-mode data deploys the small ViT encoder in DP fashion (the ViT is ≈ 675M parameters vs. the 72B LM).
- vLLM uses 90% of GPU memory by default; set --gpu-memory-utilization 0.95 to maximize KV cache.
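Putting these tips together, a tuned 4xA100 launch might look like the following. This is a sketch, not a definitive configuration: the context length and memory utilization values are the suggestions above, and should be adjusted to your workload and hardware.

```shell
# Sketch: 72B serve command combining the tuning flags discussed above.
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95 \
  --limit-mm-per-prompt '{"image":2,"video":0}'
```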
Benchmarking
Launch the server with --no-enable-prefix-caching to get consistent measurements.
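For example, a benchmark-mode launch of the 4xA100 deployment adds the flag like so (a sketch; the other flags are taken from the deployment section above):

```shell
# Serve for benchmarking: prefix caching disabled so repeated runs
# measure the same work instead of hitting cached prefixes.
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --no-enable-prefix-caching
```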
VisionArena-Chat
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 128
Random Synthetic
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 128
Workload mixes:
- Prompt-heavy: 8000 in / 1000 out
- Decode-heavy: 1000 in / 8000 out
- Balanced: 1000 in / 1000 out
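As a sketch, the decode-heavy mix simply swaps the random lengths in the synthetic benchmark command above; the balanced mix sets both to 1000.

```shell
# Decode-heavy mix: 1000 input tokens in, 8000 output tokens out per prompt.
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 128
```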