Qwen/Qwen2.5-VL-72B-Instruct
Qwen2.5-VL dense vision-language model (72B) for high-quality image and video understanding.
Overview
This guide describes how to run the Qwen2.5-VL series with vLLM on the supported accelerator stacks. Since BF16 is the precision commonly used to train Qwen2.5-VL, serving in BF16 preserves the best accuracy at inference time.
Prerequisites
NVIDIA
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
AMD ROCm (MI300X, MI325X, MI355X)
Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. Use the Docker flow if your environment is incompatible.
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Deployment Configurations
4xA100 (BF16, TP=4)
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--limit-mm-per-prompt '{"image":2,"video":0}'
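Once the server is up, you can sanity-check it with a single multimodal request against the OpenAI-compatible endpoint. This is a minimal sketch: the host/port match the serve command above, while the image URL and prompt are placeholders you should replace.

```shell
# Smoke test: send one image + text prompt to the running vLLM server.
# Assumes the server from the command above is listening on localhost:8000.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-72B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }],
    "max_tokens": 128
  }'
```

Note that a request with more than two images (or any video) will be rejected, since --limit-mm-per-prompt above caps requests at two images and zero videos.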
4xMI300X/MI325X/MI355X (BF16, TP=4)
export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_ROCM_USE_AITER=1
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--limit-mm-per-prompt '{"image":2,"video":0}'
Qwen2.5-VL-7B-Instruct (DP=4)
For a medium-sized model like the 7B variant, data parallelism works better than tensor parallelism.
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 4 \
--limit-mm-per-prompt '{"image":2,"video":0}'
Configuration Tips
- --max-model-len 65536 is usually good for most scenarios (the native context window is 128K).
- For A100-80GB devices, TP must be >= 2 to avoid OOM.
- --limit-mm-per-prompt caps the number of multimodal items per incoming request.
- --mm-encoder-tp-mode data deploys the small ViT encoder in DP fashion (the ViT is ≈ 675M parameters vs. the 72B LM).
- vLLM uses 90% of GPU memory by default; set --gpu-memory-utilization 0.95 to maximize KV cache.
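Putting these tips together, a tuned 4xA100 launch might look like the following. This is a sketch, not a definitive configuration: the context length and memory utilization values are the suggestions above, and should be adjusted to your workload and hardware.

```shell
# Sketch: 72B serve command combining the tuning flags discussed above.
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95 \
  --limit-mm-per-prompt '{"image":2,"video":0}'
```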
Benchmarking
Launch the server with --no-enable-prefix-caching to get consistent measurements.
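For example, a benchmark-mode launch of the 4xA100 deployment adds the flag like so (a sketch; the other flags are taken from the deployment section above):

```shell
# Serve for benchmarking: prefix caching disabled so repeated runs
# measure the same work instead of hitting cached prefixes.
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data \
  --no-enable-prefix-caching
```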
VisionArena-Chat
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 128
Random Synthetic
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 128
Workload mixes:
- Prompt-heavy: 8000 in / 1000 out
- Decode-heavy: 1000 in / 8000 out
- Balanced: 1000 in / 1000 out
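As a sketch, the decode-heavy mix simply swaps the random lengths in the synthetic benchmark command above; the balanced mix sets both to 1000.

```shell
# Decode-heavy mix: 1000 input tokens in, 8000 output tokens out per prompt.
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 128
```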