Qwen/Qwen2.5-VL-7B-Instruct
Qwen2.5-VL dense vision-language model (7B) for image and video understanding — fits on a single TPU v6e chip or one GPU.
Verified on TPU v6e (Trillium) with BF16 on a single chip
Guide
Overview
Qwen2.5-VL-7B-Instruct is the small dense vision-language model in the Qwen2.5-VL series. At 7B it fits comfortably on a single accelerator — one TPU v6e (Trillium) chip, or a single 24GB+ GPU — making it the cheapest entry point for image and video understanding in the family. BF16 is the precision used in training, so BF16 inference gives the best accuracy.
For the 72B sibling, see Qwen2.5-VL-72B-Instruct.
Prerequisites
NVIDIA
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
TPU (Trillium / v6e)
Use the official vLLM TPU image. The 7B model fits on a single v6e chip (v6e-1, --topology 1x1).
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
docker run --rm --privileged --net=host \
--shm-size=16G \
-e HF_HOME -e HF_TOKEN \
vllm/vllm-tpu:latest \
vllm serve ...
Deployment Configurations
TPU v6e (Trillium, single chip)
Verified end-to-end on a single v6e chip.
export HF_HOME=/dev/shm
export HF_TOKEN=<your HF token>
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--dtype bfloat16 \
--gpu-memory-utilization 0.98 \
--max-model-len 16384 \
--limit-mm-per-prompt '{"image":10,"video":0}' \
--mm-processor-kwargs '{"max_pixels":1003520}' \
--guided-decoding-backend xgrammar \
--disable-chunked-mm-input
Single GPU (BF16, TP=1)
export CUDA_VISIBLE_DEVICES=0
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--limit-mm-per-prompt '{"image":2,"video":0}'
Multi-GPU (DP=4)
For higher throughput across a node, data parallelism works better than TP for a model this size.
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 4 \
--limit-mm-per-prompt '{"image":2,"video":0}'
Configuration Tips
--max-model-len 16384is a safe default on a single v6e chip (native context is 128K); raise it if you have headroom.--limit-mm-per-promptcaps incoming multimodal requests per prompt.--mm-processor-kwargs '{"max_pixels":1003520}'bounds the per-image resolution to control encoder cost.- vLLM uses 90% of device memory by default; on TPU the recipe pushes
--gpu-memory-utilization 0.98to maximize KV cache. --disable-chunked-mm-inputis recommended on TPU for stable multimodal batching.
Benchmarking
Launch the server with --no-enable-prefix-caching to get consistent measurements.
VisionArena-Chat
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 128
Random Synthetic
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 128