
Qwen/Qwen3-VL-235B-A22B-Instruct

Qwen3-VL flagship MoE vision-language model with 235B total / 22B active parameters, supporting images, video, and long context.

MoE · 235B total / 22B active · 262,144 context · vLLM 0.11.0+ · multimodal · text

Overview

Qwen3-VL is the most powerful vision-language model in the Qwen series, delivering upgrades to text understanding & generation, visual perception & reasoning, extended context, spatial/video dynamics, and agent interaction. The flagship Qwen3-VL-235B-A22B-Instruct is a MoE model that requires at least 8 GPUs with ≥80 GB memory each (A100/H100/H200 class).
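The hardware floor can be sanity-checked with rough weight-size arithmetic. This is a back-of-the-envelope sketch only; it ignores activations, KV cache, and framework overhead:

```python
# 235B parameters at 2 bytes each (BF16) vs. 1 byte (FP8).
params = 235e9
bf16_gb = params * 2 / 1e9   # ~470 GB of weights in BF16
fp8_gb = params * 1 / 1e9    # ~235 GB of weights in FP8

# Per-GPU share of the weights under tensor parallelism.
print(round(bf16_gb / 8))  # ~59 GB per GPU across 8 x 80 GB -> fits, with headroom for KV cache
print(round(fp8_gb / 4))   # ~59 GB per GPU -> why the FP8 checkpoint can run at TP4
```

The same arithmetic explains the FP8/TP4 configuration below: halving the weight footprint lets four 80 GB GPUs carry the shard that BF16 needs eight for.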

Prerequisites

uv venv
source .venv/bin/activate

# Install vLLM >= 0.11.0
uv pip install -U vllm

# Install Qwen-VL utility library (recommended for offline inference)
uv pip install qwen-vl-utils==0.0.14
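The vLLM >= 0.11.0 floor can also be checked programmatically. A minimal sketch that compares dotted numeric version strings without extra dependencies (assumes plain numeric versions, no pre-release suffixes):

```python
def at_least(installed: str, required: str) -> bool:
    # Compare dotted numeric version strings component-wise, e.g. "0.11.0" >= "0.11.0".
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(required)

print(at_least("0.11.0", "0.11.0"))  # True
print(at_least("0.10.2", "0.11.0"))  # False
```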

Deployment Configurations

H100 (Image + Video, FP8)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel \
  --async-scheduling

H100 (Image-Only, FP8, TP4)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128

A100 & H100 (Image-Only, BF16)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling

A100 & H100 (Image + Video, BF16)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 128000 \
  --async-scheduling

H200 & B200

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --async-scheduling

MI300X/MI325X/MI355X (BF16)

MIOPEN_USER_DB_PATH="$(pwd)/miopen" \
MIOPEN_FIND_MODE=FAST \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data

Configuration Tips

  • If your server handles only image inputs, pass --limit-mm-per-prompt.video 0 to save memory.
  • OMP_NUM_THREADS=1 reduces CPU contention during preprocessing.
  • The model's context length is 262K. Reduce --max-model-len (e.g. 128000) if you don't need the full range.
  • --async-scheduling overlaps scheduling with decoding for better throughput.
  • --mm-encoder-tp-mode data deploys the vision encoder in data-parallel fashion for better performance.
  • If your inputs are mostly unique, pass --mm-processor-cache-gb 0 to skip caching overhead.
  • Extend context with YaRN: --rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings":262144,"mrope_section":[24,20,20],"mrope_interleaved":true}' --max-model-len 1000000
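The YaRN settings above imply a theoretical window of roughly factor × original_max_position_embeddings. A quick sketch that parses the flag value from the tip and computes it:

```python
import json

# The JSON string passed to --rope-scaling in the tip above.
rope = json.loads(
    '{"rope_type":"yarn","factor":3.0,'
    '"original_max_position_embeddings":262144,'
    '"mrope_section":[24,20,20],"mrope_interleaved":true}'
)

# YaRN stretches the usable window by about `factor` times the
# originally trained context length.
extended = int(rope["factor"] * rope["original_max_position_embeddings"])
print(extended)  # 786432
```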

Text-only mode: pass --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0 to free memory for KV cache when serving text-only traffic.

Client Usage

import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"}},
        {"type": "text", "text": "Read all the text in the image."},
    ],
}]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(response.choices[0].message.content)
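For video inputs, vLLM's OpenAI-compatible server accepts a `video_url` content part alongside the standard `image_url` type. A sketch of the message payload (the URL is a placeholder; substitute your own asset):

```python
# Hypothetical video URL; replace with a real, reachable asset.
video_messages = [{
    "role": "user",
    "content": [
        {"type": "video_url",
         "video_url": {"url": "https://example.com/demo.mp4"}},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
# Pass `video_messages` to client.chat.completions.create exactly as above.
print(video_messages[0]["content"][0]["type"])  # video_url
```

Remember that the image-only configurations above (`--limit-mm-per-prompt.video 0`) will reject such requests.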

Troubleshooting

  • OOM on A100 / H100 BF16: reduce --max-model-len, drop to image-only, or switch to the FP8 checkpoint.
  • If enabling --mm-encoder-tp-mode data raises memory pressure, lower --gpu-memory-utilization.
