
Qwen/Qwen3-VL-235B-A22B-Instruct

Qwen3-VL flagship MoE vision-language model with 235B total / 22B active parameters, supporting images, video, and long context.

MoE · 235B total / 22B active · 262,144 context · vLLM 0.11.0+ · multimodal · text

Overview

Qwen3-VL is the most powerful vision-language model in the Qwen series, delivering upgrades to text understanding & generation, visual perception & reasoning, extended context, spatial/video dynamics, and agent interaction. The flagship Qwen3-VL-235B-A22B-Instruct is a MoE model that requires at least 8 GPUs with ≥80 GB memory each (A100/H100/H200 class).
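The hardware floor can be sanity-checked with rough weight-size arithmetic. This is a back-of-the-envelope sketch only; it ignores activations, KV cache, and framework overhead:

```python
# 235B parameters at 2 bytes each (BF16) vs. 1 byte (FP8).
params = 235e9
bf16_gb = params * 2 / 1e9   # ~470 GB of weights in BF16
fp8_gb = params * 1 / 1e9    # ~235 GB of weights in FP8

# Per-GPU share of the weights under tensor parallelism.
print(round(bf16_gb / 8))  # ~59 GB per GPU across 8 x 80 GB -> fits, with headroom for KV cache
print(round(fp8_gb / 4))   # ~59 GB per GPU -> why the FP8 checkpoint can run at TP4
```

The same arithmetic explains the FP8/TP4 configuration below: halving the weight footprint lets four 80 GB GPUs carry the shard that BF16 needs eight for.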

Prerequisites

uv venv
source .venv/bin/activate

# Install vLLM >= 0.11.0
uv pip install -U vllm

# Install Qwen-VL utility library (recommended for offline inference)
uv pip install qwen-vl-utils==0.0.14
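The vLLM >= 0.11.0 floor can also be checked programmatically. A minimal sketch that compares dotted numeric version strings without extra dependencies (assumes plain numeric versions, no pre-release suffixes):

```python
def at_least(installed: str, required: str) -> bool:
    # Compare dotted numeric version strings component-wise, e.g. "0.11.0" >= "0.11.0".
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(required)

print(at_least("0.11.0", "0.11.0"))  # True
print(at_least("0.10.2", "0.11.0"))  # False
```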

Deployment Configurations

H100 (Image + Video, FP8)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel \
  --async-scheduling

H100 (Image-Only, FP8, TP4)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128

A100 & H100 (Image-Only, BF16)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling

A100 & H100 (Image + Video, BF16)

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 128000 \
  --async-scheduling

H200 & B200

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --async-scheduling

MI300X/MI325X/MI355X (BF16)

MIOPEN_USER_DB_PATH="$(pwd)/miopen" \
MIOPEN_FIND_MODE=FAST \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data

Configuration Tips

  • If your server handles only image inputs, pass --limit-mm-per-prompt.video 0 to save memory.
  • OMP_NUM_THREADS=1 reduces CPU contention during preprocessing.
  • The model's context length is 262K. Reduce --max-model-len (e.g. 128000) if you don't need the full range.
  • --async-scheduling overlaps scheduling with decoding for better throughput.
  • --mm-encoder-tp-mode data deploys the vision encoder in data-parallel fashion for better performance.
  • If your inputs are mostly unique, pass --mm-processor-cache-gb 0 to skip caching overhead.
  • Extend context with YaRN: --rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings":262144,"mrope_section":[24,20,20],"mrope_interleaved":true}' --max-model-len 1000000
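The YaRN settings above imply a theoretical window of roughly factor × original_max_position_embeddings. A quick sketch that parses the flag value from the tip and computes it:

```python
import json

# The JSON string passed to --rope-scaling in the tip above.
rope = json.loads(
    '{"rope_type":"yarn","factor":3.0,'
    '"original_max_position_embeddings":262144,'
    '"mrope_section":[24,20,20],"mrope_interleaved":true}'
)

# YaRN stretches the usable window by about `factor` times the
# originally trained context length.
extended = int(rope["factor"] * rope["original_max_position_embeddings"])
print(extended)  # 786432
```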

Text-only mode: pass --limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0 to free memory for KV cache when serving text-only traffic.

Client Usage

import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"}},
        {"type": "text", "text": "Read all the text in the image."},
    ],
}]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(response.choices[0].message.content)
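For video inputs, vLLM's OpenAI-compatible server accepts a `video_url` content part alongside the standard `image_url` type. A sketch of the message payload (the URL is a placeholder; substitute your own asset):

```python
# Hypothetical video URL; replace with a real, reachable asset.
video_messages = [{
    "role": "user",
    "content": [
        {"type": "video_url",
         "video_url": {"url": "https://example.com/demo.mp4"}},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
# Pass `video_messages` to client.chat.completions.create exactly as above.
print(video_messages[0]["content"][0]["type"])  # video_url
```

Remember that the image-only configurations above (`--limit-mm-per-prompt.video 0`) will reject such requests.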

Troubleshooting

  • OOM on A100 / H100 BF16: reduce --max-model-len, drop to image-only, or switch to the FP8 checkpoint.
  • If enabling --mm-encoder-tp-mode data raises memory pressure, lower --gpu-memory-utilization.
