Wan-AI/Wan2.2-T2V-A14B-Diffusers
Wan2.2 video generation models — T2V/I2V MoE (14B active) and unified TI2V (5B dense), served via vLLM-Omni
Overview
Wan2.2 is a video generation family served via vLLM-Omni with optional Cache-DiT acceleration:
- `Wan-AI/Wan2.2-T2V-A14B-Diffusers` — Text-to-Video (MoE, 14B active)
- `Wan-AI/Wan2.2-I2V-A14B-Diffusers` — Image-to-Video (MoE, 14B active)
- `Wan-AI/Wan2.2-TI2V-5B-Diffusers` — Unified Text+Image-to-Video (dense, 5B)
Prerequisites
- vLLM-Omni on top of vLLM 0.12.0
- diffusers (bundled in vLLM-Omni CLI scripts)
Installation
```shell
uv venv
source .venv/bin/activate
uv pip install vllm==0.12.0
uv pip install git+https://github.com/vllm-project/vllm-omni.git@ef01223c42be10ee260b9f6e5ec31894cd09d86e
```
Text-to-Video (T2V)
```python
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Wan-AI/Wan2.2-T2V-A14B-Diffusers")
frames = omni.generate(
    "Two anthropomorphic cats in comfy boxing gear fight on a spotlighted stage.",
    height=720,
    width=1280,
    num_frames=81,
    num_inference_steps=40,
    guidance_scale=4.0,
)
```
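Wan's temporal VAE compresses time 4×, so frame counts are typically of the form 4k + 1 (hence the default 81 = 4·20 + 1). A minimal sanity-check sketch in plain Python (the helper `clip_seconds` is illustrative, not part of the vLLM-Omni API):

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Return the exported clip duration in seconds, validating the 4k + 1 frame-count convention."""
    if (num_frames - 1) % 4 != 0:
        raise ValueError("num_frames should be of the form 4k + 1, e.g. 81")
    return num_frames / fps

print(clip_seconds(81, 24))  # 81 frames at 24 fps -> 3.375 s
```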
CLI:
```shell
python examples/offline_inference/text_to_video/text_to_video.py \
    --model Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --prompt "A serene lakeside sunrise with mist over the water." \
    --height 720 --width 1280 \
    --num_frames 81 --num_inference_steps 40 \
    --guidance_scale 4.0 --fps 24 \
    --output t2v_output.mp4
```
Image-to-Video (I2V)
```python
import PIL.Image

from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Wan-AI/Wan2.2-I2V-A14B-Diffusers")
image = PIL.Image.open("input.jpg").convert("RGB")
frames = omni.generate(
    "A cat playing with yarn",
    pil_image=image,
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
)
```
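I2V derives the output size from the input image by default; if you pass `height`/`width` yourself, keep them multiples of 16 (see the parameter table below). A small helper to snap an arbitrary size to the nearest valid resolution (`snap16` is illustrative, not a vLLM-Omni function):

```python
def snap16(height: int, width: int) -> tuple[int, int]:
    """Round each dimension to the nearest multiple of 16 (minimum 16)."""
    def snap(x: int) -> int:
        return max(16, round(x / 16) * 16)
    return snap(height), snap(width)

print(snap16(480, 832))  # already valid -> (480, 832)
print(snap16(473, 829))  # snapped       -> (480, 832)
```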
TI2V (unified 5B) CLI:
```shell
python examples/offline_inference/image_to_video/image_to_video.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --image input.jpg --prompt "A cat playing with yarn" \
    --num_frames 81 --num_inference_steps 50 \
    --guidance_scale 5.0 --fps 16 --output ti2v_output.mp4
```
Cache-DiT Acceleration
Cache-DiT speeds up inference by reusing transformer-block outputs across denoising steps when their residuals change little (controlled by `residual_diff_threshold`):
```python
from vllm_omni.entrypoints.omni import Omni

omni = Omni(
    model="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    cache_backend="cache_dit",
    cache_config={
        "Fn_compute_blocks": 8,
        "Bn_compute_blocks": 0,
        "max_warmup_steps": 4,
        "residual_diff_threshold": 0.12,
    },
)
```
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `height` | 720 (T2V) / auto (I2V) | Video height (multiple of 16) |
| `width` | 1280 (T2V) / auto (I2V) | Video width (multiple of 16) |
| `num_frames` | 81 | Number of frames to generate |
| `num_inference_steps` | 40–50 | Denoising steps |
| `guidance_scale` | 4.0–5.0 | Classifier-free guidance scale |
| `boundary_ratio` | 0.875 | Timestep ratio at which the MoE hands off between its high-noise and low-noise experts |
| `flow_shift` | 5.0 (720p) / 12.0 (480p) | Scheduler flow shift |
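Since `flow_shift` depends on resolution, the table's defaults can be encoded in a tiny helper when scripting over multiple resolutions (the name `default_flow_shift` and the 720-pixel threshold are illustrative; only the 5.0/12.0 values come from the table above):

```python
def default_flow_shift(height: int) -> float:
    """Scheduler flow-shift defaults from the table: 5.0 at 720p and above, 12.0 below (e.g. 480p)."""
    return 5.0 if height >= 720 else 12.0

print(default_flow_shift(720))  # 5.0
print(default_flow_shift(480))  # 12.0
```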