vLLM/Recipes

Qwen/Qwen3-Next-80B-A3B-Instruct

Advanced Qwen3-Next MoE model (80B total / 3B active parameters) with hybrid attention, highly sparse experts, and multi-token prediction.

MoE · 80B total / 3B active · 262,144 context · vLLM 0.10.0+ · text

Overview

Qwen3-Next is an advanced LLM from the Qwen team featuring:

  • A hybrid attention mechanism
  • A highly sparse Mixture-of-Experts (MoE) structure
  • Training-stability-friendly optimizations
  • A multi-token prediction mechanism for faster inference

Prerequisites

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Deployment Configurations

Launch on 4x H200/H20 or 4x A100/A800 GPUs.

Basic Multi-GPU (BF16)

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --served-model-name qwen3-next \
  --enable-prefix-caching
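Once the server is up, you can sanity-check it with an OpenAI-compatible chat completion request. A minimal sketch, assuming the default localhost:8000 endpoint and the served model name from the command above:

```python
import json

def build_chat_request(model, prompt, max_tokens=128):
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("qwen3-next", "Give me a haiku about GPUs.")
print(json.dumps(payload, indent=2))

# To send it against a running server (default host/port assumed):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

The same payload works with the official openai Python client by pointing its base_url at the vLLM server.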

If you hit torch.AcceleratorError: CUDA error: an illegal memory access was encountered, add --compilation_config.cudagraph_mode=PIECEWISE to the serve command.

FP8 (SM90/SM100)

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --enable-prefix-caching

On SM100, accelerate with the FP8 FlashInfer TRTLLM MoE kernel:

VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_TRTLLM_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --tensor-parallel-size 4

MTP (Multi-Token Prediction)

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tokenizer-mode auto --gpu-memory-utilization 0.8 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}' \
  --tensor-parallel-size 4 --no-enable-chunked-prefill
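The --speculative-config flag takes a JSON string that vLLM parses into a dict. A small sketch of what the command above passes in, with the MTP behavior noted in comments:

```python
import json

# The exact JSON string from the serve command above.
spec_config = json.loads(
    '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
)
print(spec_config)

# With MTP, the model's built-in draft head proposes num_speculative_tokens
# extra tokens per step, which the main model verifies in one forward pass;
# accepted tokens are emitted without extra decode steps.
```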

Tool / Function Calling

Append these flags to any of the serve commands above:

vllm serve ... --tool-call-parser hermes --enable-auto-tool-choice
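With those flags set, clients pass tool definitions in the standard OpenAI function-calling format, which the hermes parser consumes. A sketch of such a request, using a hypothetical get_weather tool:

```python
import json

# Hypothetical example tool; the schema follows the OpenAI
# function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "qwen3-next",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(payload, indent=2))
```

When the model decides to call the tool, the response contains a tool_calls entry instead of plain text content.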

AMD (MI300X/MI325X/MI355X)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
SAFETENSORS_FAST_GPU=1 \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --no-enable-prefix-caching \
  --trust-remote-code

Client Usage

Benchmark:

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --served-model-name qwen3-next \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
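As a rough back-of-envelope check on the numbers the benchmark reports, output-token throughput follows directly from the parameters above; the elapsed time below is a hypothetical placeholder, so substitute the duration your run actually reports:

```python
# Parameters from the benchmark command above.
num_prompts = 100
output_len = 1024          # random output tokens per request
total_output_tokens = num_prompts * output_len

# Hypothetical wall-clock time; replace with the benchmark's reported duration.
elapsed_s = 120.0
tokens_per_s = total_output_tokens / elapsed_s
print(f"{total_output_tokens} output tokens, {tokens_per_s:.0f} tok/s")
```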

Troubleshooting

  • Sub-optimal MoE performance warning: Tune the MoE Triton kernel with vLLM's benchmark_moe script, then set the VLLM_TUNED_CONFIG_FOLDER environment variable to the directory containing the generated config.
  • Illegal memory access (IMA) error in data-parallel (DP) mode: add --compilation_config.cudagraph_mode=PIECEWISE.
  • For more parallel topologies, see the Data Parallel Deployment docs.
