Qwen/Qwen3.6-35B-A3B

Smaller Qwen3.6 multimodal MoE model (35B total / 3B active) with BF16, FP8, and NVIDIA NVFP4 variants

Compact Qwen3.6 MoE with 3B active parameters: single-GPU FP8, or BF16 on one H200 or two H100s

Tags: MoE | 35B total / 3B active | 262,144-token context | vLLM 0.17.0+ | multimodal | text

Overview

Qwen3.6-35B-A3B is the smaller sibling of Qwen3.5, sharing the same gated-delta-networks MoE architecture but with 35B total parameters and 3B activated (256 experts, 8 routed + 1 shared). The recipe covers the BF16 base model, Qwen's official FP8 checkpoint, and NVIDIA's ModelOpt NVFP4 checkpoint.

Prerequisites

  • vLLM version: >= 0.17.0
  • Hardware (BF16): 1x H200 or 2x H100
  • Hardware (FP8): 1x H100/H200 or 1x MI300X/MI325X/MI355X
  • Hardware (NVFP4): NVIDIA Blackwell GPUs, including DGX Spark (GB10)
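
The hardware tiers above follow largely from weight size. As a rough sketch, the snippet below estimates the weight-only footprint per precision, assuming nominal bytes per parameter (2 for BF16, 1 for FP8, 0.5 for NVFP4). It ignores KV cache, activations, quantization scales, and any tensors kept in higher precision, so treat the numbers as lower bounds. All 35B parameters must be resident even though only 3B are active per token.

```python
# Rough weight-only footprint for a 35B-parameter checkpoint.
# Assumption: nominal bytes per parameter; real checkpoints add scales and
# may keep some tensors in higher precision.
TOTAL_PARAMS = 35e9

BYTES_PER_PARAM = {
    "BF16": 2.0,
    "FP8": 1.0,
    "NVFP4": 0.5,
}


def weight_gib(total_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return total_params * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_gib(TOTAL_PARAMS, nbytes):.0f} GiB of weights")
```

Under these assumptions BF16 weights alone take most of an 80 GB H100, which is consistent with the recipe asking for an H200 or two H100s, while FP8 leaves room for KV cache on a single GPU.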

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

Launching the Server

Single-GPU FP8

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
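
Loading the checkpoint and capturing CUDA graphs can take a while, so scripts that launch the server usually need to wait before sending requests. A minimal sketch of a readiness poll against the server's /health endpoint follows. The function names, the default port 8000, and the timeout values are illustrative choices, not part of the recipe.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Call wait_until_ready() after starting vllm serve in the background. The probe argument exists only so the loop can be exercised without a live server.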

BF16 on 2xH200 (TP2)

vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3

MTP speculative decoding

vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
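
To reason about how many speculative tokens are worth drafting, one common simplification treats each drafted token as accepted independently with the same probability. Under that assumption, the expected number of tokens emitted per target-model step with k drafted tokens is (1 - a^(k+1)) / (1 - a). The sketch below evaluates that formula. The acceptance rates are hypothetical placeholders, not measurements for this model.

```python
def expected_tokens_per_step(acceptance: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes every drafted token is accepted independently with probability
    `acceptance` (a simplification; real acceptance varies by position and prompt).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    k = num_speculative_tokens
    if acceptance == 1.0:
        # Every draft is accepted, plus the one token the target model adds.
        return float(k + 1)
    return (1.0 - acceptance ** (k + 1)) / (1.0 - acceptance)


# Hypothetical acceptance rates, for illustration only.
for a in (0.5, 0.7, 0.9):
    for k in (1, 2, 3):
        print(f"acceptance={a}, k={k}: {expected_tokens_per_step(a, k):.2f} tokens/step")
```

The gain from each extra drafted token shrinks as k grows, which is one reason to profile before raising num_speculative_tokens.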

AMD (MI300X / MI325X / MI355X)

VLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --trust-remote-code

DGX Spark NVFP4

The NVIDIA ModelOpt NVFP4 checkpoint is served from nvidia/Qwen3.6-35B-A3B-NVFP4 and requires vLLM nightly or a source build with ModelOpt W4A16/NVFP4 support.

export VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
export FLASHINFER_DISABLE_VERSION_CHECK=1
export CUTE_DSL_ARCH=sm_121a
vllm serve nvidia/Qwen3.6-35B-A3B-NVFP4 \
  --trust-remote-code \
  --dtype auto \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend marlin \
  --gpu-memory-utilization 0.85 \
  --max-model-len 65536 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'

The Spark serving shape intentionally caps context and concurrency with --max-model-len 65536, --max-num-seqs 4, and --max-num-batched-tokens 8192. Optional throughput knobs such as --async-scheduling can be added after profiling. On vLLM V1/nightly, prefix caching and chunked prefill are enabled by default when supported, so this recipe does not pass them explicitly.
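
The effect of those three caps can be checked with simple arithmetic. The sketch below computes the worst-case number of tokens held in context across all running sequences and the minimum number of scheduler steps needed to prefill one full-length prompt when chunked prefill is active. It is a bound, not a memory model: decode tokens from other sequences share the same per-step token budget.

```python
import math

# Values taken from the DGX Spark launch command.
MAX_MODEL_LEN = 65536
MAX_NUM_SEQS = 4
MAX_NUM_BATCHED_TOKENS = 8192

# Upper bound on tokens in context if every slot holds a full-length sequence.
worst_case_tokens = MAX_MODEL_LEN * MAX_NUM_SEQS

# Lower bound on scheduler steps to prefill one full-length prompt in chunks.
prefill_steps = math.ceil(MAX_MODEL_LEN / MAX_NUM_BATCHED_TOKENS)

print(f"worst-case resident tokens: {worst_case_tokens}")
print(f"minimum prefill steps for a full-length prompt: {prefill_steps}")
```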

Processing Ultra-Long Texts

Qwen3.6-35B-A3B natively supports a context of 262,144 tokens. For longer inputs, apply YaRN RoPE scaling via --hf-overrides and raise --max-model-len. Choose the smallest factor that covers your real workload (2.0 covers ~524K tokens, 4.0 covers ~1M), because YaRN at higher factors degrades short-context quality.

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 1010000 \
  --reasoning-parser qwen3 \
  --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'

See the model card for the full parameter reference.
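
Hand-editing the JSON override is error-prone, so it can help to generate it. A minimal sketch follows, assuming the same rope_parameters as the command above and varying only factor. The helper names are made up for illustration.

```python
import json
import shlex

NATIVE_CONTEXT = 262_144  # native context length stated in this recipe


def covered_context(factor: float, native_len: int = NATIVE_CONTEXT) -> int:
    """Context length covered by a given YaRN factor."""
    return int(native_len * factor)


def yarn_hf_overrides(factor: float) -> str:
    """Build the --hf-overrides JSON string for a given YaRN factor."""
    rope = {
        "mrope_interleaved": True,
        "mrope_section": [11, 11, 10],
        "rope_type": "yarn",
        "rope_theta": 10000000,
        "partial_rotary_factor": 0.25,
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }
    return json.dumps({"text_config": {"rope_parameters": rope}})


for factor in (2.0, 4.0):
    print(f"factor {factor} covers up to {covered_context(factor)} tokens")
    # shlex.quote makes the JSON safe to paste into a shell command.
    print("--hf-overrides " + shlex.quote(yarn_hf_overrides(factor)))
```

Keep --max-model-len at or below the covered length for the chosen factor.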

Client Usage

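from openai import OpenAI

# vLLM does not check the API key unless the server was started with --api-key.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    # Must match the model name passed to `vllm serve` (e.g. the -FP8 variant).
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Explain gated delta networks in one paragraph."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)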

Troubleshooting

  • CUDA graph / Mamba cache size error: reduce --max-cudagraph-capture-size (default 512). See vLLM PR #34571.
  • Disabling reasoning: add --default-chat-template-kwargs '{"enable_thinking": false}' to the vllm serve command.
  • Prefix Caching (Mamba): currently experimental in "align" mode.
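
Reasoning can also be toggled per request instead of server-wide. The sketch below builds chat-completion arguments that pass enable_thinking through chat_template_kwargs in the request body. It assumes the server forwards chat_template_kwargs to the chat template, as vLLM's OpenAI-compatible server does. The helper name chat_request is hypothetical.

```python
from typing import Any


def chat_request(
    prompt: str,
    enable_thinking: bool,
    model: str = "Qwen/Qwen3.6-35B-A3B",
) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        # extra_body is merged into the JSON payload by the OpenAI client.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


kwargs = chat_request("Summarize YaRN in two sentences.", enable_thinking=False)
print(kwargs["extra_body"])
```

With the client from Client Usage, send it as client.chat.completions.create(**kwargs).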

References