{
  "meta": {
    "title": "Qwen3.6-35B-A3B",
    "slug": "qwen3.6-35b-a3b",
    "provider": "Qwen",
    "description": "Smaller Qwen3.6 multimodal MoE model (35B total / 3B active) with 256 experts (8 routed + 1 shared), gated delta networks architecture, and 262K context",
    "date_updated": "2026-04-18",
    "difficulty": "beginner",
    "tasks": [
      "multimodal",
      "text"
    ],
    "performance_headline": "Compact Qwen3.6 MoE with 3B active parameters: single-GPU FP8, or BF16 on 1x H200 or 2x H100",
    "related_recipes": [
      "Qwen/Qwen3.5-397B-A17B"
    ],
    "hardware": {
      "h100": "verified",
      "h200": "verified",
      "mi300x": "verified",
      "mi325x": "verified",
      "mi355x": "verified"
    }
  },
  "model": {
    "model_id": "Qwen/Qwen3.6-35B-A3B",
    "min_vllm_version": "0.17.0",
    "architecture": "moe",
    "parameter_count": "35B",
    "active_parameters": "3B",
    "context_length": 262144,
    "base_args": [
      "--trust-remote-code"
    ],
    "base_env": {}
  },
  "features": {
    "tool_calling": {
      "description": "Enable automatic tool choice with the Qwen3 Coder tool-call parser",
      "args": [
        "--enable-auto-tool-choice",
        "--tool-call-parser",
        "qwen3_coder"
      ]
    },
    "reasoning": {
      "description": "Parse chain-of-thought reasoning output with the Qwen3 reasoning parser",
      "args": [
        "--reasoning-parser",
        "qwen3"
      ]
    },
    "spec_decoding": {
      "description": "Multi-token prediction speculative decoding for lower latency",
      "args": [
        "--speculative-config",
        "{\"method\":\"mtp\",\"num_speculative_tokens\":2}"
      ]
    },
    "text_only": {
      "description": "Skip loading the vision encoder for text-only workloads — frees VRAM for KV cache. Mutually exclusive with encoder_parallel.",
      "args": [
        "--language-model-only"
      ]
    },
    "encoder_parallel": {
      "description": "Run the vision encoder in data-parallel mode — avoids TP comm overhead on the small encoder. Mutually exclusive with text_only.",
      "args": [
        "--mm-encoder-tp-mode",
        "data"
      ]
    }
  },
  "opt_in_features": [
    "spec_decoding",
    "text_only"
  ],
  "variants": {
    "default": {
      "precision": "bf16",
      "vram_minimum_gb": 84,
      "description": "Full precision BF16 — fits on 1x H200 or 2x H100"
    },
    "fp8": {
      "model_id": "Qwen/Qwen3.6-35B-A3B-FP8",
      "precision": "fp8",
      "vram_minimum_gb": 42,
      "description": "Qwen official FP8 checkpoint — single-GPU serving"
    }
  },
  "compatible_strategies": [
    "single_node_tp",
    "single_node_tep",
    "single_node_dep",
    "multi_node_tp",
    "multi_node_tp_pp",
    "multi_node_dep",
    "multi_node_tep"
  ],
  "hardware_overrides": {
    "amd": {
      "extra_args": [],
      "extra_env": {
        "VLLM_ROCM_USE_AITER": "1"
      }
    }
  },
  "strategy_overrides": {},
  "guide": "## Overview\n\n[Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) is the smaller sibling\nof the Qwen3.5-397B-A17B flagship. It shares the same gated-delta-networks MoE architecture\nbut has 35B total parameters and 3B activated (256 experts, 8 routed + 1 shared). With FP8\nweights it fits comfortably on a single 80 GB GPU and supports the full 262K context.\n\n## Prerequisites\n\n- **vLLM version:** >= 0.17.0\n- **Hardware (BF16):** 1x H200 or 2x H100\n- **Hardware (FP8):** 1x H100/H200 or 1x MI300X/MI325X/MI355X\n\n### Install vLLM\n\n```bash\nuv venv\nsource .venv/bin/activate\nuv pip install -U vllm --torch-backend=auto\n```\n\n## Launching the Server\n\n### Single-GPU FP8\n\n```bash\nvllm serve Qwen/Qwen3.6-35B-A3B-FP8 \\\n  --max-model-len 262144 \\\n  --reasoning-parser qwen3\n```\n\n### BF16 on 2x H200 (TP2)\n\n```bash\nvllm serve Qwen/Qwen3.6-35B-A3B \\\n  --tensor-parallel-size 2 \\\n  --max-model-len 262144 \\\n  --reasoning-parser qwen3\n```\n\n### MTP speculative decoding\n\n```bash\nvllm serve Qwen/Qwen3.6-35B-A3B \\\n  --tensor-parallel-size 2 \\\n  --max-model-len 262144 \\\n  --reasoning-parser qwen3 \\\n  --speculative-config '{\"method\": \"mtp\", \"num_speculative_tokens\": 2}'\n```\n\n### AMD (MI300X / MI325X / MI355X)\n\n```bash\nVLLM_ROCM_USE_AITER=1 vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \\\n  --max-model-len 262144 \\\n  --reasoning-parser qwen3 \\\n  --trust-remote-code\n```\n\n## Client Usage\n\nThe `model` field must match the model ID passed to `vllm serve`; for the FP8 launches\nabove, use `Qwen/Qwen3.6-35B-A3B-FP8`.\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(api_key=\"EMPTY\", base_url=\"http://localhost:8000/v1\")\nresp = client.chat.completions.create(\n    model=\"Qwen/Qwen3.6-35B-A3B\",\n    messages=[{\"role\": \"user\", \"content\": \"Explain gated delta networks in one paragraph.\"}],\n    max_tokens=512,\n)\nprint(resp.choices[0].message.content)\n```\n\n## Troubleshooting\n\n- **CUDA graph / Mamba cache size error:** reduce `--max-cudagraph-capture-size`\n  (default 512). See [vLLM PR #34571](https://github.com/vllm-project/vllm/pull/34571).\n- **Disabling reasoning:** add `--default-chat-template-kwargs '{\"enable_thinking\": false}'`.\n- **Prefix caching (Mamba):** currently experimental in \"align\" mode.\n\n## References\n\n- [Qwen3.6-35B-A3B on Hugging Face](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)\n- [FP8 checkpoint](https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8)\n- [Qwen3.5 recipe (sibling 397B-A17B flagship)](../Qwen3.5-397B-A17B)\n",
  "hf_org": "Qwen",
  "hf_repo": "Qwen3.6-35B-A3B",
  "hf_id": "Qwen/Qwen3.6-35B-A3B"
}