vLLM/Recipes
Xiaomi MiMo

XiaomiMiMo/MiMo-V2.5-Pro

Xiaomi's flagship MoE reasoning model (1.02T total / 42B active) with hybrid attention, native FP8 weights, and Multi-Token Prediction

MoE · 1T total / 42B active · 1,048,576 ctx · vLLM nightly+ · text
Guide

Overview

MiMo-V2.5-Pro is Xiaomi's flagship MoE reasoning model with 1.02T total parameters, 42B of which are active per token. Its 70 layers (1 dense + 69 MoE) use 384 routed experts with top-8 routing and hybrid attention (full attention plus sliding-window attention, window 128, at a 6:1 ratio). The model ships with native FP8 (block-wise e4m3) weights, and a 3-layer Multi-Token Prediction (MTP) head enables speculative decoding for roughly 3x faster output.
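The stated shape can be sanity-checked with some quick arithmetic. Note the 7-layer grouping below (6 sliding-window layers per full-attention layer) is an assumption about how the 6:1 ratio is laid out, not a confirmed layer pattern:

```python
# Back-of-the-envelope sketch of MiMo-V2.5-Pro's stated architecture.
# ASSUMPTION: the 6:1 ratio means repeating groups of 6 sliding-window
# layers followed by 1 full-attention layer.

TOTAL_LAYERS = 70      # 1 dense + 69 MoE
ROUTED_EXPERTS = 384
TOP_K = 8              # experts activated per token

groups = TOTAL_LAYERS // 7              # 7-layer repeating groups
full_attn_layers = groups * 1           # 10
swa_layers = groups * 6                 # 60
assert full_attn_layers + swa_layers == TOTAL_LAYERS

expert_fraction = TOP_K / ROUTED_EXPERTS
print(f"{full_attn_layers} full-attention / {swa_layers} sliding-window layers")
print(f"{expert_fraction:.2%} of routed experts active per token")  # ~2.08%
```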

Prerequisites

  • Hardware: 8x H200 (TP8)
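A rough memory budget shows why 8x H200 is the floor. This is an illustrative sketch only: it counts weight bytes at FP8 and ignores activation memory, CUDA graphs, and per-GPU framework overhead:

```python
# Rough FP8 weight-memory budget for 8x H200 (illustrative only).

PARAMS = 1.02e12        # total parameters
BYTES_PER_PARAM = 1     # FP8 (e4m3) stores one byte per weight
H200_HBM_GB = 141       # HBM capacity per H200
NUM_GPUS = 8

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~1020 GB of weights
total_hbm_gb = H200_HBM_GB * NUM_GPUS         # 1128 GB across the node
headroom_gb = total_hbm_gb - weights_gb       # ~108 GB left for KV cache etc.
print(f"weights ~{weights_gb:.0f} GB, HBM {total_hbm_gb} GB, headroom ~{headroom_gb:.0f} GB")
```

The remaining ~100 GB is what `--gpu-memory-utilization 0.95` carves up for the KV cache, which is why lower-memory nodes cannot host this model at TP8.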

Pull the vLLM docker image

Stable vLLM releases do not yet support MiMo-V2.5. Use the pre-built image instead:

docker pull vllm/vllm-openai:mimov25-cu129
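One way to launch the container (illustrative; the image entrypoint forwards its arguments to `vllm serve`, and the flag values mirror the TP8 launch command in the next section). Adjust the volume mount to wherever you cache HuggingFace weights:

```shell
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:mimov25-cu129 \
  XiaomiMiMo/MiMo-V2.5-Pro \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len auto \
  --generation-config vllm
```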

Launch commands

Single-node TP8 (H200):

vllm serve XiaomiMiMo/MiMo-V2.5-Pro \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len auto \
  --generation-config vllm

With tool calling + reasoning:

vllm serve XiaomiMiMo/MiMo-V2.5-Pro \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len auto \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --enable-auto-tool-choice \
  --generation-config vllm

Client usage

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5-Pro",
    "messages": [{"role": "user", "content": "Hello MiMo!"}],
    "chat_template_kwargs": {"enable_thinking": true}
  }'

Set `"enable_thinking": false` (or omit `chat_template_kwargs` entirely) to disable thinking mode.
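The same request can be made from Python using only the standard library. The sketch below builds the identical payload as the curl call; the actual send is commented out because it needs the server running:

```python
import json
from urllib import request

# Same chat-completions payload as the curl example above.
payload = {
    "model": "XiaomiMiMo/MiMo-V2.5-Pro",
    "messages": [{"role": "user", "content": "Hello MiMo!"}],
    "chat_template_kwargs": {"enable_thinking": True},
}
body = json.dumps(payload).encode()

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client works the same way; `chat_template_kwargs` is passed through as an extra body field.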

References