XiaomiMiMo/MiMo-V2.5-Pro
Xiaomi's flagship MoE reasoning model (1.02T total / 42B active) with hybrid attention, native FP8 weights, and Multi-Token Prediction
Overview
MiMo-V2.5-Pro is Xiaomi's flagship MoE reasoning model with 1.02T total parameters and 42B active per token. It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE) and ships with native FP8 (block-wise e4m3) weights. A 3-layer Multi-Token Prediction (MTP) head enables speculative decoding for ~3x output speed.
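For intuition, the sparsity described above works out as follows (pure arithmetic on the numbers quoted in the overview):

```python
# Back-of-envelope sparsity numbers, using only figures from the overview.
total_params = 1.02e12   # 1.02T total parameters
active_params = 42e9     # 42B active per token
routed_experts, top_k = 384, 8

print(f"active parameter fraction: {active_params / total_params:.1%}")
print(f"routed experts used per token: {top_k}/{routed_experts} = {top_k / routed_experts:.1%}")
```

Only about 4% of the weights participate in any given forward pass, which is why a 1T-parameter model is servable on a single 8-GPU node.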
Prerequisites
- Hardware: 8x H200 (TP8)
Pull the vLLM docker image
Stable vLLM releases do not yet support MiMo-V2.5-Pro. Use the pre-built image:
docker pull vllm/vllm-openai:mimov25-cu129
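A container invocation might look like the following sketch (assumptions: NVIDIA Container Toolkit is installed, the HuggingFace cache is mounted to avoid re-downloading the weights, and the image's entrypoint accepts the standard vLLM server arguments):

```shell
# Sketch only -- mount paths and flags may need adjusting for your host.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:mimov25-cu129 \
  --model XiaomiMiMo/MiMo-V2.5-Pro \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len auto \
  --generation-config vllm
```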
Launch commands
Single-node TP8 (H200):
vllm serve XiaomiMiMo/MiMo-V2.5-Pro \
--tensor-parallel-size 8 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--max-model-len auto \
--generation-config vllm
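Loading roughly a terabyte of FP8 weights takes a while, so a small readiness poll saves guessing when the endpoint is live. A standard-library sketch, assuming the default port 8000:

```python
import json
import time
import urllib.error
import urllib.request

def wait_ready(base="http://localhost:8000", timeout=600):
    """Poll /v1/models until the vLLM server answers, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as r:
                return json.loads(r.read())  # server is up; returns model list
        except (urllib.error.URLError, OSError):
            time.sleep(5)  # not up yet; retry
    raise TimeoutError("vLLM server did not come up in time")
```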
With tool calling + reasoning:
vllm serve XiaomiMiMo/MiMo-V2.5-Pro \
--tensor-parallel-size 8 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--max-model-len auto \
--reasoning-parser mimo \
--tool-call-parser mimo \
--enable-auto-tool-choice \
--generation-config vllm
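With those flags set, clients can pass OpenAI-style tool definitions. A minimal sketch of a request body, using a hypothetical `get_weather` tool (the tool name and schema are illustrative, not part of the model):

```python
import json

# Hypothetical tool -- name and schema are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "XiaomiMiMo/MiMo-V2.5-Pro",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the server decide when to call a tool
}
print(json.dumps(payload, indent=2))
```

POST this body to /v1/chat/completions as in the curl example below; with --tool-call-parser mimo the server should return structured tool_calls rather than raw text.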
Client usage
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "XiaomiMiMo/MiMo-V2.5-Pro",
"messages": [{"role": "user", "content": "Hello MiMo!"}],
"chat_template_kwargs": {"enable_thinking": true}
}'
Set "enable_thinking": false (or omit the kwargs) to disable thinking mode.
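The same request from Python, standard library only. The `reasoning_content` field is where vLLM's reasoning parser is generally expected to surface the thinking tokens; treat that field name as an assumption and check the actual response shape:

```python
import json
import urllib.request

# Same request body as the curl example above.
payload = {
    "model": "XiaomiMiMo/MiMo-V2.5-Pro",
    "messages": [{"role": "user", "content": "Hello MiMo!"}],
    "chat_template_kwargs": {"enable_thinking": True},
}

def chat(base="http://localhost:8000"):
    # Requires the server launched above to be running.
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.loads(resp.read())["choices"][0]["message"]
        # reasoning_content: assumed field for the parsed thinking tokens.
        return msg.get("reasoning_content"), msg.get("content")
```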