XiaomiMiMo/MiMo-V2.5
MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture.
Overview
MiMo-V2.5 is Xiaomi's native omnimodal MoE model with 310B total parameters and 15B active per token, supporting text, image, video, and audio understanding within a unified architecture. Built on the MiMo-V2-Flash backbone with dedicated vision (729M) and audio (261M) encoders, it uses 256 routed experts (top-8) with hybrid attention (SWA-128 + full-attention at 5:1 ratio) over 48 layers (1 dense + 47 MoE) and ships with native FP8 (block-wise e4m3) weights. A 3-layer Multi-Token Prediction (MTP) head is included for speculative decoding.
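The layer and expert counts above can be sanity-checked with a little arithmetic. This is an illustrative sketch only; the grouping of the 5:1 SWA-to-full ratio into repeating 6-layer blocks is an assumption about the layout, not a confirmed detail of the architecture.

```python
# Back-of-the-envelope check of the layer layout described above:
# 48 layers (1 dense + 47 MoE), SWA:full attention at a 5:1 ratio.
TOTAL_LAYERS = 48
dense_layers = 1
moe_layers = TOTAL_LAYERS - dense_layers           # 47 MoE layers

# Assumption: the 5:1 ratio repeats as blocks of 5 SWA + 1 full-attention
# layers, giving 8 full-attention layers across the 48-layer stack.
block = 5 + 1
full_attn_layers = TOTAL_LAYERS // block           # 8
swa_layers = TOTAL_LAYERS - full_attn_layers       # 40

# MoE routing: top-8 of 256 routed experts are active per token.
routed_experts, experts_per_token = 256, 8
active_fraction = experts_per_token / routed_experts

print(moe_layers, swa_layers, full_attn_layers, active_fraction)
# -> 47 40 8 0.03125
```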
Prerequisites
- Hardware: 4x H200 (TP4)
Pull the vLLM docker image
Stable vLLM releases do not yet support MiMo-V2.5. Use the pre-built image:
docker pull vllm/vllm-openai:mimov25-cu129
Launch commands
Single-node TP4 (H200):
vllm serve XiaomiMiMo/MiMo-V2.5 \
--tensor-parallel-size 4 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--max-model-len auto \
--generation-config vllm
With tool calling + reasoning:
vllm serve XiaomiMiMo/MiMo-V2.5 \
--tensor-parallel-size 4 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--max-model-len auto \
--reasoning-parser mimo \
--tool-call-parser mimo \
--enable-auto-tool-choice \
--generation-config vllm
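With `--tool-call-parser mimo` and `--enable-auto-tool-choice` set, the server accepts OpenAI-style function-calling requests. The sketch below builds such a request body; the `get_weather` function is a hypothetical example for illustration, not part of MiMo, and the schema simply follows the standard OpenAI tools format.

```python
import json

# Sketch of a tool-calling request body for the server launched above.
# `get_weather` is a hypothetical example tool, not provided by MiMo.
payload = {
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# the header Content-Type: application/json (e.g. via curl or urllib).
```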
Tunable flags:
- `--max-model-len` — full context is 1,048,576; use `auto` to size it to the KV budget.
- `--max-num-batched-tokens=32768` for prompt-heavy workloads; lower it for latency.
- `--gpu-memory-utilization=0.95` to maximize KV cache.
Client Usage
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "XiaomiMiMo/MiMo-V2.5",
"messages": [{"role": "user", "content": "Hello MiMo!"}],
"chat_template_kwargs": {"enable_thinking": true}
}'
Set "enable_thinking": false (or omit the kwargs) to disable thinking mode.
Benchmarking
Launch the server with `--no-enable-prefix-caching` to get consistent measurements.
VisionArena-Chat
vllm bench serve \
--model XiaomiMiMo/MiMo-V2.5 \
--backend openai-chat \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 128
Random Synthetic
vllm bench serve \
--model XiaomiMiMo/MiMo-V2.5 \
--dataset-name random --random-input-len 8000 --random-output-len 1000 \
--request-rate 3 --num-prompts 1800 --ignore-eos
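To size this run before launching it: at `--request-rate 3`, injecting 1800 prompts takes about ten minutes, and the workload moves roughly 16.2M tokens end to end. Illustrative arithmetic only, not a throughput prediction.

```python
# Rough sizing of the synthetic benchmark above.
num_prompts = 1800
request_rate = 3                  # requests per second
input_len, output_len = 8000, 1000

injection_seconds = num_prompts / request_rate          # 600.0 s to submit
total_tokens = num_prompts * (input_len + output_len)   # 16,200,000 tokens

print(injection_seconds, total_tokens)
# -> 600.0 16200000
```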