
XiaomiMiMo/MiMo-V2.5

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture

View on HuggingFace: https://huggingface.co/XiaomiMiMo/MiMo-V2.5
MoE · 310B total / 15B active · 1,048,576 ctx · vLLM nightly+ · multimodal · text

Overview

MiMo-V2.5 is Xiaomi's native omnimodal MoE model with 310B total parameters and 15B active per token, supporting text, image, video, and audio understanding within a unified architecture. Built on the MiMo-V2-Flash backbone with dedicated vision (729M) and audio (261M) encoders, it uses 256 routed experts (top-8) with hybrid attention (SWA-128 + full-attention at 5:1 ratio) over 48 layers (1 dense + 47 MoE) and ships with native FP8 (block-wise e4m3) weights. A 3-layer Multi-Token Prediction (MTP) head is included for speculative decoding.
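
The bundled MTP head can in principle be enabled for speculative decoding through vLLM's --speculative-config flag. A minimal sketch, assuming the head is registered under the generic mtp method; the method name and token count below are assumptions, so check the image's release notes for the exact values:

# speculative decoding via the bundled MTP head (method name and token count assumed)
vllm serve XiaomiMiMo/MiMo-V2.5 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'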

Prerequisites

  • Hardware: 4x H200 (TP4)

Pull the vLLM docker image

Stable vLLM does not yet support MiMo-V2.5. Use the pre-built image:

docker pull vllm/vllm-openai:mimov25-cu129
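
The image follows the standard vllm/vllm-openai invocation, where arguments after the image name are passed to the server; the cache mount and port mapping below are illustrative:

# mount the HF cache so weights are not re-downloaded on each run
docker run --gpus all --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:mimov25-cu129 \
  --model XiaomiMiMo/MiMo-V2.5 \
  --tensor-parallel-size 4 \
  --trust-remote-code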

Launch commands

Single-node TP4 (H200):

vllm serve XiaomiMiMo/MiMo-V2.5 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len auto \
  --generation-config vllm
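
Once startup completes, a quick sanity check against the OpenAI-compatible models endpoint confirms the server is up:

curl http://localhost:8000/v1/models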

With tool calling + reasoning:

vllm serve XiaomiMiMo/MiMo-V2.5 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len auto \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --enable-auto-tool-choice \
  --generation-config vllm
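
With the parsers enabled, requests can carry standard OpenAI-style tool definitions. The get_weather tool below is a made-up illustration, not something shipped with the model:

# get_weather is a hypothetical tool, defined only for this example
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'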

Tunable flags:

  • --max-model-len: the full context window is 1,048,576 tokens; use auto to size it to the available KV-cache budget.
  • --max-num-batched-tokens: raise to 32768 for prompt-heavy workloads; lower it to reduce per-request latency.
  • --gpu-memory-utilization: keep at 0.95 to maximize KV-cache space.

Client Usage

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "Hello MiMo!"}],
    "chat_template_kwargs": {"enable_thinking": true}
  }'

Set "enable_thinking": false (or omit the kwargs) to disable thinking mode.

Benchmarking

Launch the server with --no-enable-prefix-caching to get consistent measurements.
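
For example, the TP4 launch command above with prefix caching disabled:

vllm serve XiaomiMiMo/MiMo-V2.5 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --no-enable-prefix-caching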

VisionArena-Chat

vllm bench serve \
  --model XiaomiMiMo/MiMo-V2.5 \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 128

Random Synthetic

vllm bench serve \
  --model XiaomiMiMo/MiMo-V2.5 \
  --dataset-name random --random-input-len 8000 --random-output-len 1000 \
  --request-rate 3 --num-prompts 1800 --ignore-eos
