
MiniMaxAI/MiniMax-M2.5

MiniMax M2.5 MoE language model (230B total / 10B active) for coding, agent toolchains, and long-context reasoning — native FP8 checkpoint

View on Hugging Face: https://huggingface.co/MiniMaxAI/MiniMax-M2.5
MoE · 230B total / 10B active · 196,608-token context · vLLM 0.19.0+ · text

Overview

MiniMax-M2.5 is part of the MiniMax M2 series of advanced MoE language models. It retains the M2 architecture (10B active of 230B total parameters) and supports a 196K-token context per sequence.

Prerequisites

  • OS: Linux
  • Python: 3.10 - 3.13
  • NVIDIA: compute capability >= 7.0; ~220 GB for weights + ~240 GB per 1M context tokens (see the sizing sketch after this list)
  • AMD: MI300X / MI325X / MI350X / MI355X with ROCm 7.0+
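
The NVIDIA sizing figures above lend themselves to a quick back-of-the-envelope check. A minimal Python sketch, assuming weights and KV cache shard evenly across GPUs (illustrative only; real usage varies with vLLM version and kernel overheads):

# Rough per-GPU memory estimate from the figures above:
# ~220 GB of FP8 weights plus ~240 GB of KV cache per 1M context tokens.
WEIGHTS_GB = 220
KV_GB_PER_MILLION_TOKENS = 240

def per_gpu_estimate_gb(num_gpus: int, total_kv_tokens: int) -> float:
    """Estimate per-GPU footprint with weights and KV cache sharded evenly."""
    kv_gb = KV_GB_PER_MILLION_TOKENS * total_kv_tokens / 1_000_000
    return (WEIGHTS_GB + kv_gb) / num_gpus

# Example: TP4 with four full 196,608-token sequences in flight.
print(f"{per_gpu_estimate_gb(4, 4 * 196_608):.1f} GB per GPU")  # ~102 GB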

Install vLLM (NVIDIA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
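
A quick sanity check that the environment picked up a build new enough for the version noted above:

# Confirm the installed vLLM version (this model asks for vLLM 0.19.0+).
import vllm

print(vllm.__version__)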

Docker (dedicated M2-series image)

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.5 \
      --tensor-parallel-size 4 \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice \
      --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
      --trust-remote-code
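
Loading a 230B-parameter checkpoint can take several minutes, so it helps to poll the server before sending traffic. A small sketch using only the standard library; it assumes the default port 8000 and vLLM's /health endpoint:

# Poll until vLLM's /health endpoint answers 200 or the deadline passes.
import time
import urllib.request

def wait_for_server(url="http://localhost:8000/health", timeout_s=1800):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if urllib.request.urlopen(url, timeout=5).status == 200:
                print("server is ready")
                return
        except OSError:
            pass  # not listening yet; loading 230B of weights takes a while
        time.sleep(10)
    raise TimeoutError(f"server not ready after {timeout_s}s")

wait_for_server()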

Launching the Server

NVIDIA — TP4

vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code
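
With the server up, a minimal request sketch (assumes the openai Python client; the prompt is illustrative). Because the server runs with --reasoning-parser minimax_m2, the thinking trace comes back in a separate reasoning_content field:

# Minimal chat request against the TP4 server above (OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[{"role": "user", "content": "Summarize what expert parallelism does."}],
    max_tokens=256,
)
msg = resp.choices[0].message
# --reasoning-parser minimax_m2 splits the thinking trace out of the answer.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)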

Pure TP8 is not supported; for more than four GPUs, combine with expert parallelism (DP+EP or TP+EP), as in the TP+EP example below.

vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code
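
Because these launch commands pass --enable-auto-tool-choice and --tool-call-parser minimax_m2, the model's tool calls come back as structured tool_calls rather than raw text. A sketch with a made-up get_weather tool:

# Tool-calling sketch; get_weather is a hypothetical tool for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# With the minimax_m2 tool-call parser, calls arrive parsed, not as free text.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)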

AMD ROCm

VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M2.5 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
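
As a spot check to complement the full benchmark, here is a hypothetical single-request probe of time to first token against the same /v1/completions endpoint used above:

# Measure TTFT for one streamed request (assumes the openai client and port 8000).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
start = time.perf_counter()
stream = client.completions.create(
    model="MiniMaxAI/MiniMax-M2.5",
    prompt="def fibonacci(n):",
    max_tokens=64,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].text:
        ttft = time.perf_counter() - start  # first generated token observed
print(f"time to first token: {ttft:.3f}s")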

Troubleshooting

  • See MiniMax-M2 for shared troubleshooting notes (fuse_minimax_qk_norm, nightly vs stable, DeepGEMM, AITER).
