MiniMaxAI/MiniMax-M2
MiniMax M2 MoE language model (230B total / 10B active) for coding, agent toolchains, and long-context reasoning — native FP8 checkpoint
Overview
MiniMax-M2 is an advanced MoE language model from MiniMax. Highlights:
- Superior intelligence — #1 among open-source models globally on math, science, coding, and tool use
- Advanced coding — multi-file edits, run-fix loops, test-validated repairs (SWE-Bench, Terminal-Bench)
- Agent performance — plans and executes complex toolchains across shell, browser, and code
- Efficient design — 10B active / 230B total for low latency and high throughput
- 196K context length per sequence
Prerequisites
- OS: Linux
- Python: 3.10 - 3.13
- NVIDIA: compute capability >= 7.0; ~220 GB GPU memory for weights plus ~240 GB of KV cache per 1M context tokens
- AMD: MI300X / MI325X / MI350X / MI355X with ROCm 7.0+
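As a quick sanity check, the prerequisite figures above can be turned into a back-of-the-envelope memory estimate. This is a sketch using only the numbers quoted above (~220 GB for the FP8 weights, ~240 GB of KV cache per 1M tokens); actual usage depends on vLLM's allocator and cache settings:

```python
# Rough GPU-memory estimate for serving MiniMax-M2, using only the
# figures quoted in the prerequisites. Not a vLLM measurement.

WEIGHTS_GB = 220.0          # native FP8 checkpoint
KV_GB_PER_M_TOKENS = 240.0  # KV cache per 1M context tokens

def serving_gb(context_tokens: int) -> float:
    """Estimated total GPU memory for a given aggregate context size."""
    return WEIGHTS_GB + KV_GB_PER_M_TOKENS * context_tokens / 1_000_000

# One full-length 196K sequence, split across 4 GPUs with TP4:
total = serving_gb(196_000)
print(f"196K context: {total:.1f} GB total, {total / 4:.1f} GB per GPU under TP4")
```

By this estimate a single 196K-token sequence fits comfortably on 4x 80 GB-class GPUs, which matches the TP4 launch configurations below.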
Install vLLM (NVIDIA, stable)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Install vLLM (NVIDIA, nightly)
If you hit corrupted output, upgrade to a nightly after commit
cf3eacfe58fa9e745c2854782ada884a9f992cf7:
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Docker (NVIDIA, dedicated M2 image)
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--trust-remote-code
Launching the Server
NVIDIA — TP4 (4x H200/H20/H100 or 4x A100/A800)
vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice \
--trust-remote-code
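Once up, the server speaks the OpenAI-compatible API on port 8000. A minimal chat request with a tool definition might look like the following sketch; the get_weather tool and the localhost URL are made-up examples, and the payload follows the standard OpenAI chat-completions schema:

```python
import json
from urllib import request

# Hypothetical example tool. With --enable-auto-tool-choice and
# --tool-call-parser minimax_m2, the server extracts tool calls from
# the model output into the response's tool_calls field.
payload = {
    "model": "MiniMaxAI/MiniMax-M2",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

def send(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the request to a running vLLM server and return the JSON reply."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# With a server running: print(json.dumps(send(), indent=2))
```

The minimax_m2 reasoning parser separates the model's thinking from the final answer, so the reply's message content holds the answer while tool invocations arrive structured under tool_calls.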
Pure TP8 is not supported. For >4 GPUs use DP+EP or TP+EP:
TP4+EP (recommended for H100)
vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice
DP8+EP
vllm serve MiniMaxAI/MiniMax-M2 \
--data-parallel-size 8 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice
AMD ROCm — TP2 or TP4
VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--trust-remote-code
Benchmarking
vllm bench serve \
--backend vllm \
--model MiniMaxAI/MiniMax-M2 \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
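The run above issues 100 prompts of 2048 random input tokens each, generating 1024 output tokens per request with at most 10 requests in flight. vLLM reports throughput itself; the sketch below only shows the arithmetic behind those figures, with a made-up wall-clock duration:

```python
# Derive simple throughput figures from a benchmark run's totals.
# vllm bench serve prints these metrics itself; this just illustrates
# the arithmetic for the configuration above.

def throughput(num_prompts: int, output_len: int, duration_s: float) -> dict:
    """Aggregate output-token and request throughput for a finished run."""
    total_out = num_prompts * output_len
    return {
        "total_output_tokens": total_out,
        "output_tok_per_s": total_out / duration_s,
        "req_per_s": num_prompts / duration_s,
    }

# e.g. 100 prompts x 1024 output tokens completing in a hypothetical 120 s:
print(throughput(100, 1024, 120.0))
```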
Troubleshooting
- fuse_minimax_qk_norm not recognized: this fusion was introduced in vLLM PR #37045; ensure your vLLM build includes it.
- Corrupted output on the stable release: upgrade to a nightly after commit cf3eacfe58fa9e745c2854782ada884a9f992cf7.
- DeepGEMM: vLLM uses DeepGEMM by default; install it via install_deepgemm.sh if it is missing.
- AITER first launch: on AMD, the first launch JIT-compiles CK-based FP8 MoE, RMSNorm, and activation kernels; subsequent launches reuse the cached kernels.