MiniMaxAI/MiniMax-M2
MiniMax M2 MoE language model (230B total / 10B active) for coding, agent toolchains, and long-context reasoning — native FP8 checkpoint
Overview
MiniMax-M2 is an advanced MoE language model from MiniMax. Highlights:
- Superior intelligence — #1 among open-source models globally on math, science, coding, and tool use
- Advanced coding — multi-file edits, run-fix loops, test-validated repairs (SWE-Bench, Terminal-Bench)
- Agent performance — plans and executes complex toolchains across shell, browser, and code
- Efficient design — 10B active / 230B total for low latency and high throughput
- 196K context length per sequence
Prerequisites
- OS: Linux
- Python: 3.10 - 3.13
- NVIDIA: compute capability >= 7.0; ~220 GB GPU memory for weights plus ~240 GB of KV cache per 1M context tokens
- AMD: MI300X / MI325X / MI350X / MI355X with ROCm 7.0+
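As a quick sanity check, the prerequisite figures above can be turned into a back-of-the-envelope memory estimate. This is a sketch using only the numbers quoted above (~220 GB for the FP8 weights, ~240 GB of KV cache per 1M tokens); actual usage depends on vLLM's allocator and cache settings:

```python
# Rough GPU-memory estimate for serving MiniMax-M2, using only the
# figures quoted in the prerequisites. Not a vLLM measurement.

WEIGHTS_GB = 220.0          # native FP8 checkpoint
KV_GB_PER_M_TOKENS = 240.0  # KV cache per 1M context tokens

def serving_gb(context_tokens: int) -> float:
    """Estimated total GPU memory for a given aggregate context size."""
    return WEIGHTS_GB + KV_GB_PER_M_TOKENS * context_tokens / 1_000_000

# One full-length 196K sequence, split across 4 GPUs with TP4:
total = serving_gb(196_000)
print(f"196K context: {total:.1f} GB total, {total / 4:.1f} GB per GPU under TP4")
```

By this estimate a single 196K-token sequence fits comfortably on 4x 80 GB-class GPUs, which matches the TP4 launch configurations below.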
Install vLLM (NVIDIA, stable)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Install vLLM (NVIDIA, nightly)
If you hit corrupted output, upgrade to a nightly after commit
cf3eacfe58fa9e745c2854782ada884a9f992cf7:
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Docker (NVIDIA, dedicated M2 image)
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--trust-remote-code
Launching the Server
NVIDIA — TP4 (4x H200/H20/H100 or 4x A100/A800)
vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice \
--trust-remote-code
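Once up, the server speaks the OpenAI-compatible API on port 8000. A minimal chat request with a tool definition might look like the following sketch; the get_weather tool and the localhost URL are made-up examples, and the payload follows the standard OpenAI chat-completions schema:

```python
import json
from urllib import request

# Hypothetical example tool. With --enable-auto-tool-choice and
# --tool-call-parser minimax_m2, the server extracts tool calls from
# the model output into the response's tool_calls field.
payload = {
    "model": "MiniMaxAI/MiniMax-M2",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

def send(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the request to a running vLLM server and return the JSON reply."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# With a server running: print(json.dumps(send(), indent=2))
```

The minimax_m2 reasoning parser separates the model's thinking from the final answer, so the reply's message content holds the answer while tool invocations arrive structured under tool_calls.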
Pure TP8 is not supported. For >4 GPUs use DP+EP or TP+EP:
TP4+EP (recommended for H100)
vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice
DP8+EP
vllm serve MiniMaxAI/MiniMax-M2 \
--data-parallel-size 8 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
--enable-auto-tool-choice
AMD ROCm — TP2 or TP4
VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2 \
--tensor-parallel-size 4 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--trust-remote-code
Benchmarking
vllm bench serve \
--backend vllm \
--model MiniMaxAI/MiniMax-M2 \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
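The run above issues 100 prompts of 2048 random input tokens each, generating 1024 output tokens per request with at most 10 requests in flight. vLLM reports throughput itself; the sketch below only shows the arithmetic behind those figures, with a made-up wall-clock duration:

```python
# Derive simple throughput figures from a benchmark run's totals.
# vllm bench serve prints these metrics itself; this just illustrates
# the arithmetic for the configuration above.

def throughput(num_prompts: int, output_len: int, duration_s: float) -> dict:
    """Aggregate output-token and request throughput for a finished run."""
    total_out = num_prompts * output_len
    return {
        "total_output_tokens": total_out,
        "output_tok_per_s": total_out / duration_s,
        "req_per_s": num_prompts / duration_s,
    }

# e.g. 100 prompts x 1024 output tokens completing in a hypothetical 120 s:
print(throughput(100, 1024, 120.0))
```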
Troubleshooting
- fuse_minimax_qk_norm not recognized: this fusion was introduced in vLLM PR #37045; ensure your vLLM build includes it.
- Corrupted output on the stable release: upgrade to a nightly after commit cf3eacfe58fa9e745c2854782ada884a9f992cf7.
- DeepGEMM: vLLM uses DeepGEMM by default; install it via install_deepgemm.sh if it is missing.
- AITER first launch: on AMD, the first launch JIT-compiles CK-based FP8 MoE, RMSNorm, and activation kernels; subsequent launches reuse the cached kernels.