vLLM Recipes: MiniMax

MiniMaxAI/MiniMax-M2.7

MiniMax M2.7 MoE language model (230B total / 10B active) — latest M2 release for coding, agent toolchains, and long-context reasoning with native FP8

MoE · 230B total / 10B active · 196,608 context · vLLM 0.20.0+ · text

Overview

MiniMax-M2.7 is the latest release in the MiniMax M2 series. Like earlier M2 variants, it ships with 10B active parameters out of 230B total, and supports a 196K context per sequence. MiniMax has verified M2.7 accuracy on AIME25, GPQA-D, and GSM8K at vLLM commit 0f3ce4c74b1875791d6604e006b6e905fde9f698.

Prerequisites

  • OS: Linux
  • Python: 3.10 - 3.13
  • NVIDIA: compute capability >= 7.0; roughly 220 GB of GPU memory for weights plus 240 GB per 1M context tokens of KV cache
  • AMD: MI300X / MI325X / MI350X / MI355X with ROCm 7.0+
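The memory figures above can be turned into a quick sizing estimate. This is illustrative arithmetic over the two numbers quoted in the prerequisites (not a vLLM API); `required_memory_gb` is a made-up helper name.

```python
# Rough GPU-memory estimate for serving MiniMax-M2.7, using the figures
# quoted above: ~220 GB for FP8 weights and ~240 GB of KV cache per
# 1,000,000 context tokens. Illustrative arithmetic only.

WEIGHT_GB = 220          # native FP8 weights, per the prerequisites
KV_GB_PER_MTOK = 240     # KV cache per 1M context tokens

def required_memory_gb(context_tokens: int, concurrent_seqs: int = 1) -> float:
    """Total GPU memory needed, summed across all devices, for a workload."""
    kv = KV_GB_PER_MTOK * (context_tokens * concurrent_seqs) / 1_000_000
    return WEIGHT_GB + kv

# One full 196,608-token sequence, sharded across 4 GPUs with TP4:
total = required_memory_gb(196_608)
print(f"total ~= {total:.0f} GB, per GPU at TP4 ~= {total / 4:.0f} GB")
```

For example, a single max-length sequence lands around 267 GB total, comfortably inside 4x H200; longer-running agent workloads with many concurrent sequences scale the KV term linearly.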

Install vLLM (NVIDIA, verified commit)

uv venv
source .venv/bin/activate
export VLLM_COMMIT=0f3ce4c74b1875791d6604e006b6e905fde9f698
uv pip install vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
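After installing, it is worth confirming which build you actually got, since the commit pin lives only in the wheel index URL. A stdlib-only sketch (the function name is ours; nightly vLLM wheel versions typically embed the short git hash in the local version segment, e.g. `+g0f3ce4c74`):

```python
# Report the installed vLLM wheel version without importing vLLM itself.
# importlib.metadata is stdlib, so this works even in a broken install.
from importlib import metadata

def installed_vllm_version() -> str:
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return "not installed"

print(installed_vllm_version())
```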

Install vLLM (NVIDIA, nightly)

uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Docker (dedicated M2-series image)

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.7 \
      --tensor-parallel-size 4 \
      --tool-call-parser minimax_m2 \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice \
      --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
      --trust-remote-code
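The container can take a while to load ~220 GB of weights, so scripts should poll readiness rather than fire requests immediately. vLLM's OpenAI-compatible server exposes a `GET /health` endpoint that returns 200 once the engine is up; this stdlib-only probe is a sketch, with the base URL assuming the `-p 8000:8000` mapping above:

```python
# Minimal readiness probe for the vLLM container: loop on this until it
# returns True, then start sending requests.
import urllib.error
import urllib.request

def server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```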

Launching the Server

NVIDIA — TP4 (4x H200/H20/H100 or 4x A100/A800)

vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code
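With `--tool-call-parser minimax_m2` and `--enable-auto-tool-choice` set, the server speaks the standard OpenAI chat-completions tool-calling schema. The sketch below only builds and prints the request body; POST it to `http://localhost:8000/v1/chat/completions` once the server is up (the `get_weather` tool is a made-up example, not part of the model or vLLM):

```python
# Build an OpenAI-style chat-completions request that exercises tool
# calling. The get_weather tool is hypothetical, for illustration only.
import json

payload = {
    "model": "MiniMaxAI/MiniMax-M2.7",
    "messages": [
        {"role": "user", "content": "What's the weather in Shanghai?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # pairs with --enable-auto-tool-choice
}
print(json.dumps(payload, indent=2))
```

With the `minimax_m2` reasoning parser enabled, the response separates the model's reasoning from the final content and from any structured `tool_calls`.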

Pure TP8 (tensor parallelism alone, without expert parallelism) is not supported. For more than 4 GPUs, use DP+EP or TP+EP:

DP8+EP

vllm serve MiniMaxAI/MiniMax-M2.7 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice

TP4+EP

vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice

TP8+EP

vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice

AMD ROCm — TP2 or TP4

VLLM_ROCM_USE_AITER=1 vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M2.7 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
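To sanity-check reported numbers, it helps to know what workload those flags actually produce. Simple arithmetic over the CLI values above (the variable names are ours, not vLLM's):

```python
# Token counts implied by the benchmark flags above. Plain arithmetic,
# not part of vLLM.
num_prompts = 100
input_len, output_len = 2048, 1024
concurrency = 10

prefill_tokens = num_prompts * input_len   # tokens the server must prefill
decode_tokens = num_prompts * output_len   # tokens it must generate
waves = num_prompts // concurrency         # back-to-back batches of requests

print(prefill_tokens, decode_tokens, waves)
```

So this run prefills ~205K tokens and decodes ~102K tokens across 10 waves of 10 concurrent requests; divide `decode_tokens` by the reported wall time to cross-check output throughput.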

Troubleshooting

  • fuse_minimax_qk_norm not recognized: This fusion was introduced in vLLM PR #37045; ensure your vLLM build includes it.
  • Corrupted output on stable release: Upgrade to a nightly after commit cf3eacfe58fa9e745c2854782ada884a9f992cf7.
  • Verified accuracy: Use the pinned commit 0f3ce4c74b1875791d6604e006b6e905fde9f698 if reproducing MiniMax's reported results.
  • DeepGEMM: vLLM uses DeepGEMM by default; install via install_deepgemm.sh if missing.
  • AITER first launch: Initial AMD launch JIT-compiles optimized kernels; subsequent launches reuse cached kernels.

References