MiniMaxAI/MiniMax-M3

MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 checkpoint with an MXFP8 variant from NVIDIA. Runs on NVIDIA (Hopper/Blackwell) and on AMD CDNA4 (MI350X/MI355X) and CDNA3 (MI300X/MI325X).

Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context

MoE · 427B / 26B · 1,048,576 ctx · vLLM nightly+ · text · multimodal

Overview

MiniMax-M3 is a frontier vision-language MoE model from MiniMax.

  • MSA (MiniMax Sparse Attention): scalable sparse-attention architecture that lifts the context window to 1M tokens. MiniMax reports per-token compute at 1M context reduced to ~1/20 of the previous generation, with 9× prefill and >15× decode speedup vs dense baselines.
  • Frontier coding and agent capabilities: SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
  • Native multimodal: image + video inputs, plus computer-use; trained multimodally from step 0.
  • Two reasoning modes: thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.

Prerequisites

  • OS: Linux
  • Python: 3.10 - 3.13
  • NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
  • AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
  • --block-size 128 is mandatory on every platform (MSA sparse/index cache).

Docker (NVIDIA)

MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:

docker pull vllm/vllm-openai:minimax-m3

Docker (AMD ROCm)

docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
  --shm-size=16g -p 8000:8000 \
  --entrypoint /bin/bash \
  $AMD_DOCKER_IMAGE

Launching the Server

NVIDIA — TP8 (8x H200 / H20)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice
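
The three launch commands above differ only in their parallelism flags. A minimal Python sketch that assembles them from one helper is shown below; the helper name build_serve_cmd and the mode labels "tp", "tp+ep" and "dp+ep" are hypothetical names invented for this example, while every flag is taken from the commands above.

```python
import shlex

MODEL = "MiniMaxAI/MiniMax-M3"
MODES = ("tp", "tp+ep", "dp+ep")  # labels made up for this sketch


def build_serve_cmd(mode: str, gpus: int = 8, model: str = MODEL) -> list:
    """Return the `vllm serve` argv for one of the three layouts shown above."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["vllm", "serve", model]
    # The DP layout uses --data-parallel-size; both TP layouts use
    # --tensor-parallel-size.
    if mode == "dp+ep":
        cmd += ["--data-parallel-size", str(gpus)]
    else:
        cmd += ["--tensor-parallel-size", str(gpus)]
    if mode.endswith("+ep"):
        cmd.append("--enable-expert-parallel")
    # Flags shared by every layout; --block-size 128 is mandatory.
    cmd += [
        "--block-size", "128",
        "--tool-call-parser", "minimax_m3",
        "--reasoning-parser", "minimax_m3",
        "--enable-auto-tool-choice",
    ]
    return cmd


# Print each layout as a copy-pasteable shell line.
for mode in MODES:
    print(shlex.join(build_serve_cmd(mode)))
```

Keeping the shared flags in one place makes it harder to drop --block-size 128 when switching between layouts.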

AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))

On gfx950, prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller memory footprint. Use TP=8 for lower latency, for long contexts, or when serving the default BF16 model.

Context Length & GPU Memory

The full 1M-token window (context_length: 1048576) needs a large KV cache. To save GPU memory, you can optionally cap the context with --max-model-len:

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --max-model-len 131072        # 128K instead of the full 1M

  • Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.
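
One way to choose the value is to add the longest prompt and longest output you expect, then round up. The sketch below rounds to the next power of two, which matches the example values above but is an assumption of this example rather than a vLLM requirement; pick_max_model_len is a hypothetical helper name.

```python
MODEL_MAX_CONTEXT = 1_048_576  # context_length quoted in this section


def pick_max_model_len(longest_prompt: int, longest_output: int,
                       ceiling: int = MODEL_MAX_CONTEXT) -> int:
    """Smallest power-of-two window covering prompt + output, capped at ceiling."""
    if longest_prompt < 0 or longest_output < 0:
        raise ValueError("token counts must be non-negative")
    needed = longest_prompt + longest_output
    if needed > ceiling:
        raise ValueError(f"{needed} tokens exceeds the {ceiling}-token window")
    window = 1
    while window < needed:
        window *= 2
    return window


# 100,000 prompt tokens plus 8,192 output tokens fit in a 128K window.
print(pick_max_model_len(100_000, 8_192))
```

The result is the number to pass as --max-model-len; requests longer than that window are rejected by the server, so size it from real traffic rather than averages.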

Client Usage

Recommended sampling parameters (from the model card):

  • temperature = 1.0
  • top_p = 0.95
  • top_k = 40

Default system prompt:

You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.

Example chat request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M3",
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
      {"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
    ]
  }'
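
The same request can be built in Python. This is a minimal sketch using only the standard library; build_chat_payload and post_chat are hypothetical helper names, and the base URL assumes the server from the launch commands above is listening on localhost:8000. The payload also carries top_k, which vLLM accepts as an extra sampling field alongside the standard OpenAI parameters.

```python
import json
import urllib.request

SYSTEM_PROMPT = ("You are a helpful assistant. "
                 "Your name is MiniMax-M3 and is built by MiniMax.")


def build_chat_payload(user_text: str,
                       model: str = "MiniMaxAI/MiniMax-M3") -> dict:
    """Chat request body with the sampling parameters recommended above."""
    return {
        "model": model,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }


def post_chat(payload: dict,
              base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to /chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# With a server running:
#   reply = post_chat(build_chat_payload("Explain MSA sparse attention in 3 bullets."))
#   print(reply["choices"][0]["message"]["content"])
```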

Thinking Modes

M3 reasoning is controlled by thinking_mode, which takes one of three values:

  • enabled — the model thinks before every response, including after tool results. Use for complex reasoning and agents.
  • disabled — no thinking; the model answers directly. Use for latency-sensitive turns.
  • adaptive (default when unset) — the model decides whether to think based on the task.

Pass it per request through chat_template_kwargs. The same value also tunes the minimax_m3 reasoning parser, so the reasoning field (reasoning_content in older vLLM releases) and content are split correctly in every mode.


# Start the MiniMax-M3 server with one of the commands above first.

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content)  # the final answer
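
To switch modes per request without repeating the dictionary literal, the extra_body argument can be produced by a small helper. The sketch below is illustrative only: thinking_extra_body and split_reply are hypothetical names, and the fake message at the end stands in for a real response so the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Optional, Tuple

VALID_THINKING_MODES = ("enabled", "disabled", "adaptive")


def thinking_extra_body(mode: Optional[str] = None) -> dict:
    """Build `extra_body` for a thinking_mode; None leaves it unset (adaptive)."""
    if mode is None:
        return {}
    if mode not in VALID_THINKING_MODES:
        raise ValueError(f"thinking_mode must be one of {VALID_THINKING_MODES}")
    return {"chat_template_kwargs": {"thinking_mode": mode}}


def split_reply(message) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer), accepting either reasoning field name."""
    reasoning = (getattr(message, "reasoning", None)
                 or getattr(message, "reasoning_content", None))
    return reasoning, getattr(message, "content", None)


# Offline demonstration with a stand-in message object.
fake = SimpleNamespace(reasoning=None, reasoning_content="step 1", content="done")
print(thinking_extra_body("disabled"))
print(split_reply(fake))
```

In a real call, pass extra_body=thinking_extra_body("enabled") to client.chat.completions.create and hand response.choices[0].message to split_reply.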

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M3 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
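
As a sanity check on the benchmark report, the nominal token volume of this run can be worked out ahead of time. The sketch below is plain arithmetic on the flag values above; workload_tokens is a hypothetical helper, and the counts are nominal because generated outputs can end early.

```python
def workload_tokens(num_prompts: int, input_len: int, output_len: int) -> dict:
    """Nominal token counts for a fixed-length random benchmark workload."""
    total_input = num_prompts * input_len
    total_output = num_prompts * output_len
    return {
        "input": total_input,
        "output": total_output,
        "total": total_input + total_output,
    }


# Values from the benchmark command above: 100 prompts, 2048 in, 1024 out.
tokens = workload_tokens(num_prompts=100, input_len=2048, output_len=1024)
print(tokens)
```

Dividing the output count by the measured wall-clock duration should land close to the output throughput that the benchmark prints.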

Quantized Variant (MXFP8)

MiniMaxAI/MiniMax-M3-MXFP8 is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights, at roughly half the VRAM of the BF16 release. To use it, pass the repo id directly to vllm serve:

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.
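
The "roughly half the VRAM" figure can be checked with back-of-the-envelope arithmetic. The sketch below is a weights-only estimate under stated assumptions: all 427B parameters are stored in the given format (real checkpoints often keep some layers at higher precision), MXFP8 uses 8-bit elements with one shared 8-bit scale per 32-element block, and KV cache and activations are ignored. weight_gb is a hypothetical helper name.

```python
TOTAL_PARAMS = 427e9             # total parameter count quoted for MiniMax-M3

BF16_BITS = 16.0
MXFP8_BITS = 8.0 + 8.0 / 32.0    # element bits plus amortized block-scale bits


def weight_gb(bits_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Weights-only footprint in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_gb(BF16_BITS)
mxfp8_gb = weight_gb(MXFP8_BITS)

print(f"BF16 weights:  {bf16_gb:.1f} GB total, {bf16_gb / 8:.1f} GB per GPU at TP=8")
print(f"MXFP8 weights: {mxfp8_gb:.1f} GB total, {mxfp8_gb / 4:.1f} GB per GPU at TP=4")
print(f"MXFP8 / BF16 ratio: {mxfp8_gb / bf16_gb:.3f}")
```

Under these assumptions the BF16 weights alone take about 107 GB per GPU at TP=8, which is why the single-node BF16 fit is described as tight, and the MXFP8 weights at TP=4 come to a similar per-GPU share.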

Troubleshooting

  • --block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with No common block size for 16).
  • Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3 — distinct from minimax_m2 used by earlier releases.
  • Long context KV cache. See Context Length & GPU Memory above — cap --max-model-len or scale to multi-node TP if you OOM.
  • Vision encoder. The encoder is small, so at high TP it is cheaper to run it data-parallel (--mm-encoder-tp-mode data) and avoid TP communication overhead. The Encoder Parallel option sets this flag and also selects the vision-encoder attention backend (--mm-encoder-attn-backend FLASHINFER on NVIDIA; AITER FlashAttention, ROCM_AITER_FA, on AMD) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads, enable Text only (--language-model-only) to skip loading the encoder and free VRAM; it is mutually exclusive with Encoder Parallel.
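
Several of the checks above can be run against a launch command before starting the server. The sketch below is a hypothetical pre-flight linter, not part of vLLM: check_serve_args is an invented name, and it only encodes the rules stated in this list (block size 128, the minimax_m3 parsers, and the conflict between --language-model-only and the data-parallel encoder mode).

```python
def _normalise(args: list) -> list:
    """Split `--flag=value` tokens so both spellings are handled alike."""
    out = []
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            flag, value = arg.split("=", 1)
            out += [flag, value]
        else:
            out.append(arg)
    return out


def check_serve_args(args: list) -> list:
    """Return human-readable problems found in a `vllm serve` argv."""
    args = _normalise(args)

    def value_of(flag: str):
        if flag in args:
            i = args.index(flag)
            return args[i + 1] if i + 1 < len(args) else None
        return None

    problems = []
    if value_of("--block-size") != "128":
        problems.append("--block-size must be 128 to match the MSA sparse block size")
    for flag in ("--tool-call-parser", "--reasoning-parser"):
        value = value_of(flag)
        if value is not None and value != "minimax_m3":
            problems.append(f"{flag} should be minimax_m3, got {value}")
    if "--language-model-only" in args and value_of("--mm-encoder-tp-mode") == "data":
        problems.append("--language-model-only conflicts with --mm-encoder-tp-mode data")
    return problems


# The default block size is flagged because --block-size is missing.
print(check_serve_args(["vllm", "serve", "MiniMaxAI/MiniMax-M3",
                        "--tensor-parallel-size", "8"]))
```

An empty list means none of the listed pitfalls were detected; it does not guarantee the configuration fits in memory.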
