MiniMaxAI/MiniMax-M3
MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 checkpoint with an MXFP8 variant from NVIDIA. Runs on NVIDIA (Hopper/Blackwell) and on AMD CDNA4 (MI350X/MI355X) and CDNA3 (MI300X/MI325X).
Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context
Overview
MiniMax-M3 is a frontier vision-language MoE model from MiniMax.
- MSA (MiniMax Sparse Attention) — scalable sparse-attention architecture
that lifts the context window to 1M tokens. MiniMax reports per-token
compute at 1M context reduced to ~1/20 of the previous generation, with
9× prefill and >15× decode speedup vs dense baselines.
- Frontier coding and agent capabilities — SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
- Native multimodal — image + video inputs, plus computer-use; trained multimodally from step 0.
- Two reasoning modes: thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.
Prerequisites
- OS: Linux
- Python: 3.10 - 3.13
- NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
- AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
--block-size 128 is mandatory on every platform (MSA sparse/index cache).
Docker (NVIDIA)
MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:
docker pull vllm/vllm-openai:minimax-m3
Docker (AMD ROCm)
docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
--cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
--shm-size=16g -p 8000:8000 \
--entrypoint /bin/bash \
$AMD_DOCKER_IMAGE
Launching the Server
NVIDIA — TP8 (8x H200 / H20)
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
TP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
DP8 + Expert Parallel
vllm serve MiniMaxAI/MiniMax-M3 \
--data-parallel-size 8 \
--enable-expert-parallel \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
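The three launch commands above differ only in their parallelism flags. As a minimal sketch for scripted deployments, the helper below assembles the same argv lists. The layout labels (tp8, tp8_ep, dp8_ep) and the function names are hypothetical and chosen for illustration; only the flags themselves come from this guide.

```python
import shlex

MODEL = "MiniMaxAI/MiniMax-M3"

# Flags shared by every launch command in this guide.
COMMON_FLAGS = [
    "--block-size", "128",
    "--tool-call-parser", "minimax_m3",
    "--reasoning-parser", "minimax_m3",
    "--enable-auto-tool-choice",
]

# Hypothetical labels for the three layouts shown above.
LAYOUTS = {
    "tp8": ["--tensor-parallel-size", "8"],
    "tp8_ep": ["--tensor-parallel-size", "8", "--enable-expert-parallel"],
    "dp8_ep": ["--data-parallel-size", "8", "--enable-expert-parallel"],
}


def build_serve_command(layout: str, model: str = MODEL) -> list:
    """Return the argv for `vllm serve` under one of the layouts above."""
    if layout not in LAYOUTS:
        raise ValueError(f"unknown layout {layout!r}; expected one of {sorted(LAYOUTS)}")
    return ["vllm", "serve", model, *LAYOUTS[layout], *COMMON_FLAGS]


def as_shell(argv: list) -> str:
    """Render argv as a single copy-pasteable shell line."""
    return shlex.join(argv)
```

Calling as_shell(build_serve_command("dp8_ep")) yields a one-line form of the DP8 + Expert Parallel command, which can be handed to subprocess.run or pasted into a job script.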
AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))
On gfx950, prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller
footprint. Use TP=8 for lower latency, for long context lengths, or for the default BF16 model.
Context Length & GPU Memory
The full 1M-token window (context_length: 1048576) needs a large KV
cache. To save GPU memory, you can optionally cap the context with
--max-model-len:
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--block-size 128 \
--max-model-len 131072 # 128K instead of the full 1M
- Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.
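The sizing rule above (longest prompt plus output) can be sketched as a small helper. This is an illustration rather than a vLLM API: the function name is hypothetical, and rounding up to a multiple of the 128-token block size is an assumption made for tidiness, not a documented requirement.

```python
FULL_CONTEXT = 1_048_576  # context_length quoted in this section
BLOCK_SIZE = 128          # matches --block-size 128


def pick_max_model_len(requests: list) -> int:
    """Pick a --max-model-len from (prompt_tokens, max_output_tokens) pairs.

    Rounding up to a multiple of the block size is an assumption made
    here, not a documented vLLM requirement.
    """
    if not requests:
        raise ValueError("need at least one (prompt, output) pair")
    longest = max(prompt + output for prompt, output in requests)
    rounded = -(-longest // BLOCK_SIZE) * BLOCK_SIZE  # ceil to a multiple of 128
    return min(rounded, FULL_CONTEXT)  # never exceed the model's window
```

For example, a workload whose largest request is 100000 prompt tokens plus 8000 output tokens gives 108032, well under the 131072 cap used in the command above.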
Client Usage
Recommended sampling parameters (from the model card):
- temperature = 1.0
- top_p = 0.95
- top_k = 40
Default system prompt:
You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.
Example chat request:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M3",
"temperature": 1.0,
"top_p": 0.95,
"messages": [
{"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
{"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
]
}'
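The curl request above sets temperature and top_p but leaves out the recommended top_k. A minimal Python sketch of the same request with all three parameters follows. It assumes the server accepts top_k as an extra top-level field in the request body, which vLLM's OpenAI-compatible server generally does; the function names are hypothetical.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."


def build_chat_payload(user_text: str) -> dict:
    """Build a chat request carrying the recommended sampling parameters."""
    return {
        "model": "MiniMaxAI/MiniMax-M3",
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # assumption: accepted as a vLLM-specific extra field
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to a running server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, send_chat(build_chat_payload("Explain MSA sparse attention in 3 bullets.")) returns the same JSON structure as the curl call.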
Thinking Modes
M3 reasoning is controlled by thinking_mode, which takes one of three values:
- enabled: the model thinks before every response, including after tool results. Use for complex reasoning and agents.
- disabled: no thinking; the model answers directly. Use for latency-sensitive turns.
- adaptive (default when unset): the model decides whether to think based on the task.
Pass it per request through chat_template_kwargs. The same value also tunes
the minimax_m3 reasoning parser, so reasoning_content and content are
split correctly in every mode.
# Start the MiniMax-M3 model by referring to the command above first.
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="MiniMaxAI/MiniMax-M3",
messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content) # the final answer
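When the mode is chosen per request, validating it before the call helps catch typos early. The helper below is a hypothetical sketch: it only builds the extra_body dictionary used in the example above and does not contact the server.

```python
from typing import Optional

VALID_THINKING_MODES = ("enabled", "disabled", "adaptive")


def thinking_extra_body(mode: Optional[str] = None) -> dict:
    """Return an extra_body dict for the requested thinking mode.

    Passing None omits the key entirely, which this guide describes as
    falling back to adaptive.
    """
    if mode is None:
        return {}
    if mode not in VALID_THINKING_MODES:
        raise ValueError(
            f"thinking_mode must be one of {VALID_THINKING_MODES}, got {mode!r}"
        )
    return {"chat_template_kwargs": {"thinking_mode": mode}}
```

It slots into the request as extra_body=thinking_extra_body("disabled") for a latency-sensitive turn, or extra_body=thinking_extra_body() to leave the choice to the model.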
Benchmarking
vllm bench serve \
--backend vllm \
--model MiniMaxAI/MiniMax-M3 \
--endpoint /v1/completions \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--max-concurrency 10 \
--num-prompts 100
Quantized Variant (MXFP8)
MiniMaxAI/MiniMax-M3-MXFP8 is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights,
at roughly half the VRAM of the BF16 release. Pass the repo id directly to vllm serve:
vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
--tensor-parallel-size 8 \
--block-size 128 \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice
For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.
Troubleshooting
- --block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with No common block size for 16).
- Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3, distinct from minimax_m2 used by earlier releases.
- Long context KV cache. See Context Length & GPU Memory above: cap --max-model-len or scale to multi-node TP if you OOM.
- Vision encoder. The encoder is small, so at high TP the Encoder Parallel option runs it data-parallel (--mm-encoder-tp-mode data) to avoid TP comm overhead; it also turns on the vision-encoder attention backend (FlashInfer on NVIDIA, --mm-encoder-attn-backend FLASHINFER; AITER FlashAttention on AMD, ROCM_AITER_FA) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads, enable Text only (--language-model-only) to skip loading the encoder and free VRAM; it is mutually exclusive with Encoder Parallel.
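Several of the pitfalls above can be caught before launch by inspecting the argument list. The following is a hypothetical preflight sketch, not part of vLLM; it checks only the three rules stated in this section and uses only flag names that appear in this guide.

```python
def flag_value(argv: list, flag: str):
    """Return the value given to `flag`, handling `--f v` and `--f=v`."""
    for i, arg in enumerate(argv):
        if arg == flag:
            return argv[i + 1] if i + 1 < len(argv) else None
        if arg.startswith(flag + "="):
            return arg.split("=", 1)[1]
    return None


def preflight(argv: list) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Rule 1: the KV cache block size must match MSA's sparse block size.
    if flag_value(argv, "--block-size") != "128":
        problems.append("--block-size must be 128")
    # Rule 2: if a parser is set, it should be the M3 one, not minimax_m2.
    for parser_flag in ("--tool-call-parser", "--reasoning-parser"):
        value = flag_value(argv, parser_flag)
        if value is not None and value != "minimax_m3":
            problems.append(f"{parser_flag} should be minimax_m3, got {value}")
    # Rule 3: text-only mode and the data-parallel encoder are mutually exclusive.
    if "--language-model-only" in argv and flag_value(argv, "--mm-encoder-tp-mode") == "data":
        problems.append("--language-model-only conflicts with --mm-encoder-tp-mode data")
    return problems
```

Running preflight on the argument list before starting the server turns a late crash or silent misconfiguration into an immediate, readable message.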