MiniMaxAI/MiniMax-M3

MiniMax M3 vision-language MoE (427B total / 26B active) for frontier coding, agent toolchains, and 1M-token reasoning via MSA sparse attention — native multimodal (image + video + computer use); BF16 plus MXFP8, NVIDIA Blackwell NVFP4, and AMD MI355X MXFP4 variants. Runs on NVIDIA (Hopper/Blackwell) and AMD CDNA4/CDNA3.

Frontier coding and agent (SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0); MSA sparse attention; 1M context


MoE · 427B / 26B · 1,048,576 ctx · vLLM 0.24.0+ · text · multimodal


Overview

MiniMax-M3 is a frontier vision-language MoE model from MiniMax.

MSA (MiniMax Sparse Attention) — scalable sparse-attention architecture that lifts the context window to 1M tokens. MiniMax reports per-token compute at 1M context reduced to ~1/20 of the previous generation, with 9× prefill and >15× decode speedup vs dense baselines.
Frontier coding and agent capabilities — SWE-Bench Pro 59.0%, Terminal-Bench 2.1 66.0%, SWE-fficiency 34.8%, KernelBench Hard 28.8%, MCP Atlas 74.2%.
Native multimodal — image + video inputs, plus computer-use; trained multimodally from step 0.
Two reasoning modes — thinking (complex reasoning / agents) and non-thinking (latency-sensitive), switchable per request.

Prerequisites

OS: Linux
Python: 3.10 - 3.13
NVIDIA: compute capability >= 9.0 (Hopper) recommended; 8x H200 / H20 for a tight single-node BF16 fit, or multi-node TP for long-context headroom
AMD: MI350X/MI355X (gfx950), MI300X/MI325X (gfx942), ROCm 7.2+. BF16 needs TP=8; the MXFP8 variant runs from TP=4.
--block-size 128 is mandatory on every platform (MSA sparse/index cache).
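
As a rough check on the "tight single-node BF16 fit" and the TP=4 MXFP8 note above, the sketch below estimates weight memory per GPU from the 427B total parameter count. It is back-of-the-envelope arithmetic only: the bytes-per-parameter figures are nominal assumptions (quantization scales and any layers kept in higher precision are ignored), weights are assumed to shard evenly across the tensor-parallel group, and KV cache and activations are not counted. `weight_gb_per_gpu` is a hypothetical helper, not part of vLLM.

```python
# Rough weight-memory estimate; ignores KV cache, activations, and
# quantization scale overhead (all assumptions, not measured values).
TOTAL_PARAMS_B = 427  # total parameters in billions, from the model summary

# Nominal storage cost per parameter for each checkpoint format (assumed).
BYTES_PER_PARAM = {"bf16": 2.0, "mxfp8": 1.0, "nvfp4": 0.5, "mxfp4": 0.5}


def weight_gb_per_gpu(fmt: str, tensor_parallel_size: int) -> float:
    """Return approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    total_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM[fmt]  # billions of bytes == GB
    return total_gb / tensor_parallel_size


if __name__ == "__main__":
    for fmt, tp in [("bf16", 8), ("mxfp8", 8), ("mxfp8", 4)]:
        print(f"{fmt} TP={tp}: ~{weight_gb_per_gpu(fmt, tp):.1f} GB per GPU")
```

Under these assumptions BF16 at TP=8 and MXFP8 at TP=4 both come to roughly 107 GB of weights per GPU, which leaves little room for KV cache on smaller-memory parts.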

Docker (NVIDIA)

MiniMax-M3 support has not yet shipped in a stable vLLM release — use the dedicated Docker image:

docker pull vllm/vllm-openai:minimax-m3

Docker (AMD ROCm)

MiniMax-M3 support has not yet shipped in a stable vLLM release. Use the dedicated Docker image, or a nightly image built after the model's release:

docker pull vllm/vllm-openai-rocm:minimax-m3

docker run --rm -it --device /dev/kfd --device /dev/dri --group-add video \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined --ipc=host \
  --shm-size=16g -p 8000:8000 \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:minimax-m3

Launching the Server

NVIDIA — TP8 (8x H200 / H20)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

AMD ROCm (MI350X/MI355X (gfx950), MI300X/MI325X (gfx942))

On AMD MI300X / MI325X / MI355X, run with CUDA graphs and set the following before any of the serve commands below. It avoids the MiniMax-M3 decode breakable-cudagraph path that would otherwise force eager execution (per @hongxiayang):

export VLLM_USE_BREAKABLE_CUDAGRAPH=0

For gfx950: prefer the MXFP8 variant MiniMaxAI/MiniMax-M3-MXFP8, which runs at TP=4 with a smaller footprint. Use TP=8 for lower latency or long contexts, or when serving the default BF16 model.

TP8 (Text or Vision)

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

TP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

DP8 + Expert Parallel

vllm serve MiniMaxAI/MiniMax-M3 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

FP8 KV Cache

Add --kv-cache-dtype fp8 to any command for ~1.5× the KV pool — lossless in our testing across the full native context. Especially worth it for high concurrency or long context, where KV is the binding constraint.

Context Length & GPU Memory

The full 1M-token window (context_length: 1048576) needs a large KV cache. To save GPU memory, you can optionally cap the context with --max-model-len:

vllm serve MiniMaxAI/MiniMax-M3 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --max-model-len 131072        # 128K instead of the full 1M

AMD ROCm notes: Native context is 512K. To go past it, supply a YaRN rope_scaling on the text config (a top-level override silently misses the decoder's config) and allow the long max length. TP=8 + fp8 KV is the practical combo at 1M:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  vllm serve MiniMaxAI/MiniMax-M3 \
  --block-size 128 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --max-model-len 1048576 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m3 \
  --hf-overrides '{"text_config":{"rope_scaling":{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":524288}}}'
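
The override string above is easy to get wrong by hand, so here is a minimal sketch that generates it. It assumes the YaRN factor is simply the target length divided by the native length (which matches the 1048576 / 524288 = 2.0 values in the command); `yarn_hf_overrides` is a hypothetical helper name, not a vLLM API.

```python
import json


def yarn_hf_overrides(target_len: int, native_len: int = 524288) -> str:
    """Build the --hf-overrides JSON that nests rope_scaling under text_config."""
    if target_len <= native_len:
        raise ValueError("YaRN scaling is only needed beyond the native context")
    factor = target_len / native_len  # assumed relation: 1048576 / 524288 == 2.0
    overrides = {
        # Nested under text_config so the decoder's config receives it.
        "text_config": {
            "rope_scaling": {
                "rope_type": "yarn",
                "factor": factor,
                "original_max_position_embeddings": native_len,
            }
        }
    }
    return json.dumps(overrides, separators=(",", ":"))


if __name__ == "__main__":
    print(yarn_hf_overrides(1048576))
```

Paste the printed string, single-quoted, as the value of --hf-overrides.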

Set --max-model-len to the longest prompt + output you actually serve (e.g. 32768, 131072, 262144). A smaller window frees KV-pool headroom for higher concurrency and lets the model fit on fewer GPUs; if you need the full 1M window, consider scaling out with multi-node TP instead.

Client Usage

Recommended sampling parameters (from the model card):

temperature = 1.0
top_p = 0.95
top_k = 40

Default system prompt:

You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax.

Example chat request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M3",
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."},
      {"role": "user", "content": "Explain MSA sparse attention in 3 bullets."}
    ]
  }'
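
The same request can be built in Python using only the standard library. This is a sketch under the assumption that the server from the commands above is listening on localhost:8000; `build_chat_payload` and `send_chat` are illustrative names. It includes all three recommended sampling parameters; `top_k` is a vLLM sampling extension rather than part of the OpenAI schema, so it is sent as a plain JSON field here.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a helpful assistant. Your name is MiniMax-M3 and is built by MiniMax."
)


def build_chat_payload(user_text: str) -> dict:
    """Chat-completions body using the recommended sampling parameters."""
    return {
        "model": "MiniMaxAI/MiniMax-M3",
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 40,  # vLLM-specific sampling field
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    }


def send_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload and return the assistant text (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the `openai` client, the equivalent is to pass `extra_body={"top_k": 40}` alongside `temperature` and `top_p`.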

Thinking Modes

M3 reasoning is controlled by thinking_mode, which takes one of three values:

enabled — the model thinks before every response, including after tool results. Use for complex reasoning and agents.
disabled — no thinking; the model answers directly. Use for latency-sensitive turns.
adaptive (default when unset) — the model decides whether to think based on the task.

Pass it per request through chat_template_kwargs. The same value also tunes the minimax_m3 reasoning parser, so the reasoning and the final content are split correctly in every mode.

# Start the MiniMax-M3 server first, using one of the serve commands above.

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M3",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    extra_body={"chat_template_kwargs": {"thinking_mode": "enabled"}},
)
msg = response.choices[0].message
# vLLM exposes the <mm:think> block as `reasoning` (the older
# `reasoning_content` field is deprecated but still aliased).
print(getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None))
print(msg.content)  # the final answer
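
To switch modes per request without retyping the nested dict, a small helper along these lines may be convenient. It is a sketch: `thinking_extra_body` is a hypothetical name, and returning an empty dict for `None` relies on the description above that an unset value behaves as adaptive.

```python
from typing import Optional

VALID_THINKING_MODES = ("enabled", "disabled", "adaptive")


def thinking_extra_body(mode: Optional[str] = None) -> dict:
    """Return the extra_body dict for a given thinking_mode.

    None leaves the field unset, which the guide describes as adaptive.
    """
    if mode is None:
        return {}
    if mode not in VALID_THINKING_MODES:
        raise ValueError(f"thinking_mode must be one of {VALID_THINKING_MODES}")
    return {"chat_template_kwargs": {"thinking_mode": mode}}
```

For a latency-sensitive turn, pass `extra_body=thinking_extra_body("disabled")` to `client.chat.completions.create`.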

Benchmarking

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M3 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
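
To sweep several concurrency levels with the same settings, the command can be generated programmatically. The sketch below only builds argument lists; it uses the full `vllm bench serve` flag names (`--random-input-len`, `--random-output-len`, `--num-prompts`), and `prompts_per_slot` is a hypothetical knob that keeps the 100-prompts-at-concurrency-10 ratio from the command above.

```python
import shlex


def bench_argv(
    concurrency: int,
    input_len: int = 2048,
    output_len: int = 1024,
    prompts_per_slot: int = 10,
) -> list[str]:
    """Build one `vllm bench serve` invocation for a given concurrency."""
    return [
        "vllm", "bench", "serve",
        "--backend", "vllm",
        "--model", "MiniMaxAI/MiniMax-M3",
        "--endpoint", "/v1/completions",
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--max-concurrency", str(concurrency),
        # Scale the prompt count with concurrency so every slot stays busy.
        "--num-prompts", str(concurrency * prompts_per_slot),
    ]


if __name__ == "__main__":
    for c in (1, 10, 64):
        print(shlex.join(bench_argv(c)))
```

Each printed line can be run as-is, or the list can be handed to `subprocess.run`.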

Quantized Variant (MXFP8)

MiniMaxAI/MiniMax-M3-MXFP8 is an MXFP8 checkpoint quantized by NVIDIA from the original BF16 weights, at roughly half the VRAM of the BF16 release. Select the mxfp8 variant in the command builder, or pass the repo id directly to vllm serve:

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

For best MXFP8 throughput, prefer Blackwell (B200/B300) for native MX tensor cores, or AMD CDNA4 (MI350X/MI355X, gfx950) for native MXFP8 matrix cores.

On AMD MI300X, the mxfp8 variant stacks two accuracy-safe scheduling levers from the InferenceX MI300X tuning recipe and avoids the high-concurrency expert-parallel regression by using plain TP8. Set MAX_MODEL_LEN in the environment to the context length you serve before running the command:

export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0
export TORCH_BLAS_PREFER_HIPBLASLT=1
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --language-model-only \
  --max-model-len "$MAX_MODEL_LEN" \
  --attention-backend TRITON_ATTN \
  --async-scheduling \
  --max-num-batched-tokens 16384 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

On AMD MI355X, the mxfp8 variant additionally enables the AITER kernels, the fused shared-experts MoE, and INT4 QuickReduce, and uses the nightly ROCm image (vllm/vllm-openai-rocm:nightly). The command builder selects all of this automatically; other MI3xx GPUs keep the pinned image. If you launch by hand:

export VLLM_USE_BREAKABLE_CUDAGRAPH=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 8 \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

MXFP8 high-concurrency options on MI355X

The command above is the default serving configuration and is tuned for the general case. AITER page-16 sparse paged attention helps specifically at long context and high concurrency. It maps MiniMax-M3's top-k 128-token MSA blocks onto AITER page-16 block tables and runs AITER Gluon paged attention over only the selected KV pages.

The high-concurrency configuration is:

# Long context + high concurrency only (isl >= 8k, conc >= 64). Regresses
# short-context / low-concurrency serving.
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=0
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION_MIN_SIZE_KB=256

vllm serve MiniMaxAI/MiniMax-M3-MXFP8 \
  --tensor-parallel-size 4 \
  --block-size 128 \
  --language-model-only \
  --moe-backend aiter \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 32768 \
  --attention-backend TRITON_ATTN \
  --linear-backend emulation \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

Note: Outside the long-context/high-batch regime the shuffled layout costs more than it saves; leave it at 0:

export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=0
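
The choice between the two MI355X environment sets can be written down as a small decision function. This is a sketch that covers the environment variables only (the serve flags also differ between the two commands above), treats "8k" as 8192 tokens as an assumption, and uses the hypothetical name `mi355x_mxfp8_env`.

```python
DEFAULT_ENV = {
    "VLLM_USE_BREAKABLE_CUDAGRAPH": "0",
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS": "1",
    "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
}

HIGH_CONCURRENCY_ENV = {
    **DEFAULT_ENV,
    "VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT": "1",
    "VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16": "0",
    "VLLM_ROCM_QUICK_REDUCE_QUANTIZATION_MIN_SIZE_KB": "256",
}


def mi355x_mxfp8_env(input_len: int, concurrency: int) -> dict:
    """Pick env vars: the high-concurrency set only for isl >= 8k and conc >= 64."""
    if input_len >= 8192 and concurrency >= 64:  # 8k read as 8192 (assumption)
        return dict(HIGH_CONCURRENCY_ENV)
    # Everywhere else keep the default set and the unshuffled KV layout.
    return {**DEFAULT_ENV, "VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT": "0"}
```

The returned dict can be merged into `os.environ` before spawning `vllm serve`.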

Quantized Variant (NVFP4 on Blackwell)

nvidia/MiniMax-M3-NVFP4 is an NVFP4 checkpoint quantized by NVIDIA with ModelOpt. It requires a vLLM build with MiniMax-M3 NVFP4 support (vllm-project/vllm PR #46380).

Select the nvfp4 variant and B200 hardware in the command builder, or launch the tested B200 baseline directly:

VLLM_FLOAT32_MATMUL_PRECISION=high \
  vllm serve nvidia/MiniMax-M3-NVFP4 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --block-size 128 \
  --tool-call-parser minimax_m3 \
  --reasoning-parser minimax_m3 \
  --enable-auto-tool-choice

Pair it with EAGLE3 speculative decoding (the Spec decoding feature) for faster decode — the validated B200 benchmark runs the NVFP4 target against the Inferact/MiniMax-M3-EAGLE3 draft head (3 speculative tokens):

VLLM_FLOAT32_MATMUL_PRECISION=high \
  vllm serve nvidia/MiniMax-M3-NVFP4 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --block-size 128 \
  --language-model-only \
  --speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'
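
To get a feel for what 3 speculative tokens can buy, the sketch below computes the expected number of tokens emitted per target-model forward pass. It uses the standard simplifying assumption that every draft token is accepted independently with the same probability; the acceptance rates in the example are hypothetical inputs, not measurements of the EAGLE3 draft head.

```python
def expected_tokens_per_step(
    acceptance_rate: float, num_speculative_tokens: int = 3
) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each draft token is accepted independently with the same
    probability; real acceptance varies by position and workload.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    # 1 + a + a^2 + ... + a^k: the target always contributes one token, and
    # the i-th draft token counts only if it and every draft before it
    # were accepted.
    return sum(acceptance_rate**i for i in range(num_speculative_tokens + 1))


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):  # hypothetical acceptance rates
        print(f"accept={rate}: {expected_tokens_per_step(rate):.3f} tokens/step")
```

The result is an upper bound on decode speedup, since it ignores the cost of running the draft head and verifying its tokens.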

Quantized Variant (MXFP4 on MI355X)

amd/MiniMax-M3-MXFP4 uses static OCP MXFP4 weights and dynamic OCP MXFP4 activations produced with AMD Quark. It targets AMD Instinct MI355X (gfx950) and requires a ROCm nightly containing the Quark MXFP4 fix from vLLM PR #45794.

Select the mxfp4 variant and MI355X hardware in the command builder, or launch the tested TP8 baseline directly:

vllm serve amd/MiniMax-M3-MXFP4 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --mm-encoder-tp-mode data \
  --mm-encoder-attn-backend ROCM_AITER_FA \
  --tool-call-parser minimax_m3 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m3

MXFP4 benchmark reproduction (InferenceX MI355X sweep)

The command above is the default serving configuration. The SemiAnalysis InferenceX MI355X benchmark lane for this checkpoint is a separate high-concurrency benchmark configuration — not the recipe default. To reproduce that sweep, start from the default command above and apply these deltas:

Use --tensor-parallel-size 4 (the sweep runs TP4; the default recipe keeps TP8 for KV-cache and multimodal-encoder headroom).
Add --kv-cache-dtype fp8.
Use the text-only language-model path with --no-enable-prefix-caching, --language-model-only, and an explicit --max-model-len.
Force the AITER MXFP4 MoE backend with --moe-backend aiter plus the AITER MoE env vars below.
Run on a current vllm/vllm-openai-rocm:nightly (the lane pins a specific nightly build — don't hard-code a transient tag).
Export the AITER MoE kernel-selection variables and the INT4 quantized all-reduce variables before launch:

# Kernel selection — required to match the benchmarked MoE path.
# Without these (and `--moe-backend aiter` on the serve command), vLLM's MXFP4
# MoE oracle selects the TRITON_UNFUSED fallback on gfx950 instead of the
# AITER_MXFP4_MXFP4 kernel the InferenceX lane actually runs.
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1

# Benchmark-only — accuracy-sensitive; NOT part of the default command.
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16=0
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION_MIN_SIZE_KB=256

The full reproduction serve command is then:

vllm serve amd/MiniMax-M3-MXFP4 \
  --port "${PORT:-8000}" \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --block-size 128 \
  --no-enable-prefix-caching \
  --language-model-only \
  --max-model-len "$MAX_MODEL_LEN" \
  --attention-backend TRITON_ATTN \
  --moe-backend aiter \
  --kv-cache-dtype fp8 \
  --tool-call-parser minimax_m3 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m3
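
The deltas between the default TP8 command and the reproduction command can be expressed as a transformation on the argument list. The sketch below is one way to do that; `to_benchmark_args` is a hypothetical helper, and dropping the two `--mm-encoder-*` flags is inferred from their absence in the reproduction command above.

```python
DEFAULT_ARGS = [
    "--tensor-parallel-size", "8",
    "--trust-remote-code",
    "--block-size", "128",
    "--attention-backend", "TRITON_ATTN",
    "--mm-encoder-tp-mode", "data",
    "--mm-encoder-attn-backend", "ROCM_AITER_FA",
    "--tool-call-parser", "minimax_m3",
    "--enable-auto-tool-choice",
    "--reasoning-parser", "minimax_m3",
]


def to_benchmark_args(default_args: list[str], max_model_len: int) -> list[str]:
    """Apply the benchmark deltas to the default serve arguments."""
    args: list[str] = []
    it = iter(default_args)
    for flag in it:
        if flag in ("--mm-encoder-tp-mode", "--mm-encoder-attn-backend"):
            next(it)  # text-only path: drop the encoder flag and its value
            continue
        if flag == "--tensor-parallel-size":
            next(it)  # discard the default value (8)
            args += [flag, "4"]
            continue
        args.append(flag)
    args += [
        "--kv-cache-dtype", "fp8",
        "--no-enable-prefix-caching",
        "--language-model-only",
        "--max-model-len", str(max_model_len),
        "--moe-backend", "aiter",
    ]
    return args
```

Prefix the result with `["vllm", "serve", "amd/MiniMax-M3-MXFP4"]` and export the environment variables listed above before launching.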

Selecting the AITER MXFP4 MoE backend (--moe-backend aiter) is a kernel-selection choice (same MXFP4 math, faster kernel), not a precision reduction. The INT4 all-reduce and fp8 KV-cache levers are throughput optimizations for high concurrency and are kept out of the default command pending accuracy validation. Performance figures on the InferenceX dashboard are reported externally by SemiAnalysis and have not been reproduced in this repository.

Troubleshooting

--block-size mismatch. MSA's sparse block size is 128; the vLLM KV cache block size must match. Using the default (16) misaligns the sparse attention indexing (on AMD it crashes with No common block size for 16).
Parsers. --tool-call-parser and --reasoning-parser both use minimax_m3 — distinct from minimax_m2 used by earlier releases.
Long context KV cache. See Context Length & GPU Memory above — cap --max-model-len or scale to multi-node TP if you OOM.
Vision encoder. The encoder is small, so at high TP the Encoder Parallel option runs it data-parallel (--mm-encoder-tp-mode data) to avoid TP comm overhead; it also turns on the vision-encoder attention backend (FlashInfer on NVIDIA, --mm-encoder-attn-backend FLASHINFER; AITER FlashAttention on AMD, ROCM_AITER_FA) and the host-shared-memory multimodal processor cache (--mm-processor-cache-type shm). For text-only workloads enable Text only (--language-model-only) to skip loading the encoder and free VRAM — it is mutually exclusive with Encoder Parallel.
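
Several of the pitfalls above can be caught before launch with a simple argument check. The sketch below is illustrative only: `check_serve_args` is a hypothetical helper, it understands only the space-separated `--flag value` form, and it encodes just the rules stated in this section.

```python
def check_serve_args(argv: list[str]) -> list[str]:
    """Return a list of human-readable problems found in a serve argv."""
    problems: list[str] = []

    def value_of(flag: str):
        # Return the token following `flag`, or None if the flag is absent.
        if flag in argv:
            idx = argv.index(flag)
            if idx + 1 < len(argv):
                return argv[idx + 1]
        return None

    if value_of("--block-size") != "128":
        problems.append("--block-size must be 128 (MSA sparse block size)")
    for parser_flag in ("--tool-call-parser", "--reasoning-parser"):
        value = value_of(parser_flag)
        if value is not None and value != "minimax_m3":
            problems.append(f"{parser_flag} should be minimax_m3, got {value}")
    if "--language-model-only" in argv and value_of("--mm-encoder-tp-mode") == "data":
        problems.append(
            "--language-model-only and --mm-encoder-tp-mode data are mutually exclusive"
        )
    return problems
```

An empty list means none of these specific issues were found; it does not guarantee that the configuration fits in memory.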
