
mistralai/Mistral-Small-4-119B-2603

Mistral Small 4 (119B MoE, 6.5B active) — multimodal hybrid instruct + reasoning model with native FP8 weights and 256K context

MoE · 119B total / 6.5B active · 262,144 ctx · vLLM 0.20.0+ · multimodal

Overview

Mistral Small 4 119B is a hybrid Mixture-of-Experts model from Mistral AI: 128 experts, 4 active per token (plus 1 shared expert), 119B total parameters with 6.5B activated per token. It unifies the capabilities of three earlier Mistral families — Instruct, Reasoning (formerly Magistral), and Devstral — into a single checkpoint, and each request can switch between instant replies and step-by-step reasoning via reasoning_effort.

The weights ship pre-quantized to FP8 E4M3 (the vision tower, multimodal projector, and lm_head stay in BF16). Input is multimodal: text plus images, handled by a Pixtral-style encoder (1540x1540 resolution, patch size 14). Long context comes from YaRN scaling at factor 128x over the 8K base, which is why the config advertises 1M positions; Mistral recommends serving at 256K.
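
A quick way to sanity-check the context setup before serving (a sketch; the exact field layout is an assumption, since multimodal configs often nest these under text_config):

from transformers import AutoConfig

# Only downloads config.json, not the FP8 weights.
cfg = AutoConfig.from_pretrained("mistralai/Mistral-Small-4-119B-2603")
text_cfg = getattr(cfg, "text_config", cfg)
print(getattr(text_cfg, "rope_scaling", None))  # expect a YaRN entry, factor 128
print(text_cfg.max_position_embeddings)         # advertised positions (1M)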

Mistral Small 4 also ships two companion checkpoints:

  • mistralai/Mistral-Small-4-119B-2603-NVFP4: NVFP4-quantized weights for single-B200 serving (or full 256K context on 2xB200).
  • mistralai/Mistral-Small-4-119B-2603-eagle: the EAGLE draft head for speculative decoding.

Prerequisites

  • Hardware: 2xB200 / 2xH200 / 2xMI300X (FP8 default) or 1xB200 with reduced context (NVFP4).
  • vLLM >= 0.20.0 — earlier releases load the model but hit the Mistral tool-call / reasoning parser bugs fixed in PR #39217 (shipped in 0.19.1) and lack the grammar factory that landed in 0.20.0.

Install vLLM

uv venv && source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
uv pip install -U "mistral_common>=1.11.0" transformers

Verify both pins with python -c "import vllm, mistral_common; print(vllm.__version__, mistral_common.__version__)".

Launch command

FP8 default (2xB200 / 2xH200):

vllm serve mistralai/Mistral-Small-4-119B-2603 \
  --max-model-len 262144 \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 --max-num-seqs 128 \
  --gpu-memory-utilization 0.8
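
Once the server reports ready, a quick smoke test against the OpenAI-compatible endpoint (assuming the default port 8000):

from openai import OpenAI

# List served models; the output should include the model ID above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
print([m.id for m in client.models.list().data])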

NVFP4 variant (shown here for 2xB200 at the full 256K context; for a single B200, set --tensor-parallel-size 1 and reduce --max-model-len):

vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 262144 \
  --tensor-parallel-size 2 \
  --attention-backend TRITON_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 --max-num-seqs 128 \
  --gpu-memory-utilization 0.8

Mistral publishes a custom Docker image mistralllm/vllm-ms4:latest with patched tool-call / reasoning parsing — use it if you're pinned to a vLLM version below 0.20.0 and hit either of those issues.

EAGLE speculative decoding

The EAGLE draft head is not included in the default config; enable it by passing --speculative-config:

vllm serve mistralai/Mistral-Small-4-119B-2603 \
  --max-model-len 262144 \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-num-batched-tokens 16384 --max-num-seqs 128 \
  --gpu-memory-utilization 0.8 \
  --speculative-config '{
    "model": "mistralai/Mistral-Small-4-119B-2603-eagle",
    "num_speculative_tokens": 3,
    "method": "eagle",
    "max_model_len": 65536
  }'
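
To check that the draft head is paying off, compare decode throughput with and without --speculative-config. A rough client-side probe (illustrative only; the prompt is arbitrary):

import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
t0 = time.perf_counter()
resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Write a 500-word essay on RISC-V."}],
    temperature=0.0, max_tokens=1024,
)
elapsed = time.perf_counter() - t0
# Completion tokens per wall-clock second; run once per config and compare.
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s")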

Client usage

Reasoning is opt-in per request via reasoning_effort. Only "none" and "high" are accepted — anything else raises an exception in the chat template. Recommended sampling:

  • reasoning_effort="none": temperature 0.0–0.7 depending on task.
  • reasoning_effort="high": temperature=0.7.

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Plan a 3-day Paris trip."}],
    extra_body={"reasoning_effort": "high"},
    temperature=0.7, max_tokens=4096,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)

Image input (vision):

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://..."}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
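
For a local file, the standard OpenAI-style base64 data URL works in place of a remote URL (the path here is an example):

import base64

# Encode a local image and embed it as a data URL.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
image_url = {"url": f"data:image/jpeg;base64,{b64}"}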

Tool calling follows the standard OpenAI schema — the chat template emits [AVAILABLE_TOOLS] / [TOOL_CALLS] tokens which the mistral tool-call parser surfaces as message.tool_calls.
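
A minimal round trip, reusing the client from above; the get_weather tool is a made-up example:

# Define one tool in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# The mistral parser surfaces [TOOL_CALLS] output as structured tool_calls.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)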

Troubleshooting

  • OOM at full 256K context on 2xH200: drop --max-model-len to 131072 or 65536, or move to NVFP4.
  • reasoning_effort rejected: the chat template only accepts "none" and "high".
  • NVFP4 weight-loader errors on older wheels: try Mistral's mistralllm/vllm-ms4:latest Docker image, which carries the parser / weight-loader fixes Mistral ships ahead of the upstream merge.
