mistralai/Mistral-Small-4-119B-2603
Mistral Small 4 (119B MoE, 6.5B active) — multimodal hybrid instruct + reasoning model with native FP8 weights and 256K context
Overview
Mistral Small 4 119B is a hybrid Mixture-of-Experts model from Mistral AI:
128 experts, 4 active per token (plus 1 shared expert), 119B total
parameters with 6.5B activated per token. It unifies the capabilities of
three earlier Mistral families — Instruct, Reasoning (formerly
Magistral), and Devstral — into a single checkpoint, with per-request
toggling between fast instant replies and step-by-step reasoning via
reasoning_effort.
The weights ship pre-quantized to FP8 E4M3 (the vision tower, multimodal
projector, and lm_head are kept in BF16). Multimodal input accepts text
and images (Pixtral-style encoder, 1540x1540, patch size 14). Context length
is 256K via YaRN scaling (factor 128x over the 8K base) — the config
advertises 1M positions but Mistral recommends serving at 256K.
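The arithmetic behind those two numbers, for reference:

base_ctx = 8192                # pre-YaRN base context window
yarn_factor = 128              # YaRN scaling factor
print(base_ctx * yarn_factor)  # 1048576 -> the ~1M positions the config advertises
print(256 * 1024)              # 262144  -> the recommended serving length (--max-model-len)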
Mistral Small 4 also ships two companion checkpoints:
- NVFP4 (mistralai/Mistral-Small-4-119B-2603-NVFP4) — 4-bit compressed-tensors weights (~72 GB raw); served with the TRITON_MLA backend.
- EAGLE draft head (mistralai/Mistral-Small-4-119B-2603-eagle) — a 2-layer Mistral-style draft trained on the 119B target; enable via the spec_decoding feature.
Prerequisites
- Hardware: 2xB200 / 2xH200 / 2xMI300X (FP8 default) or 1xB200 with reduced context (NVFP4).
- vLLM >= 0.20.0 — earlier releases load the model but trip on the Mistral tool-call / reasoning parser bugs fixed in PR #39217 (included in 0.19.1+) and lack the grammar factory that landed in 0.20.0.
Install vLLM
uv venv && source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
uv pip install -U "mistral_common>=1.11.0" transformers
Verify with python -c "import mistral_common; print(mistral_common.__version__)".
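Since the parser fixes require 0.20.0, it is worth checking the vLLM version as well:

python -c "import vllm; print(vllm.__version__)"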
Launch command
FP8 default (2xB200 / 2xH200):
vllm serve mistralai/Mistral-Small-4-119B-2603 \
--max-model-len 262144 \
--tensor-parallel-size 2 \
--attention-backend FLASH_ATTN_MLA \
--tool-call-parser mistral --enable-auto-tool-choice \
--reasoning-parser mistral \
--max-num-batched-tokens 16384 --max-num-seqs 128 \
--gpu-memory-utilization 0.8
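Once the server is up, a quick sanity check is to list the served models through the OpenAI-compatible endpoint. A minimal sketch, assuming the default port 8000:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# Should print mistralai/Mistral-Small-4-119B-2603 once the weights are loaded.
for model in client.models.list():
    print(model.id)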
NVFP4 variant (shown for 2xB200 at the full 256K context; on a single B200, set --tensor-parallel-size 1 and reduce --max-model-len):
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
--max-model-len 262144 \
--tensor-parallel-size 2 \
--attention-backend TRITON_MLA \
--tool-call-parser mistral --enable-auto-tool-choice \
--reasoning-parser mistral \
--max-num-batched-tokens 16384 --max-num-seqs 128 \
--gpu-memory-utilization 0.8
Mistral publishes a custom Docker image
mistralllm/vllm-ms4:latest
with patched tool-call / reasoning parsing — use it if you're pinned to a
vLLM version below 0.20.0 and hit either of those issues.
EAGLE speculative decoding
The EAGLE draft head is not included in the default config — toggle the
spec_decoding feature (or pass --speculative-config directly):
vllm serve mistralai/Mistral-Small-4-119B-2603 \
--max-model-len 262144 \
--tensor-parallel-size 2 \
--attention-backend FLASH_ATTN_MLA \
--tool-call-parser mistral --enable-auto-tool-choice \
--reasoning-parser mistral \
--max-num-batched-tokens 16384 --max-num-seqs 128 \
--gpu-memory-utilization 0.8 \
--speculative-config '{
  "model": "mistralai/Mistral-Small-4-119B-2603-eagle",
  "num_speculative_tokens": 3,
  "method": "eagle",
  "max_model_len": 65536
}'
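The same dict works for offline batch inference through the vLLM Python API. A minimal sketch, assuming a recent vLLM wheel whose LLM constructor accepts speculative_config:

from vllm import LLM, SamplingParams

# Offline analogue of the serve command above; the speculative_config dict
# is passed straight through to the engine.
llm = LLM(
    model="mistralai/Mistral-Small-4-119B-2603",
    tensor_parallel_size=2,
    max_model_len=65536,
    speculative_config={
        "model": "mistralai/Mistral-Small-4-119B-2603-eagle",
        "num_speculative_tokens": 3,
        "method": "eagle",
    },
)
outputs = llm.generate(
    ["Explain speculative decoding in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)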
Client usage
Reasoning is opt-in per request via reasoning_effort. Only "none" and
"high" are accepted — anything else raises an exception in the chat
template. Recommended sampling:
- reasoning_effort="none": temperature 0.0–0.7 depending on the task.
- reasoning_effort="high": temperature 0.7.
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="mistralai/Mistral-Small-4-119B-2603",
messages=[{"role": "user", "content": "Plan a 3-day Paris trip."}],
extra_body={"reasoning_effort": "high"},
temperature=0.7, max_tokens=4096,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
Image input (vision):
resp = client.chat.completions.create(
model="mistralai/Mistral-Small-4-119B-2603",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://..."}},
{"type": "text", "text": "Describe this image."},
],
}],
max_tokens=512,
)
Tool calling follows the standard OpenAI schema — the chat template emits
[AVAILABLE_TOOLS] / [TOOL_CALLS] tokens which the mistral tool-call
parser surfaces as message.tool_calls.
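A minimal sketch (the get_weather tool and its schema are illustrative, not part of the model):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Hypothetical tool definition in the standard OpenAI function schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)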
Troubleshooting
- OOM at full 256K context on 2xH200: drop --max-model-len to 131072 or 65536, or move to NVFP4.
- reasoning_effort rejected: the chat template only accepts "none" and "high".
- NVFP4 weight-loader errors on older wheels: try Mistral's mistralllm/vllm-ms4:latest Docker image, which carries the parser / weight-loader fixes Mistral ships ahead of the upstream merge.