mistralai/Mistral-Medium-3.5-128B
Mistral Medium 3.5 (128B) dense vision-language model with native FP8 weights and 256K context
Overview
Mistral-Medium-3.5 is a 128B dense vision-language model from Mistral AI. The
weights ship pre-quantized to FP8 (E4M3) with the vision tower, multimodal
projector, and lm_head retained in BF16. Image input is supported up to
1540x1540 (Pixtral-style encoder, patch size 14). Context length is 256K
via YaRN scaling (factor 64x over the 4K base).
Reasoning is opt-in per request via reasoning_effort: "high" — when set,
the model emits [THINK]...[/THINK] blocks that the Mistral reasoning
parser surfaces as message.reasoning_content. Tool calling uses the
[AVAILABLE_TOOLS] / [TOOL_CALLS] chat-template tokens.
Prerequisites
- Hardware: 8xH200 (recommended) or 4xB200; a single B200 or MI300X also fits the weights (~134 GB raw) but leaves little room for the 256K KV cache (see the sizing sketch after this list).
- vLLM nightly (Mistral 3.5 architecture support has not yet shipped in a stable release).
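For a back-of-envelope sense of why the full 256K context is tight on a single accelerator, the sketch below adds the FP8 weight footprint to the BF16 KV cache for one max-length sequence. The layer count, KV-head count, and head dimension are illustrative placeholders, not the published config; take the real values from the model's config.json.
# Rough VRAM estimate: FP8 weights plus BF16 KV cache at full context.
# n_layers, n_kv_heads, and head_dim are hypothetical placeholders --
# read the real values from the model's config.json.
n_params = 128e9                      # 128B parameters
weight_gb = n_params * 1 / 1e9        # ~1 byte/param at FP8 (BF16 vision tower/lm_head add a bit more)
n_layers, n_kv_heads, head_dim = 60, 8, 128   # placeholders, not the real config
seq_len = 262_144                     # 256K context
kv_gb = 2 * n_layers * n_kv_heads * head_dim * 2 * seq_len / 1e9  # K+V, 2 bytes each (BF16)
print(f"weights ~{weight_gb:.0f} GB, KV cache per full-length sequence ~{kv_gb:.0f} GB")
With these placeholder values a single 256K sequence alone costs on the order of 60 GB of KV cache on top of the ~134 GB of weights, which is why the hardware note above points at multi-GPU pools.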
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
This pulls in mistral_common >= 1.11.1 and transformers >= 5.4.0 automatically.
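A quick way to confirm the nightly wheel and its companion packages landed (a minimal check, nothing model-specific):
# Print the installed versions of the packages the nightly wheel pulls in.
from importlib.metadata import version
for pkg in ("vllm", "mistral_common", "transformers"):
    print(pkg, version(pkg))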
Launch command
8xH200 (or 8xB200):
vllm serve mistralai/Mistral-Medium-3.5-128B \
--tensor-parallel-size 8 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
Useful flags:
- --max-model-len: default 262144; lower it (e.g. 65536) to free VRAM for larger batch sizes on tighter GPU pools.
- --language-model-only: skip the vision encoder entirely for text-only workloads.
- --mm-encoder-tp-mode data: run the small vision encoder data-parallel instead of tensor-parallel, avoiding the all-reduce overhead.
- --limit-mm-per-prompt.image N: cap images per request.
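Once the server is up, you can sanity-check what it actually loaded with a minimal probe of the OpenAI-compatible endpoint; recent vLLM builds report the effective max_model_len in the /v1/models response, but verify the field against your build:
# List the served models; entries include the effective context length on recent vLLM builds.
import requests
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
for model in resp.json()["data"]:
    print(model)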
EAGLE speculative decoding
Mistral ships a dedicated EAGLE draft head at
mistralai/Mistral-Medium-3.5-128B-EAGLE.
It is not part of the default launch command; enable speculative decoding explicitly via --speculative_config.
Mistral's recommended serve command (from the EAGLE model card):
vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
--tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
--max_num_batched_tokens 16384 --max_num_seqs 128 --gpu_memory_utilization 0.8 \
--speculative_config '{"model":"mistralai/Mistral-Medium-3.5-128B-EAGLE","num_speculative_tokens":3,"method":"eagle","max_model_len":65536}'
The draft model is a 2-layer Mistral-style head trained on the 128B target; it shares the tokenizer and runs at TP=8 alongside the target.
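To check whether the draft head is paying off, time the same request against servers launched with and without --speculative_config and compare decode throughput; a rough probe (prompt and token counts are arbitrary):
# Rough decode-throughput probe: run identical requests against servers
# started with and without the EAGLE draft and compare tokens/sec.
import time
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
start = time.perf_counter()
resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "Write a 500-word story about a lighthouse."}],
    max_tokens=1024, temperature=0.0,
)
elapsed = time.perf_counter() - start
print(f"{resp.usage.completion_tokens / elapsed:.1f} output tokens/sec")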
Client usage
Reasoning + tool calling against the OpenAI-compatible endpoint:
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
model="mistralai/Mistral-Medium-3.5-128B",
messages=[{"role": "user", "content": "Plan a 3-day Paris trip."}],
extra_body={"reasoning_effort": "high"},
temperature=0.7, max_tokens=4096,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
Image input (vision):
resp = client.chat.completions.create(
model="mistralai/Mistral-Medium-3.5-128B",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://..."}},
{"type": "text", "text": "Describe this image."},
],
}],
max_tokens=512,
)
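Local files work the same way via a base64 data URL (a sketch; photo.jpg is a placeholder path, and the encoder accepts images up to 1540x1540):
# Send a local image as a base64 data URL; "photo.jpg" is a placeholder.
import base64
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)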
Troubleshooting
- OOM at full 256K context on H200: drop --max-model-len to 131072 or 65536, or set --language-model-only if you don't need vision.
- reasoning_effort rejected: only "none" and "high" are accepted by the chat template; anything else raises an exception.
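If you want to fail fast on unsupported reasoning_effort values, catch the server-side rejection on the client; a hedged sketch (the exact status code and error text depend on the vLLM version):
# Only "none" and "high" pass the chat template; other values should come
# back as an API error (exact exception/status may vary by vLLM version).
from openai import OpenAI, APIStatusError
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
try:
    client.chat.completions.create(
        model="mistralai/Mistral-Medium-3.5-128B",
        messages=[{"role": "user", "content": "Hi"}],
        extra_body={"reasoning_effort": "medium"},  # not an accepted value
    )
except APIStatusError as e:
    print("rejected:", e.status_code, e.message)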