vLLM/Recipes
Mistral AI

mistralai/Mistral-Medium-3.5-128B

Mistral Medium 3.5 (128B) dense vision-language model with native FP8 weights and 256K context


Overview

Mistral-Medium-3.5 is a 128B dense vision-language model from Mistral AI. The weights ship pre-quantized to FP8 (E4M3) with the vision tower, multimodal projector, and lm_head retained in BF16. Image input is supported up to 1540x1540 (Pixtral-style encoder, patch size 14). Context length is 256K via YaRN scaling (factor 64x over the 4K base).
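The context figure follows directly from the scaling factor; a quick arithmetic check:

```python
# YaRN stretches the RoPE context window by a fixed factor;
# the advertised window is just base * factor.
base_ctx = 4_096      # pre-YaRN base context stated above
yarn_factor = 64      # YaRN scaling factor stated above
print(base_ctx * yarn_factor)  # 262144, i.e. the 256K window
```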

Reasoning is opt-in per request via reasoning_effort: "high" — when set, the model emits [THINK]...[/THINK] blocks that the Mistral reasoning parser surfaces as message.reasoning_content. Tool calling uses the [AVAILABLE_TOOLS] / [TOOL_CALLS] chat-template tokens.

Prerequisites

  • Hardware: 8xH200 (recommended) or 4xB200; a single B200 or MI300X also holds the weights (~134 GB raw) but leaves little room for the 256K KV cache.
  • vLLM nightly (Mistral 3.5 architecture support has not yet shipped in a stable release).
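The ~134 GB figure is easy to sanity-check (a back-of-envelope sketch; the BF16 extras size is an assumption back-solved from the quoted total, not read from the checkpoint):

```python
# Rough weight footprint, before any KV cache is allocated.
fp8_weights_gb = 128e9 * 1 / 1e9   # 128B params at 1 byte each (FP8 E4M3)
bf16_extras_gb = 3e9 * 2 / 1e9     # ~3B params at 2 bytes (assumed: vision
                                   # tower + projector + lm_head in BF16)
total_gb = fp8_weights_gb + bf16_extras_gb
print(f"~{total_gb:.0f} GB")       # ~134 GB raw, matching the figure above
```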

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly

This pulls in mistral_common >= 1.11.1 and transformers >= 5.4.0 automatically.

Launch command

8xH200 (or 8xB200):

vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 8 \
  --tokenizer-mode mistral --config-format mistral --load-format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral

Useful flags:

  • --max-model-len: default 262144; lower it (e.g. 65536) to free VRAM for larger batch sizes on tighter GPU pools.
  • --language-model-only: skip the vision encoder entirely for text-only workloads.
  • --mm-encoder-tp-mode data: run the small vision encoder data-parallel instead of tensor-parallel — avoids the all-reduce overhead.
  • --limit-mm-per-prompt.image N: cap images per request.

EAGLE speculative decoding

Mistral ships a dedicated EAGLE draft head at mistralai/Mistral-Medium-3.5-128B-EAGLE. It is not loaded by default; enable it by passing --speculative_config at launch, as in the command below.

Mistral's recommended serve command (from the EAGLE model card):

vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
  --max_num_batched_tokens 16384 --max_num_seqs 128 --gpu_memory_utilization 0.8 \
  --speculative_config '{"model":"mistralai/Mistral-Medium-3.5-128B-EAGLE","num_speculative_tokens":3,"method":"eagle","max_model_len":65536}'

The draft model is a 2-layer Mistral-style head trained on the 128B target; it shares the tokenizer and runs at TP=8 alongside the target.
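To build intuition for num_speculative_tokens=3: under an idealized chain-acceptance model, the expected tokens emitted per target forward pass is a truncated geometric sum. This is a sketch, and the 0.8 acceptance rate below is an assumed illustration, not a measured number for this draft head:

```python
# Expected tokens emitted per target-model step when each of k draft
# tokens is accepted independently with probability a (idealized model).
def expected_tokens_per_step(a: float, k: int) -> float:
    # 1 guaranteed token from the target, plus a + a^2 + ... + a^k drafts
    return sum(a**i for i in range(k + 1))

# k=3 as in the launch command above; a=0.8 is an assumed acceptance rate
print(round(expected_tokens_per_step(0.8, 3), 3))  # 2.952 tokens/step
```

With a=0, the formula degrades gracefully to 1 token per step, i.e. plain autoregressive decoding.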

Client usage

Reasoning against the OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "Plan a 3-day Paris trip."}],
    extra_body={"reasoning_effort": "high"},
    temperature=0.7, max_tokens=4096,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)

Image input (vision):

resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://..."}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)

Troubleshooting

  • OOM at full 256K context on H200: drop --max-model-len to 131072 or 65536, or set --language-model-only if you don't need vision.
  • reasoning_effort rejected: only "none" and "high" are accepted by the chat template — anything else raises an exception.
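A small client-side guard avoids the server-side exception entirely (a sketch; the accepted values mirror the two listed above):

```python
# Only "none" and "high" survive the chat template's validation,
# so reject anything else before the request leaves the client.
ALLOWED_EFFORT = {"none", "high"}

def reasoning_extra_body(effort: str) -> dict:
    if effort not in ALLOWED_EFFORT:
        raise ValueError(
            f"reasoning_effort must be one of {sorted(ALLOWED_EFFORT)}, got {effort!r}"
        )
    return {"reasoning_effort": effort}

print(reasoning_extra_body("high"))  # {'reasoning_effort': 'high'}
```

Pass the returned dict as extra_body in chat.completions.create, as in the client example above.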
