
mistralai/Ministral-3-14B-Instruct-2512

Ministral 3 Instruct family (3B/8B/14B) with FP8 weights, vision support, and 256K context

Dense · 14B parameters · 262,144-token context · vLLM 0.11.0+ · multimodal

Overview

Ministral 3 Instruct ships with FP8 weights in three sizes:

  • 3B: tied embeddings (shares embedding and output layers)
  • 8B and 14B: independent embedding and output layers

Each variant has vision support and a 256K context length. Smaller models offer faster inference at the cost of lower quality; pick the best trade-off for your use case.
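
Because FP8 stores roughly one byte per parameter, the weight footprint of each variant is easy to estimate. The sketch below is illustrative only: it ignores the KV cache, activations, and runtime overhead.

```python
# Rough FP8 weight footprint: ~1 byte per parameter (illustrative only;
# ignores the KV cache, activations, and runtime overhead).
def fp8_weight_gib(num_params: float) -> float:
    # 1 byte per parameter -> total bytes -> GiB
    return num_params / 1024**3

for name, params in {"3B": 3e9, "8B": 8e9, "14B": 14e9}.items():
    print(f"{name}: ~{fp8_weight_gib(params):.1f} GiB of weights")
```

Even the 14B variant's weights land around 13 GiB, which is why all three sizes fit on a single large GPU with room left for the KV cache.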

Prerequisites

  • Hardware: 1x H200 (sufficient for all three sizes thanks to FP8 weights)
  • vLLM >= 0.11.0

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
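
To confirm the install meets the 0.11.0 floor, a minimal check can be done in Python. This sketch assumes plain `X.Y.Z` version strings; pre-release suffixes (e.g. `rc1`) would need extra handling.

```python
# Minimal version-floor check for the vLLM requirement above.
# Assumes plain "X.Y.Z" version strings (no pre-release suffixes).
from importlib.metadata import version

def meets_minimum(installed: str, minimum: str = "0.11.0") -> bool:
    def parse(v: str) -> tuple:
        # Drop any local segment ("+cu128") and compare the first three parts.
        return tuple(int(p) for p in v.split("+")[0].split(".")[:3])
    return parse(installed) >= parse(minimum)

# In the activated venv:
# print(meets_minimum(version("vllm")))  # should print True
```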

Launch command

vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral

For the smaller variants, substitute the model name:

  • 8B: mistralai/Ministral-3-8B-Instruct-2512
  • 3B: mistralai/Ministral-3-3B-Instruct-2512

  • --enable-auto-tool-choice and --tool-call-parser mistral: both required for tool usage
  • --max-model-len defaults to 262144; reduce to save memory
  • --max-num-batched-tokens balances throughput and latency
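
To see why lowering --max-model-len saves memory, a back-of-envelope KV-cache estimate helps. The layer and head counts below are assumptions for illustration, not the published Ministral 3 configuration; check the model's config.json for the real values.

```python
# Back-of-envelope KV-cache sizing to reason about --max-model-len.
# num_layers / num_kv_heads / head_dim are ASSUMED for illustration;
# read the real values from the model's config.json.
def kv_cache_gib(context_len: int, num_layers: int = 40,
                 num_kv_heads: int = 8, head_dim: int = 128,
                 dtype_bytes: int = 1) -> float:  # 1 byte for an FP8 KV cache
    # 2x for keys and values, per token, across all layers.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return context_len * bytes_per_token / 1024**3

print(kv_cache_gib(262_144))  # full 256K context
print(kv_cache_gib(32_768))   # reduced context
```

Under these assumed dimensions, one sequence at the full 256K context needs ~20 GiB of KV cache versus ~2.5 GiB at 32K, which is the trade-off the flag controls.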

Client Usage

Vision reasoning example:

from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

def load_system_prompt(repo_id, filename):
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(path) as f:
        prompt = f.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    return prompt.format(name=repo_id.split("/")[-1], today=today, yesterday=yesterday)

SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "What action should I take here?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ],
    temperature=0.15,
    max_tokens=8192,  # prompt + completion must fit within --max-model-len
)
print(response.choices[0].message.content)

Function calling and text-only examples follow a similar OpenAI-compatible pattern.
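
As a sketch of that function-calling flow, the snippet below builds an OpenAI-style tool schema; get_weather is a made-up tool for illustration, not part of the model release.

```python
# Hedged sketch of tool calling against the same server.
# get_weather is a HYPOTHETICAL tool, defined here only for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]

# With the server from the launch command running (--enable-auto-tool-choice
# and --tool-call-parser mistral set), pass these through the same client as
# in the vision example:
#
#   response = client.chat.completions.create(
#       model=model, messages=messages, tools=tools,
#       tool_choice="auto", temperature=0.15,
#   )
#   call = response.choices[0].message.tool_calls[0]
#   # call.function.name should be "get_weather", with a JSON "city" argument
```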

Troubleshooting

  • OOM: lower --max-model-len (e.g. 32768) or use the 3B/8B variant.
