Mistral AI

mistralai/Ministral-3-14B-Instruct-2512

Ministral 3 Instruct family (3B/8B/14B) with FP8 weights, vision support, and 256K context

dense · 14B · 262,144 ctx · vLLM 0.11.0+ · multimodal

Overview

Ministral-3 Instruct comes with FP8 weights in 3 different sizes:

  • 3B: tied embeddings (shares embedding and output layers)
  • 8B and 14B: independent embedding and output layers

Each variant has vision support and a 256K context length. Smaller models offer faster inference at the cost of lower quality; pick the best trade-off for your use case.

Prerequisites

  • Hardware: 1x NVIDIA H200 (sufficient for all three sizes thanks to FP8 weights), or 1x AMD MI300X (verified) / MI325X / MI355X
  • vLLM >= 0.11.0

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Launch command

vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral

For 8B, use mistralai/Ministral-3-8B-Instruct-2512; for 3B, use mistralai/Ministral-3-3B-Instruct-2512.

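  • --enable-auto-tool-choice: required for tool usage
  • --tool-call-parser mistral: required for tool usage
  • --max-model-len: defaults to 262144 (the full context length); reduce it to save memory
  • --max-num-batched-tokens: caps the tokens processed per scheduler step; larger values favor throughput, smaller values favor inter-token latency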
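
Loading the weights takes some time, so requests sent right after launch may be refused. A minimal sketch of a readiness check, assuming the server listens on the default port 8000 used elsewhere in this guide. `fetch_model_ids` and `wait_for_server` are hypothetical helper names, not part of vLLM; they only rely on the OpenAI-compatible `/v1/models` endpoint that the client example also queries.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_model_ids(base_url):
    """Return the model IDs listed by the OpenAI-compatible /models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload["data"]]


def wait_for_server(base_url="http://localhost:8000/v1", timeout_s=600,
                    poll_s=5, fetch=fetch_model_ids, sleep=time.sleep):
    """Poll until the server lists at least one model or the timeout expires.

    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            ids = fetch(base_url)
            if ids:
                return ids
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is not accepting connections yet
        sleep(poll_s)
    raise TimeoutError(f"no model served at {base_url} after {timeout_s}s")
```

Calling `wait_for_server()` before the first request blocks until the model is listed, then returns the served model IDs. The 600-second default is an arbitrary choice for illustration; adjust it to your download and load times.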

AMD (MI300X / MI325X / MI355X)

Verified on an 8-GPU MI300X node with TP=1 per variant.

3B

docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  --privileged --ipc=host -p 8000:8000 \
  -e VLLM_ROCM_USE_AITER=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:latest mistralai/Ministral-3-3B-Instruct-2512 \
    --tokenizer_mode mistral \
    --tensor-parallel-size 1 \
    --config_format mistral \
    --load_format mistral \
    --enable-auto-tool-choice \
    --tool-call-parser mistral

8B

docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  --privileged --ipc=host -p 8000:8000 \
  -e VLLM_ROCM_USE_AITER=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:latest mistralai/Ministral-3-8B-Instruct-2512 \
    --tokenizer_mode mistral \
    --tensor-parallel-size 1 \
    --config_format mistral \
    --load_format mistral \
    --max-model-len auto \
    --max-num-batched-tokens 8192 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral

14B

docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  --privileged --ipc=host -p 8000:8000 \
  -e VLLM_ROCM_USE_AITER=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:latest mistralai/Ministral-3-14B-Instruct-2512 \
    --tokenizer_mode mistral \
    --tensor-parallel-size 1 \
    --config_format mistral \
    --load_format mistral \
    --max-num-seqs 256 \
    --max-model-len auto \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 8192 \
    --enable-auto-tool-choice \
    --tool-call-parser mistral

Client Usage

Vision reasoning example:

from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

def load_system_prompt(repo_id, filename):
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(path) as f:
        prompt = f.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    return prompt.format(name=repo_id.split("/")[-1], today=today, yesterday=yesterday)

SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "What action should I take here?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ],
    temperature=0.15,
    max_tokens=4096,  # prompt tokens + max_tokens must fit within --max-model-len
)
print(response.choices[0].message.content)

Function calling and text-only examples follow a similar OpenAI-compatible pattern.
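
As an illustration of the function-calling pattern, here is a minimal sketch of the request and response handling. It assumes the server was started with --enable-auto-tool-choice and --tool-call-parser mistral as in the launch command. The `get_weather` tool and the helpers `build_tool_request` and `parse_tool_calls` are hypothetical names introduced for this example only.

```python
import json

# Hypothetical tool for illustration; not part of the model card or vLLM.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(model, question, tools):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": tools,
        "tool_choice": "auto",
        "temperature": 0.15,
    }


def parse_tool_calls(message):
    """Extract (name, arguments) pairs from an assistant message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


request = build_tool_request(
    "mistralai/Ministral-3-14B-Instruct-2512",
    "What is the weather in Paris right now?",
    [WEATHER_TOOL],
)
# With the client from the vision example:
#   response = client.chat.completions.create(**request)
#   message = response.choices[0].message.model_dump()
#   print(parse_tool_calls(message))
```

A text-only request is the same call without `tools` and `tool_choice`, with a plain string as the user `content`.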

Benchmarking (MI300X verification)

Serving benchmarks used vllm bench serve with random 1024-token input/output, --max-concurrency 32, and --num-prompts 100 against each variant above. Accuracy used lm-eval GSM8K (5-shot, flexible-extract / strict-match filters).
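
The exact command line behind these numbers is not reproduced in this guide. The sketch below assembles one plausible `vllm bench serve` invocation from the settings described above. The flag names are those of recent vLLM releases and should be checked against `vllm bench serve --help` for your version; `bench_serve_command` is a hypothetical helper.

```python
import shlex


def bench_serve_command(model, input_len=1024, output_len=1024,
                        max_concurrency=32, num_prompts=100,
                        base_url="http://localhost:8000"):
    """Assemble a `vllm bench serve` invocation as an argument list."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--base-url", base_url,
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--max-concurrency", str(max_concurrency),
        "--num-prompts", str(num_prompts),
    ]


cmd = bench_serve_command("mistralai/Ministral-3-14B-Instruct-2512")
print(shlex.join(cmd))  # paste into a shell, or pass `cmd` to subprocess.run
```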

Throughput

Variant   Output tok/s   Mean TTFT (ms)   Mean TPOT (ms)
3B        3842           288              6.42
8B        2468           1117             9.48
14B       1941           1229             12.22

GSM8K (5-shot exact_match)

Variant   flexible-extract    strict-match
3B        0.7786 ± 0.0114     0.7445 ± 0.0120
8B        0.8560 ± 0.0097     0.8491 ± 0.0099
14B       0.8795 ± 0.0090     0.8764 ± 0.0091

14B full vllm bench serve output (TP=1, MI300X):

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  52.76
Total input tokens:                      102400
Total generated tokens:                  102400
Request throughput (req/s):              1.90
Output token throughput (tok/s):         1940.79
Peak output token throughput (tok/s):    3126.00
Peak concurrent requests:                64.00
Total token throughput (tok/s):          3881.58
---------------Time to First Token----------------
Mean TTFT (ms):                          1228.72
Median TTFT (ms):                        952.25
P99 TTFT (ms):                           2925.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.22
Median TPOT (ms):                        12.25
P99 TPOT (ms):                           12.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.22
Median ITL (ms):                         11.78
P99 ITL (ms):                            13.66
==================================================
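
The headline figures in this output are related by simple arithmetic, which makes a quick consistency check possible. The sketch below recomputes the throughput numbers from the token counts and duration, and estimates the mean end-to-end request latency under the assumption that every request generated the same number of output tokens (total generated tokens divided by request count).

```python
# Figures copied from the 14B benchmark output above.
duration_s = 52.76
requests = 100
input_tokens = 102400
output_tokens = 102400
mean_ttft_ms = 1228.72
mean_tpot_ms = 12.22
output_len = output_tokens // requests  # tokens generated per request

request_throughput = requests / duration_s
output_throughput = output_tokens / duration_s
total_throughput = (input_tokens + output_tokens) / duration_s

# Mean end-to-end latency: first token, then (output_len - 1) decode steps.
mean_latency_s = (mean_ttft_ms + (output_len - 1) * mean_tpot_ms) / 1000

print(f"request throughput:   {request_throughput:.2f} req/s")
print(f"output throughput:    {output_throughput:.1f} tok/s")
print(f"total throughput:     {total_throughput:.1f} tok/s")
print(f"mean request latency: {mean_latency_s:.2f} s")
```

The recomputed throughputs agree with the reported values to within the rounding of the two-decimal duration.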

Troubleshooting

  • OOM: lower --max-model-len (e.g. 32768) or use the 3B/8B variant.
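
To see why lowering --max-model-len helps, note that the KV cache needed for one full-length sequence grows linearly with context length. The sketch below uses the standard estimate for attention with grouped KV heads. The architecture values are HYPOTHETICAL placeholders, not the Ministral 3 configuration; substitute the values from the model's config. Sliding-window layers or a quantized KV cache would change the result.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_len, **arch):
    """KV cache size in GiB for one sequence filling `context_len` tokens."""
    return context_len * kv_cache_bytes_per_token(**arch) / 2**30


# HYPOTHETICAL architecture values, for illustration only; read the real
# ones from the model's config before relying on the result.
arch = dict(num_layers=40, num_kv_heads=8, head_dim=128, dtype_bytes=2)

for context_len in (262144, 32768):
    size = kv_cache_gib(context_len, **arch)
    print(f"{context_len:>7} tokens -> {size:.1f} GiB per full-length sequence")
```

Whatever the real per-token cost is, cutting the context from 262144 to 32768 tokens divides the per-sequence KV cache requirement by eight.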

References