vLLM/Recipes
Mistral AI

mistralai/Voxtral-Mini-4B-Realtime-2602

Multilingual realtime speech transcription (13 languages) with a natively streaming causal audio encoder; configurable 80ms–2.4s transcription delay served via vLLM's Realtime API

Matches offline open-source ASR accuracy at 480ms delay; >12.5 tok/s on a single 16GB GPU

dense · 4.4B · 131,072 ctx · vLLM 0.20.0+ · multimodal
Guide

Overview

Voxtral Mini 4B Realtime is Mistral AI's multilingual realtime speech transcription model — among the first open-source ASR systems to hit accuracy comparable to offline models with a <500ms end-to-end delay.

  • Architecture: ≈3.4B Mistral text LM + ≈970M custom causal audio encoder, both using sliding-window attention to support "infinite" streaming.
  • Languages: 13 (English, French, Spanish, German, Russian, Chinese, Japanese, Italian, Portuguese, Dutch, Arabic, Hindi, Korean).
  • Configurable delay: any multiple of 80ms between 80ms and 1200ms, plus 2400ms as a standalone value. Default is 480ms, the sweet spot Mistral identified between latency and accuracy.
  • Context: 131072 tokens (≈3h of audio at 80ms/token).

Prerequisites

  • Hardware: a single GPU with ≥ 16 GB VRAM (BF16 weights only).
  • vLLM: >= 0.20.0 — the Voxtral Realtime architecture has been registered since v0.16.0, but v0.20.0 is the first stable release with the architecture documented in the supported-models list.

Install vLLM and audio dependencies

uv venv && source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
uv pip install -U "mistral-common[audio]>=1.9.0" transformers

Verify that the audio extras pulled in mistral_common >= 1.9.0:

python -c "import mistral_common; print(mistral_common.__version__)"

Launch command

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --tokenizer-mode mistral \
  --compilation-config '{"cudagraph_mode": "PIECEWISE"}'

--tokenizer-mode mistral is required: Voxtral Realtime's tokenizer only loads through mistral_common. Omitting it raises a tokenizer initialization error at startup.

Once it starts you should see the Realtime API route registered:

Route: /v1/realtime, Endpoint: realtime_endpoint
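
Once the route is registered, a quick sanity check is to query the OpenAI-compatible /v1/models endpoint and confirm the model id is being served. A minimal stdlib sketch (the localhost:8000 base URL is an assumption; adjust it to wherever you launched vllm serve):

```python
import json
from urllib.request import urlopen

def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

def check(base_url: str = "http://localhost:8000") -> list[str]:
    """Ask the running server which models it is serving."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return served_model_ids(json.load(resp))
```

Calling check() against a healthy server should return a list containing mistralai/Voxtral-Mini-4B-Realtime-2602.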

Tuning flags

  • --max-num-batched-tokens — balance throughput vs latency (higher means more throughput at the cost of per-request latency).
  • --max-model-len — defaults to 131072 (≈3h). Reduce it if you know your sessions are shorter; this cuts the memory reserved for pre-computed RoPE frequencies. As a rule of thumb, one text token ≈ 80ms of audio, so a 1h meeting needs --max-model-len >= 3600/0.08 = 45000.
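
The rule of thumb above can be turned into a small sizing helper. This is a sketch, not an official formula: the 10% margin is a guess, left as headroom for prompt and control tokens.

```python
# Sizing helper for --max-model-len: one transcription token per 80ms of audio.
TOKEN_MS = 80  # ms of audio per text token (from the model card)

def max_model_len_for(seconds: int, margin: float = 1.1) -> int:
    """Token budget for `seconds` of audio, with ~10% headroom (a guess)."""
    return int(seconds * 1000 / TOKEN_MS * margin)

# A one-hour meeting: 3600 * 1000 // 80 = 45000 tokens before headroom.
assert 3600 * 1000 // 80 == 45000
print(max_model_len_for(3600))  # 49500 with the 10% margin
```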

Client usage

  • Always set temperature=0.0.
  • Use WebSockets against /v1/realtime for streaming audio sessions.
  • Adjust the transcription delay by editing the transcription_delay_ms field in the model's tekken.json to any multiple of 80ms in [80, 1200], or to 2400.
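
As a sketch of the last point, assuming transcription_delay_ms sits at the top level of tekken.json (grep your local copy if it is nested elsewhere), the edit can be scripted:

```python
import json
from pathlib import Path

# Valid delays: every multiple of 80ms in [80, 1200], plus 2400 as a standalone value.
VALID_DELAYS_MS = set(range(80, 1201, 80)) | {2400}

def set_transcription_delay(tekken_path: str, delay_ms: int) -> None:
    """Rewrite transcription_delay_ms in the model's tekken.json."""
    if delay_ms not in VALID_DELAYS_MS:
        raise ValueError(f"{delay_ms}ms is not a multiple of 80 in [80, 1200] or 2400")
    path = Path(tekken_path)
    cfg = json.loads(path.read_text())
    cfg["transcription_delay_ms"] = delay_ms  # assumed top-level; verify in your file
    path.write_text(json.dumps(cfg, indent=2))
```

Restart the server after the edit so the new delay takes effect.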

Stream an audio file

See vLLM's Realtime audio file client example.
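
For orientation, a hand-rolled client might look like the sketch below. It assumes the third-party websockets package, 16 kHz mono PCM16 input, and OpenAI-Realtime-style input_audio_buffer.append events carrying base64 audio; the event names and audio format are assumptions, so treat vLLM's linked example as authoritative.

```python
import asyncio
import base64
import json
import wave

def pcm_chunks(wav_path: str, chunk_ms: int = 80):
    """Yield base64-encoded chunks of `chunk_ms` milliseconds of raw PCM."""
    with wave.open(wav_path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * chunk_ms // 1000
        while data := wav.readframes(frames_per_chunk):
            yield base64.b64encode(data).decode("ascii")

async def stream(wav_path: str, url: str = "ws://localhost:8000/v1/realtime"):
    import websockets  # pip install websockets
    async with websockets.connect(url) as ws:
        for b64 in pcm_chunks(wav_path):
            # Event schema assumed to follow the OpenAI Realtime convention.
            await ws.send(json.dumps(
                {"type": "input_audio_buffer.append", "audio": b64}))
        async for message in ws:  # print transcription events as they arrive
            print(json.loads(message))
```

Run it with asyncio.run(stream("path/to/audio.wav")) while the server from the launch command above is up.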

Live microphone (Gradio demo)

See vLLM's Realtime microphone client example for an end-to-end live-transcription UI.

Benchmarks (Fleurs, average WER)

| Delay  | Avg    | English | French | Chinese | Japanese |
|--------|--------|---------|--------|---------|----------|
| 160ms  | 12.60% | 6.46%   | 9.75%  | 17.67%  | 19.17%   |
| 240ms  | 10.80% | 5.91%   | 8.00%  | 13.84%  | 15.17%   |
| 480ms  | 8.72%  | 4.90%   | 6.42%  | 10.45%  | 9.59%    |
| 960ms  | 7.70%  | 4.34%   | 5.68%  | 8.99%   | 6.80%    |
| 2400ms | 6.73%  | 4.05%   | 5.23%  | 8.48%   | 5.50%    |

At 480ms, Voxtral Mini Realtime matches Mistral's offline Voxtral Transcribe 2.0 on long-form and short-form English benchmarks (within 1 WER point on datasets such as TEDLIUM, Meanwhile, and AMI IHM).

Troubleshooting

  • Route: /v1/realtime not registered — your vLLM is < 0.16.0. Upgrade to 0.20.0+.
  • Tokenizer initialization error — you forgot --tokenizer-mode mistral. Voxtral Realtime's tokenizer can only load through mistral_common.
  • mistral_common import error / wrong version — install with the audio extras: pip install -U "mistral-common[audio]>=1.9.0".
  • Transformers v4 warning spam — upgrade to Transformers v5 (uv pip install -U transformers), tracked in vllm-project/vllm#34642.
  • Hangs / crashes on long sessions — known upstream issue, see #39996 (encoder KV cache eviction) and #38233 (multi-session encode). Restart sessions periodically as a workaround.
