mistralai/Voxtral-Mini-4B-Realtime-2602
Multilingual realtime speech transcription (13 languages) with a natively streaming causal audio encoder; configurable 80ms–2.4s transcription delay served via vLLM's Realtime API
Matches offline open-source ASR accuracy at 480ms delay; >12.5 tok/s on a single 16GB GPU
Overview
Voxtral Mini 4B Realtime is Mistral AI's multilingual realtime speech transcription model, and among the first open-source ASR systems to reach accuracy comparable to offline models at under 500 ms of end-to-end delay.
- Architecture: ≈3.4B Mistral text LM + ≈970M custom causal audio encoder, both using sliding-window attention to support "infinite" streaming.
- Languages: 13 (English, French, Spanish, German, Russian, Chinese, Japanese, Italian, Portuguese, Dutch, Arabic, Hindi, Korean).
- Configurable delay: any multiple of 80ms between 80ms and 1200ms, plus 2400ms as a standalone value. Default is 480ms, the sweet spot Mistral identified between latency and accuracy.
- Context: 131072 tokens (≈3h of audio at 80ms/token).
Prerequisites
- Hardware: a single GPU with ≥ 16 GB VRAM (BF16 weights only).
- vLLM: >= 0.20.0 — the Voxtral Realtime architecture has been registered since v0.16.0, but v0.20.0 is the first stable release with the architecture documented in the supported-models list.
Install vLLM and audio dependencies
```
uv venv && source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
uv pip install -U "mistral-common[audio]>=1.9.0" transformers
```
Verify the audio extras pulled mistral_common >= 1.9.0:
```
python -c "import mistral_common; print(mistral_common.__version__)"
```
Launch command
```
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --tokenizer-mode mistral \
  --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
```
`--tokenizer-mode mistral` is required: Voxtral Realtime's tokenizer only loads through `mistral_common`, and omitting the flag raises a tokenizer initialization error at startup.
Once it starts you should see the Realtime API route registered:
```
Route: /v1/realtime, Endpoint: realtime_endpoint
```
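Clients talk to this route over a WebSocket by exchanging JSON events. As a rough sketch of the sending side (the `input_audio_buffer.append` event name and base64 `audio` field are assumptions borrowed from OpenAI's Realtime schema, and 16 kHz mono PCM16 is an assumed audio format; confirm both against the client examples below):

```python
import base64
import json

def audio_append_event(pcm_bytes: bytes) -> str:
    """Wrap a chunk of raw PCM16 audio in a Realtime-style append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",  # assumed event name
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

# 80 ms at 16 kHz mono PCM16 = 0.08 s * 16000 samples/s * 2 bytes = 2560 bytes
chunk = bytes(2560)
event = audio_append_event(chunk)
```

Each such event would be sent as one text frame on the WebSocket; transcription deltas arrive as server events on the same socket.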
Tuning flags
- `--max-num-batched-tokens` — balances throughput vs. per-request latency (higher values favor throughput).
- `--max-model-len` — defaults to 131072 (≈3 h of audio). Reduce it if you know your sessions are shorter; this cuts the memory reserved for pre-computed RoPE frequencies. As a rule of thumb, one text token ≈ 80 ms of audio, so a 1 h meeting needs `--max-model-len >= 3600/0.08 = 45000`.
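The `--max-model-len` rule of thumb is easy to mis-multiply; a small helper makes it explicit (a sketch: the 80 ms-per-token ratio comes from the model card, the helper name is ours):

```python
import math

MS_PER_TOKEN = 80  # one text token covers ~80 ms of audio

def required_max_model_len(session_seconds: float) -> int:
    """Minimum --max-model-len for a session of the given audio duration."""
    return math.ceil(session_seconds * 1000 / MS_PER_TOKEN)

required_max_model_len(3600)  # 1 h meeting -> 45000 tokens
required_max_model_len(600)   # 10 min session -> 7500 tokens
```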
Client usage
Recommended settings
- Always set `temperature=0.0`.
- Use WebSockets against `/v1/realtime` for streaming audio sessions.
- Adjust the transcription delay by editing the `transcription_delay_ms` field in the model's `tekken.json` to any multiple of 80 ms in `[80, 1200]`, or to `2400`.
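An invalid `transcription_delay_ms` value is easy to write by hand, so a small edit script can validate before saving (a sketch: only the field name and its legal values come from this guide; that the field sits at the top level of `tekken.json` is our assumption):

```python
import json

# Multiples of 80 ms in [80, 1200], plus the standalone 2400 ms value
VALID_DELAYS = set(range(80, 1201, 80)) | {2400}

def set_transcription_delay(tekken_path: str, delay_ms: int) -> None:
    """Rewrite transcription_delay_ms in tekken.json, validating first."""
    if delay_ms not in VALID_DELAYS:
        raise ValueError(
            f"delay must be a multiple of 80 in [80, 1200] or 2400, got {delay_ms}"
        )
    with open(tekken_path) as f:
        config = json.load(f)
    config["transcription_delay_ms"] = delay_ms  # top-level location assumed
    with open(tekken_path, "w") as f:
        json.dump(config, f, indent=2)
```

Since tokenizer files are read at startup, restart the server after editing.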
Stream an audio file
See vLLM's Realtime audio file client example.
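Independent of the full client, the file-side preparation is just slicing the audio into delay-sized frames. A stdlib-only sketch (16 kHz mono PCM16 WAV input and 80 ms frames are assumptions; the real client example also handles resampling and the WebSocket protocol):

```python
import wave

CHUNK_MS = 80  # one frame per 80 ms, matching the model's token granularity

def pcm_chunks(path: str):
    """Yield successive CHUNK_MS-sized frames of raw PCM from a mono WAV file."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * CHUNK_MS // 1000
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data  # each chunk becomes one message on the WebSocket
```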
Live microphone (Gradio demo)
See vLLM's Realtime microphone client example for an end-to-end live-transcription UI.
Benchmarks (Fleurs, average WER)
| Delay | AVG | English | French | Chinese | Japanese |
|---|---|---|---|---|---|
| 160ms | 12.60% | 6.46% | 9.75% | 17.67% | 19.17% |
| 240ms | 10.80% | 5.91% | 8.00% | 13.84% | 15.17% |
| 480ms | 8.72% | 4.90% | 6.42% | 10.45% | 9.59% |
| 960ms | 7.70% | 4.34% | 5.68% | 8.99% | 6.80% |
| 2400ms | 6.73% | 4.05% | 5.23% | 8.48% | 5.50% |
At the default 480 ms delay, Voxtral Mini Realtime matches Mistral's offline Voxtral Transcribe 2.0 on long-form and short-form English benchmarks (within 1 WER point on TEDLIUM, Meanwhile, AMI IHM, etc.).
Troubleshooting
- `Route: /v1/realtime` not registered — your vLLM is older than 0.16.0; upgrade to 0.20.0+.
- Tokenizer initialization error — you forgot `--tokenizer-mode mistral`; Voxtral Realtime's tokenizer can only load through `mistral_common`.
- `mistral_common` import error / wrong version — install with the audio extras: `pip install -U "mistral-common[audio]>=1.9.0"`.
- Transformers v4 warning spam — upgrade to Transformers v5 (`uv pip install -U transformers`); tracked in vllm-project/vllm#34642.
- Hangs / crashes on long sessions — known upstream issue; see #39996 (encoder KV cache eviction) and #38233 (multi-session encode). Restart sessions periodically as a workaround.