mistralai/Voxtral-4B-TTS-2603
Mistral's 4B text-to-speech model served via vLLM-Omni with built-in voice presets, exposed through the OpenAI /v1/audio/speech API (24 kHz mono).
Guide
Overview
Voxtral-4B-TTS-2603 is Mistral AI's 4B text-to-speech model, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates 24 kHz mono audio from text using model-provided voice presets.
Prerequisites
- Hardware: a single CUDA GPU with ≥ 24 GB VRAM. Reference run on 1× RTX 4090 24 GB: Stage 0 (audio_generation) ~18.95 GiB, Stage 1 (audio_tokenizer) ~1.55 GiB, server-startup peak ~20.5 GiB (~85%). Both stages share GPU 0 via the bundled deploy config.
- vLLM-Omni targeting vLLM >= 0.20, with mistral_common >= 1.10.0.
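As a quick sanity check on the memory figures above, the arithmetic can be reproduced in a few lines. This is only a sketch: the stage sizes are copied from the reference run, and treating the card's 24 GB as 24 GiB is an assumption made for the percentage.

```python
# Figures copied from the reference run above (1x RTX 4090, 24 GB).
STAGE0_GIB = 18.95  # Stage 0: audio_generation
STAGE1_GIB = 1.55   # Stage 1: audio_tokenizer
CARD_GIB = 24.0     # assumption: the card's 24 GB is treated as 24 GiB


def startup_peak(stage_sizes_gib):
    """Sum per-stage memory when all stages share one GPU."""
    return sum(stage_sizes_gib)


peak = startup_peak([STAGE0_GIB, STAGE1_GIB])
share = peak / CARD_GIB
print(f"peak ~{peak:.2f} GiB, {share:.0%} of a {CARD_GIB:.0f} GiB card")
# prints: peak ~20.50 GiB, 85% of a 24 GiB card
```

Roughly 3.5 GiB remains free at startup on this card, so it is worth keeping other processes off the GPU.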
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
uv pip install -U "mistral_common>=1.10.0"
Launch the server
The deploy config (vllm_omni/deploy/voxtral_tts.yaml) is auto-discovered:
vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --port 8000
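Model loading can take a while, so a script that starts the server and then sends requests may want to wait until the port accepts connections. The helper below is a minimal sketch using only the standard library. The name wait_for_port and its defaults are hypothetical, and it assumes the server only starts listening on port 8000 once it is ready to serve.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout=600.0, interval=2.0):
    """Poll until a TCP connection to host:port succeeds or timeout elapses.

    Returns True once a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False


# Usage (after running the vllm serve command above in another shell):
#   if not wait_for_port():
#       raise SystemExit("server did not come up in time")
```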
Client usage
Voice preset (curl)
curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "input": "Hello, this is Voxtral TTS running with vLLM-Omni.",
    "voice": "casual_female",
    "language": "English",
    "response_format": "wav"
  }' --output voxtral.wav
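The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library. The helper names build_speech_request and synthesize are hypothetical; the URL, model name, and JSON fields are taken from the curl example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # host and port from the launch command


def build_speech_request(text, voice="casual_female", language="English",
                         response_format="wav", base_url=BASE_URL):
    """Build a POST request with the same JSON body as the curl example."""
    payload = {
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "input": text,
        "voice": voice,
        "language": language,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)  # number of bytes written


# Usage (requires the server to be running):
#   synthesize("Hello, this is Voxtral TTS running with vLLM-Omni.", "voxtral.wav")
```

Keeping request construction separate from sending makes it easy to inspect the payload without a running server.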
Streaming PCM
curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "input": "Hello, this is Voxtral TTS streaming PCM.",
    "voice": "casual_female",
    "language": "English",
    "stream": true,
    "response_format": "pcm"
  }' --output voxtral_stream.pcm
# Convert raw 24 kHz mono PCM to WAV
ffmpeg -f s16le -ar 24000 -ac 1 -i voxtral_stream.pcm voxtral_stream.wav -y
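Where ffmpeg is not available, the raw stream can be wrapped in a WAV container with Python's standard wave module. This is a sketch, not part of vLLM-Omni: the function names are hypothetical, and the format (signed 16-bit little-endian, 24 kHz, mono) is assumed from the ffmpeg flags above.

```python
import wave


def iter_chunks(stream, chunk_size=4096):
    """Yield successive byte chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def pcm_to_wav(pcm_bytes, wav_path, sample_rate=24000, channels=1, sample_width=2):
    """Write raw PCM bytes into a WAV file and return the duration in seconds.

    Defaults assume s16le, 24 kHz, mono, matching the ffmpeg command above.
    """
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample (2 = 16-bit)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return len(pcm_bytes) / (sample_rate * channels * sample_width)


# Usage with the file produced by the streaming curl call:
#   with open("voxtral_stream.pcm", "rb") as f:
#       pcm = b"".join(iter_chunks(f))
#   print(pcm_to_wav(pcm, "voxtral_stream.wav"), "seconds")
```

iter_chunks also works on an HTTP response object, so audio can be consumed incrementally as it arrives instead of waiting for the full body.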
Known limitations
- Output is mono 24 kHz.
- Voice cloning is gated upstream. With the public mistralai/Voxtral-4B-TTS-2603 checkpoint, supplying a ref_audio (or running the voice-clone path) fails with "RuntimeError: encode_waveforms requires encoder weights which are not available in the open-source checkpoint." The public weights omit the encoder needed to turn reference audio into conditioning features. Use the built-in voice presets instead.