mistralai/Voxtral-4B-TTS-2603

Mistral's 4B text-to-speech model served via vLLM-Omni with built-in voice presets, exposed through the OpenAI /v1/audio/speech API (24 kHz mono).

Overview

Voxtral-4B-TTS-2603 is Mistral AI's 4B text-to-speech model, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates 24 kHz mono audio from text using model-provided voice presets.

Prerequisites

  • Hardware: a single CUDA GPU with ≥ 24 GB VRAM. Reference run on 1× RTX 4090 (24 GB): Stage 0 (audio_generation) uses ~18.95 GiB and Stage 1 (audio_tokenizer) uses ~1.55 GiB, for a server-startup peak of ~20.5 GiB (~85% of the card). Both stages share GPU 0 via the bundled deploy config.
  • Software: vLLM-Omni targeting vLLM >= 0.20, with mistral_common >= 1.10.0.
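
The memory figures above can be re-derived with a few lines of arithmetic, which is handy when sizing a different card. This is a minimal sketch: vram_budget is a hypothetical helper, and it treats the card's 24 GB as 24 GiB, which is the assumption that reproduces the ~85% figure quoted above.

```python
def vram_budget(stage_gib, card_gib):
    """Sum per-stage memory and report utilisation of one GPU.

    stage_gib: mapping of stage name -> memory in GiB
    card_gib:  total card memory in GiB (assumed equal to the nominal 24 GB)
    """
    total = sum(stage_gib.values())
    return {
        "total_gib": total,
        "fraction": total / card_gib,
        "headroom_gib": card_gib - total,
    }


# Figures from the RTX 4090 reference run.
reference = {"audio_generation": 18.95, "audio_tokenizer": 1.55}
budget = vram_budget(reference, card_gib=24.0)

print(
    f"{budget['total_gib']:.2f} GiB used, "
    f"{budget['fraction']:.0%} of the card, "
    f"{budget['headroom_gib']:.2f} GiB headroom"
)
```

The remaining headroom is what is left for anything else on the same GPU, so a card smaller than 24 GB is unlikely to fit both stages with this deploy config.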

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
uv pip install -U "mistral_common>=1.10.0"

Launch the server

The bundled deploy config (vllm_omni/deploy/voxtral_tts.yaml) is auto-discovered, so the launch command needs no explicit config flag:

vllm serve mistralai/Voxtral-4B-TTS-2603 --omni --port 8000

Client usage

Voice preset (curl)

curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "input": "Hello, this is Voxtral TTS running with vLLM-Omni.",
    "voice": "casual_female",
    "language": "English",
    "response_format": "wav"
  }' --output voxtral.wav
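
The same request can be issued from Python using only the standard library. The sketch below mirrors the curl call field for field; build_speech_request and synthesize_to_file are illustrative helper names, not part of vLLM-Omni, and the base URL assumes the server was launched on port 8000 as shown earlier.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumes the --port 8000 launch command
MODEL = "mistralai/Voxtral-4B-TTS-2603"


def build_speech_request(text, voice="casual_female", language="English",
                         response_format="wav", base_url=BASE_URL):
    """Build (but do not send) the POST request used by the curl example."""
    payload = {
        "model": MODEL,
        "input": text,
        "voice": voice,
        "language": language,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path.

    urlopen raises urllib.error.HTTPError on a non-2xx response, so a
    server-side failure surfaces as an exception rather than a bad file.
    """
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=300) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

With the server running, calling synthesize_to_file("Hello, this is Voxtral TTS running with vLLM-Omni.", "voxtral.wav") should write the same audio the curl command produces. Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a GPU.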

Streaming PCM

curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "input": "Hello, this is Voxtral TTS streaming PCM.",
    "voice": "casual_female",
    "language": "English",
    "stream": true,
    "response_format": "pcm"
  }' --output voxtral_stream.pcm

# Convert raw 24 kHz mono PCM to WAV
ffmpeg -f s16le -ar 24000 -ac 1 -i voxtral_stream.pcm voxtral_stream.wav -y
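
If ffmpeg is not available, the raw stream can be wrapped in a WAV container with Python's standard wave module. This sketch assumes the same layout the ffmpeg flags above describe (signed 16-bit little-endian samples, 24 kHz, mono); pcm_to_wav_bytes and pcm_duration_seconds are hypothetical helper names.

```python
import io
import wave

SAMPLE_RATE = 24000   # Hz, as stated for this model's output
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample; assumes s16le like the ffmpeg command


def pcm_to_wav_bytes(pcm, sample_rate=SAMPLE_RATE, channels=CHANNELS,
                     sample_width=SAMPLE_WIDTH):
    """Prepend a WAV header to raw PCM bytes and return the WAV file contents."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(num_bytes, sample_rate=SAMPLE_RATE, channels=CHANNELS,
                         sample_width=SAMPLE_WIDTH):
    """Playback length implied by a raw PCM byte count."""
    return num_bytes / (sample_rate * channels * sample_width)


def convert_file(pcm_path, wav_path):
    """Read a raw .pcm file from disk and write the equivalent .wav file."""
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    with open(wav_path, "wb") as f:
        f.write(pcm_to_wav_bytes(pcm))
    return pcm_duration_seconds(len(pcm))
```

For example, convert_file("voxtral_stream.pcm", "voxtral_stream.wav") converts the streamed output and returns its duration in seconds. At these settings one second of audio is 48,000 bytes, which gives a quick sanity check on the size of a captured stream.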

Known limitations

  • Output is mono 24 kHz.
  • Voice cloning is gated upstream. With the public mistralai/Voxtral-4B-TTS-2603 checkpoint, supplying a ref_audio (or otherwise running the voice-clone path) fails with "RuntimeError: encode_waveforms requires encoder weights which are not available in the open-source checkpoint." The public weights omit the encoder needed to turn reference audio into conditioning features, so use the built-in voice presets instead.
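
Because the voice-clone failure only appears on the server, a client can reject such requests before sending them. The following is a minimal sketch of that idea, not something vLLM-Omni provides: check_preset_only is a hypothetical helper, and it only inspects the ref_audio and voice fields named in this guide.

```python
def check_preset_only(payload):
    """Return the payload unchanged if it uses a voice preset, else raise.

    Guards against requests that would hit the voice-clone path, which the
    public checkpoint cannot serve.
    """
    if "ref_audio" in payload:
        raise ValueError(
            "ref_audio is not supported by the public checkpoint; "
            "use a built-in voice preset instead"
        )
    if not payload.get("voice"):
        raise ValueError("a voice preset (for example 'casual_female') is required")
    return payload


# A preset-based payload passes through untouched.
ok = check_preset_only({
    "model": "mistralai/Voxtral-4B-TTS-2603",
    "input": "Hello",
    "voice": "casual_female",
})
```

Failing fast on the client gives a clearer message than the server-side RuntimeError.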
