Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Text-to-speech served via vLLM-Omni with predefined speaker voices and optional style/emotion control, exposed through the OpenAI /v1/audio/speech API. Sibling VoiceDesign and Base (voice-clone) checkpoints share the same serving path.
Guide
Overview
Qwen3-TTS is a
text-to-speech family served through vLLM-Omni with the OpenAI-compatible
/v1/audio/speech API. Three task types are each backed by a dedicated
checkpoint:
| Task type | Model | Use case |
|---|---|---|
| CustomVoice | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | Predefined speaker voices + optional style/emotion |
| VoiceDesign | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign | Speech from a natural-language voice description |
| Base | Qwen/Qwen3-TTS-12Hz-1.7B-Base | Voice cloning from reference audio + transcript |
Smaller 0.6B variants are also published for CustomVoice and Base. The server serves one checkpoint at a time — switch task type by restarting with the matching model.
Prerequisites
- Hardware: a single CUDA GPU (the 1.7B talker + codec decoder share one
  device; the bundled deploy config requests
  gpu_memory_utilization: 0.3 per stage).
- vLLM-Omni targeting vLLM >= 0.20. See the installation guide.
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
The two-stage deploy config (vllm_omni/deploy/qwen3_tts.yaml, with async chunking
enabled for low first-audio latency) is auto-discovered from model_type=qwen3_tts:
# CustomVoice (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8000
# VoiceDesign (natural-language voice description)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --omni --port 8000
# Base (voice cloning from reference audio)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --port 8000
vllm-omni serve ... is an equivalent alias for vllm serve ... --omni.
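Because each task type maps to exactly one checkpoint and one server process, a launcher script can pick the model ID from the task type. The following is a minimal sketch of that mapping; `CHECKPOINTS`, `checkpoint_for`, and `serve_command` are hypothetical names introduced here for illustration, and only the 1.7B model IDs and flags shown in the commands above are used.

```python
# Task type -> checkpoint, copied from the table in the Overview (1.7B only).
CHECKPOINTS = {
    "CustomVoice": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "VoiceDesign": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Base": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
}


def checkpoint_for(task_type: str) -> str:
    """Return the model ID for a task type (hypothetical helper)."""
    try:
        return CHECKPOINTS[task_type]
    except KeyError:
        expected = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(
            f"unknown task type {task_type!r}; expected one of: {expected}"
        ) from None


def serve_command(task_type: str, port: int = 8000) -> list[str]:
    """Build the argv for the launch commands shown above."""
    return ["vllm", "serve", checkpoint_for(task_type), "--omni", "--port", str(port)]


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(" ".join(serve_command("CustomVoice")))
```

Passing the returned list to `subprocess.run` would start the server; switching task type still means stopping that process and starting a new one.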
Client usage
Predefined voice (curl)
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "input": "Hello, how are you?", "voice": "vivian", "language": "English"}' \
--output output.wav
Emotion / style instruction
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "input": "I am so excited!", "voice": "vivian", "instructions": "Speak with great enthusiasm"}' \
--output excited.wav
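The two curl requests above can also be sent from Python. This is a minimal sketch using only the standard library; it assumes the server from the launch section is listening on localhost:8000, and `build_speech_request` and `synthesize` are hypothetical helper names, not part of vLLM-Omni. The JSON fields are exactly those used in the curl examples.

```python
import json
import urllib.request
from typing import Optional

MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"


def build_speech_request(
    text: str,
    voice: str,
    language: Optional[str] = None,
    instructions: Optional[str] = None,
    base_url: str = "http://localhost:8000",  # assumed server address
) -> urllib.request.Request:
    """Build a POST to /v1/audio/speech with the fields from the curl examples."""
    payload = {"model": MODEL, "input": text, "voice": voice}
    if language is not None:
        payload["language"] = language
    if instructions is not None:
        payload["instructions"] = instructions  # optional style/emotion control
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    req = build_speech_request(
        "I am so excited!", "vivian", instructions="Speak with great enthusiasm"
    )
    synthesize(req, "excited.wav")
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent.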
List available voices
curl http://localhost:8000/v1/audio/voices
Streaming PCM (low latency)
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "input": "Hello, how are you?", "voice": "vivian", "language": "English", "stream": true, "response_format": "pcm"}' \
--no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -
Streaming requires stream: true with response_format: "pcm" (24 kHz mono).
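Raw PCM has no header, so most players cannot open a saved stream directly. The sketch below wraps captured PCM bytes in a WAV container with Python's standard `wave` module. It assumes 16-bit signed little-endian samples, which is inferred from the `play -e signed -b 16` flags above rather than stated by the server documentation; `pcm_to_wav` and `duration_seconds` are hypothetical helper names.

```python
import io
import wave

SAMPLE_RATE = 24_000  # Hz, per the streaming note above
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample; assumes 16-bit signed little-endian


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM bytes in a WAV container and return the file contents."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes) -> float:
    """Playback length of a PCM buffer at the assumed format."""
    return len(pcm) / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


if __name__ == "__main__":
    # One second of silence stands in for bytes captured from the stream.
    silence = b"\x00\x00" * SAMPLE_RATE
    with open("stream.wav", "wb") as out:
        out.write(pcm_to_wav(silence))
    print(duration_seconds(silence))
```

In a real client the `pcm` argument would be the concatenation of the chunks read from the streaming response.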
Python client and offline inference
# Online client (CustomVoice / VoiceDesign / Base)
python examples/online_serving/text_to_speech/qwen3_tts/openai_speech_client.py \
--text "Hello, how are you?" --speaker vivian --language English
# Offline (no server)
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
--query-type CustomVoice [--streaming]
Known limitations
- Each task type requires its matching checkpoint — a CustomVoice model cannot serve a Base (voice-clone) request.
- One model variant is served per process; switching task type means restarting.