Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Text-to-speech served via vLLM-Omni with predefined speaker voices and optional style/emotion control, exposed through the OpenAI /v1/audio/speech API. Sibling VoiceDesign and Base (voice-clone) checkpoints share the same serving path.

dense · 1.7B · 0 ctx · vLLM 0.20.0+ · vLLM-Omni · nightly · omni

Guide

Overview

Qwen3-TTS is a text-to-speech family served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. Three task types are each backed by a dedicated checkpoint:

Task type	Model	Use case
CustomVoice	`Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`	Predefined speaker voices + optional style/emotion
VoiceDesign	`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`	Speech from a natural-language voice description
Base	`Qwen/Qwen3-TTS-12Hz-1.7B-Base`	Voice cloning from reference audio + transcript

Smaller 0.6B variants are also published for CustomVoice and Base. Each server process serves one checkpoint at a time, so switching task type means restarting with the matching model.

Prerequisites

Hardware: a single CUDA GPU (the 1.7B talker + codec decoder share one device; the bundled deploy config requests gpu_memory_utilization: 0.3 per stage).
Software: vLLM-Omni targeting vLLM >= 0.20 (see the installation guide).

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

vLLM-Omni auto-discovers the two-stage deploy config (vllm_omni/deploy/qwen3_tts.yaml, with async chunking enabled for low first-audio latency) from model_type=qwen3_tts:

# CustomVoice (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --omni --port 8000

# VoiceDesign (natural-language voice description)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --omni --port 8000

# Base (voice cloning from reference audio)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni --port 8000

vllm-omni serve ... is an equivalent alias for vllm serve ... --omni.
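
Because each process serves a single checkpoint, a launcher script may want to derive the command line from the task type. The following is a minimal sketch, not part of vLLM-Omni: the `CHECKPOINTS` table and the `serve_command` helper are hypothetical names, and the command simply mirrors the `vllm serve ... --omni --port 8000` lines above.

```python
import shlex

# Task type -> 1.7B checkpoint, copied from the Overview table.
CHECKPOINTS = {
    "CustomVoice": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "VoiceDesign": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Base": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
}


def serve_command(task_type: str, port: int = 8000) -> list[str]:
    """Build the argv that launches the server for one task type.

    Hypothetical helper: it only assembles the documented command,
    it does not start or supervise the process.
    """
    if task_type not in CHECKPOINTS:
        raise ValueError(
            f"unknown task type {task_type!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return ["vllm", "serve", CHECKPOINTS[task_type], "--omni", "--port", str(port)]


if __name__ == "__main__":
    # Print a copy-pasteable command for the VoiceDesign checkpoint.
    print(shlex.join(serve_command("VoiceDesign")))
```

The returned list can be passed directly to `subprocess.run` or printed with `shlex.join`, which keeps quoting correct if a local model path with spaces is substituted.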

Client usage

Predefined voice (curl)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "input": "Hello, how are you?", "voice": "vivian", "language": "English"}' \
  --output output.wav

Emotion / style instruction

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "input": "I am so excited!", "voice": "vivian", "instructions": "Speak with great enthusiasm"}' \
  --output excited.wav
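
The two curl requests above differ only in their optional fields (`language` in the first, `instructions` in the second). As a rough Python equivalent using only the standard library, the sketch below builds the same JSON body; `build_speech_request` and `synthesize` are hypothetical helper names, and the defaults assume the server launched earlier on localhost:8000.

```python
import json
import urllib.request

DEFAULT_MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"


def build_speech_request(text, voice="vivian", language=None, instructions=None,
                         base_url="http://localhost:8000", model=DEFAULT_MODEL):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {"model": model, "input": text, "voice": voice}
    # Optional fields are added only when given, as in the curl examples.
    if language is not None:
        payload["language"] = language
    if instructions is not None:
        payload["instructions"] = instructions
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
    return out_path
```

With a server running, `synthesize("I am so excited!", "excited.wav", instructions="Speak with great enthusiasm")` would correspond to the second curl call.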

List available voices

curl http://localhost:8000/v1/audio/voices

Streaming PCM (low latency)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "input": "Hello, how are you?", "voice": "vivian", "language": "English", "stream": true, "response_format": "pcm"}' \
  --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -

Streaming requires both stream: true and response_format: "pcm". The stream is raw 24 kHz mono PCM; the play command above (from SoX) reads it as signed 16-bit samples.
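
A raw PCM stream has no header, so saving it to disk requires wrapping it in a container. The sketch below is one possible way to do that with Python's standard `wave` module. It assumes the sample format implied by the `play` invocation above (signed 16-bit, 24 kHz, mono) and additionally assumes little-endian byte order; `stream_pcm` and `pcm_chunks_to_wav` are hypothetical helper names.

```python
import json
import urllib.request
import wave

SAMPLE_RATE = 24_000  # Hz, per the streaming note above
SAMPLE_WIDTH = 2      # bytes per sample (signed 16-bit, as passed to `play`)
CHANNELS = 1          # mono


def stream_pcm(text, voice="vivian", language="English",
               base_url="http://localhost:8000",
               model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", chunk_size=4096):
    """Yield raw PCM chunks from a streaming /v1/audio/speech request."""
    payload = {"model": model, "input": text, "voice": voice, "language": language,
               "stream": True, "response_format": "pcm"}
    request = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        while chunk := response.read(chunk_size):
            yield chunk


def pcm_chunks_to_wav(chunks, out):
    """Write raw PCM chunks to `out` (path or binary file object) as WAV.

    Returns the number of audio frames written. Assumes little-endian
    samples, which is what the WAV format stores.
    """
    total_bytes = 0
    with wave.open(out, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframesraw(chunk)
            total_bytes += len(chunk)
    return total_bytes // (SAMPLE_WIDTH * CHANNELS)
```

With a server running, `pcm_chunks_to_wav(stream_pcm("Hello, how are you?"), "streamed.wav")` would save the stream, and dividing the returned frame count by `SAMPLE_RATE` gives the clip duration in seconds.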

Python client and offline inference

# Online client (CustomVoice / VoiceDesign / Base)
python examples/online_serving/text_to_speech/qwen3_tts/openai_speech_client.py \
  --text "Hello, how are you?" --speaker vivian --language English

# Offline (no server); append the optional --streaming flag if desired
python examples/offline_inference/text_to_speech/qwen3_tts/end2end.py \
  --query-type CustomVoice

Known limitations

Each task type requires its matching checkpoint — a CustomVoice model cannot serve a Base (voice-clone) request.
One model variant is served per process; switching task type means restarting.
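
A client can catch a checkpoint mismatch before sending a request. The sketch below is a hypothetical client-side guard, not something the server provides. It relies on the naming convention visible in the checkpoint ids on this page, where the final hyphen-separated segment is the task type, and it assumes the 0.6B variants follow the same pattern.

```python
TASK_TYPES = {"CustomVoice", "VoiceDesign", "Base"}


def task_type_of(model_id: str) -> str:
    """Return the task type encoded as the last '-' segment of a model id."""
    suffix = model_id.rsplit("-", 1)[-1]
    if suffix not in TASK_TYPES:
        raise ValueError(f"cannot infer task type from model id {model_id!r}")
    return suffix


def check_request(served_model: str, wanted_task: str) -> None:
    """Raise if the served checkpoint cannot handle the wanted task type."""
    actual = task_type_of(served_model)
    if actual != wanted_task:
        raise RuntimeError(
            f"{served_model} serves {actual} requests; "
            f"restart the server with a {wanted_task} checkpoint"
        )
```

For example, calling `check_request("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", "Base")` raises a `RuntimeError` locally instead of sending a voice-clone request the server cannot fulfil.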