zai-org/GLM-TTS
Z-AI's two-stage (AR + DiT flow-matching) zero-shot voice-cloning TTS for Chinese and English, served via vLLM-Omni through the OpenAI /v1/audio/speech API. Every request is conditioned on reference audio + its transcript.
Guide
Overview
GLM-TTS is Z-AI's two-stage
text-to-speech system — an LLM that generates speech tokens (AR) followed by a
Flow-Matching DiT that produces mel-spectrograms, vocoded by HiFT — served
through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It
targets Chinese and English, and every request is conditioned on
reference audio + its transcript (zero-shot voice cloning).
Prerequisites
- Hardware: a single CUDA GPU. Reference run: 1x A40 48 GB — ~18-20 GB total (AR ~10 GB + DiT ~8 GB), ~16.6 GiB peak; fits 24 GB cards. Both stages share GPU 0 by default; split them across GPUs for higher concurrency.
- vLLM-Omni targeting vLLM >= 0.22.
- --trust-remote-code is required for the GLM-TTS phoneme tokenizer.
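The hardware figures above mix decimal GB and binary GiB, so a quick conversion helps when sizing a card. The sketch below is only arithmetic on the numbers quoted in this guide; the helper names are illustrative, and treating a "24 GB" card as 24 decimal GB is a deliberately conservative assumption.

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one (decimal) gigabyte


def gib_to_gb(gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return gib * GIB / GB


def headroom_gb(card_gb: float, peak_gib: float) -> float:
    """Memory left on a card of `card_gb` GB after a peak of `peak_gib` GiB."""
    return card_gb - gib_to_gb(peak_gib)


peak_gb = gib_to_gb(16.6)            # the ~16.6 GiB peak from the reference run
spare_gb = headroom_gb(24.0, 16.6)   # what that leaves on a 24 GB card
print(f"peak = {peak_gb:.2f} GB, headroom on 24 GB = {spare_gb:.2f} GB")
```

The headroom is what remains for anything else on the same GPU; the reference run does not say how memory grows with concurrency, so measure before relying on it.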
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
The deploy config (vllm_omni/deploy/glm_tts.yaml) is auto-discovered. Async
chunking (streaming) is on by default:
vllm serve zai-org/GLM-TTS --omni --trust-remote-code --port 8000
For the synchronous (non-streaming) path:
vllm serve zai-org/GLM-TTS --omni --trust-remote-code --port 8000 --no-async-chunk
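Model loading can take a while after either launch command, so a script may want to wait for the server before sending requests. The following is a minimal sketch using only the standard library. It assumes the server exposes the GET /health endpoint that vLLM's OpenAI-compatible server provides; this guide does not confirm that vLLM-Omni keeps it, so treat the path as an assumption. The function name and defaults are illustrative.

```python
import time
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll `base_url`/health until it answers 200 or `timeout_s` elapses.

    ASSUMPTION: /health is served, as in vLLM's OpenAI-compatible server.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or an HTTP error: not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

For the commands above, `wait_until_ready("http://localhost:8000")` would return True once the server answers, or False after five minutes.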
Client usage
Zero-shot voice clone (curl)
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "zai-org/GLM-TTS",
"input": "你好,这是一个语音合成测试。",
"response_format": "wav",
"ref_audio": "file:///path/to/ref.wav",
"ref_text": "这是参考音频的文本内容。"
}' --output test.wav
Python client
python examples/online_serving/text_to_speech/glm_tts/openai_speech_client.py \
--text "你好,这是语音克隆测试。" \
--ref-audio file:///path/to/ref.wav \
--ref-text "这是参考音频的文本内容。"
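For callers who prefer not to depend on the example script, the same request as the curl call can be sketched with the standard library. The JSON field names (model, input, response_format, ref_audio, ref_text) are taken from the curl example above; the function names, the timeout, and the up-front check that both reference fields are present are illustrative choices, not part of the project's client.

```python
import json
import urllib.request


def build_payload(text: str, ref_audio: str, ref_text: str,
                  model: str = "zai-org/GLM-TTS",
                  response_format: str = "wav") -> dict:
    """Build the request body, mirroring the fields of the curl example."""
    if not ref_audio or not ref_text:
        # Voice cloning needs the reference clip and its transcript together.
        raise ValueError("ref_audio and ref_text must be provided together")
    return {
        "model": model,
        "input": text,
        "response_format": response_format,
        "ref_audio": ref_audio,
        "ref_text": ref_text,
    }


def synthesize(base_url: str, payload: dict, out_path: str,
               timeout_s: float = 120.0) -> None:
    """POST the payload to /v1/audio/speech and save the returned audio."""
    request = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        audio = resp.read()  # reads until the server closes the response
    with open(out_path, "wb") as f:
        f.write(audio)
```

With a running server, `synthesize("http://localhost:8000", build_payload("你好,这是语音克隆测试。", "file:///path/to/ref.wav", "这是参考音频的文本内容。"), "test.wav")` would write the response to test.wav.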
Known limitations
- Output is 24 kHz mono WAV via the HiFT vocoder (Vocos2D 32 kHz fallback with resampling).
- Voice cloning requires ref_audio and ref_text together; the reference clip should be 3-10 seconds. Feature extraction (WhisperVQ tokenizer, CampPlus ONNX, mel) runs model-side.
- The first request may be slow due to lazy model loading (WhisperVQ, CampPlus ONNX). Warm-cache RTF is ~0.6-0.7x on an A40.
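Two of the numbers above are easy to check locally: the 3-10 second reference length and the real-time factor. The sketch below assumes the usual definition of RTF as synthesis time divided by the duration of the generated audio, so 0.6-0.7x would mean roughly 6-7 seconds of compute for 10 seconds of speech; this guide does not spell out its definition. It also assumes uncompressed PCM WAV files, which Python's wave module can read. The helper names are illustrative.

```python
import wave


def wav_duration_s(path: str) -> float:
    """Duration of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def reference_length_ok(duration_s: float, lo: float = 3.0,
                        hi: float = 10.0) -> bool:
    """True if a reference clip falls in the recommended 3-10 second range."""
    return lo <= duration_s <= hi


def real_time_factor(synthesis_s: float, audio_s: float) -> float:
    """RTF = wall-clock synthesis time / generated audio duration (assumed)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_s / audio_s
```

To measure RTF, time a warm request with time.perf_counter() and divide by wav_duration_s of the saved output. A WAV produced over a streaming response may carry a placeholder length in its header, so the --no-async-chunk path is the safer one for this measurement.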