OpenMOSS-Team/MOSS-TTS-Realtime
OpenMOSS's 1.7B real-time streaming TTS for low-latency voice agents (TTFB ~180 ms), with multi-turn, context-aware incremental synthesis, served via vLLM-Omni through the OpenAI-compatible /v1/audio/speech API (24 kHz mono output).
Guide
Overview
MOSS-TTS-Realtime is
the 1.7B real-time streaming member of OpenMOSS's MOSS-TTS Family, served
through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It is
a multi-turn, context-aware model for low-latency voice agents, using a
local depth transformer (no delay warm-up) for incremental synthesis. TTFB
(time to first byte) is ~180 ms, and the combined latency of the LLM's first
sentence plus TTS TTFB is ~377 ms. Output is 24 kHz mono.
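The TTFB figure can be roughly cross-checked from the client side once a server is running (see "Launch the server"). The sketch below uses curl's `time_starttransfer` write-out variable. The helper name `measure_ttfb` and the `payload.json` file are hypothetical, and the measured value includes HTTP and network overhead, so it will not necessarily match the ~180 ms quoted above.

```shell
# Print the seconds elapsed between sending a request and receiving the first
# response byte. $1 is the full endpoint URL, $2 is passed to curl's -d option
# (either a JSON string or @file). The name measure_ttfb is hypothetical.
measure_ttfb() {
  curl -s -o /dev/null -w '%{time_starttransfer}\n' \
    -X POST "$1" \
    -H "Content-Type: application/json" \
    -d "$2"
}

# Usage (needs a running server; payload.json is a hypothetical file holding
# the same JSON body as the request under "Client usage"):
# measure_ttfb http://localhost:8000/v1/audio/speech @payload.json
```

For a WAV response the first byte may be header data rather than audio, so this is best read as an upper-level sanity check and not a precise audio-latency measurement.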
Prerequisites
- Hardware: a single CUDA GPU. Reference run: 1x A10G (24 GB), with ~6 GB for the 1.7B talker plus ~8 GB for the codec decoder.
- vLLM-Omni targeting vLLM >= 0.22.
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
MOSS-TTS-Realtime uses model_type=moss_tts_realtime, so its deploy config
(vllm_omni/deploy/moss_tts_realtime.yaml, which sets codec_chunk_frames: 15
for low time to first audio, TTFA) is auto-discovered:
vllm serve OpenMOSS-Team/MOSS-TTS-Realtime --omni --port 8000
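Model loading can take a while, so a client script may want to wait until the server answers before sending requests. The following is a minimal sketch: the helper name `wait_for_server` is hypothetical, and polling `/v1/models` assumes vLLM-Omni exposes the same model-listing route as vLLM's OpenAI-compatible server.

```shell
# Poll a URL once per second until it returns a successful HTTP status,
# or give up after a timeout (default 300 s).
# wait_for_server is a hypothetical helper, not part of vLLM-Omni.
wait_for_server() {
  url="$1"
  timeout="${2:-300}"
  waited=0
  until curl -sf -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "server ready"
}

# Usage (assumes the /v1/models route is available):
# wait_for_server http://localhost:8000/v1/models 600
```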
Client usage
Streaming voice clone (curl)
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTS-Realtime",
    "input": "This is a low-latency streaming TTS test.",
    "voice": "default",
    "ref_audio": "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS/main/assets/audio/zh_1.wav",
    "response_format": "wav",
    "stream": true
  }' --output output.wav
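To confirm that the saved file matches the documented 24 kHz mono format, the channel count and sample rate can be read straight from the WAV header. This is a sketch under two assumptions: the file uses the canonical layout in which the "fmt " chunk starts at byte 12, and the host is little-endian (as `od` decodes integers in host byte order). The helper name `wav_info` is hypothetical.

```shell
# Print channel count and sample rate from a canonical RIFF/WAVE header.
# Bytes 22-23 hold the channel count and bytes 24-27 the sample rate, both
# little-endian. Files with extra chunks before "fmt " need a real parser.
wav_info() {
  channels=$(od -An -tu2 -j22 -N2 "$1" | tr -d ' \n')
  rate=$(od -An -tu4 -j24 -N4 "$1" | tr -d ' \n')
  echo "channels=$channels sample_rate=$rate"
}

# Usage:
# wav_info output.wav
```

A streamed WAV may carry placeholder size fields in its header, which is why this check reads only the format fields and ignores the declared data length.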
Known limitations
- Output is 24 kHz mono.
- Shares the OpenMOSS-Team/MOSS-Audio-Tokenizer codec (~7 GB, auto-downloaded; override with MOSS_TTS_CODEC_PATH).
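For hosts without network access at startup, the codec override might be used as sketched below. This assumes MOSS_TTS_CODEC_PATH accepts a local directory containing the downloaded codec, which the text above does not spell out. The helper name `serve_with_local_codec` and the example directory are hypothetical.

```shell
# Launch the server with the codec loaded from a local directory.
# $1 is the codec directory; remaining arguments are passed to `vllm serve`.
# Assumption: MOSS_TTS_CODEC_PATH points at a local copy of the codec repo.
serve_with_local_codec() {
  codec_dir="$1"
  shift
  if [ ! -d "$codec_dir" ]; then
    echo "codec dir not found: $codec_dir" >&2
    return 1
  fi
  MOSS_TTS_CODEC_PATH="$codec_dir" \
    vllm serve OpenMOSS-Team/MOSS-TTS-Realtime --omni "$@"
}

# Usage (the directory is a hypothetical location; fetch the codec first,
# for example with: huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer
#   --local-dir "$HOME/models/MOSS-Audio-Tokenizer"):
# serve_with_local_codec "$HOME/models/MOSS-Audio-Tokenizer" --port 8000
```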