OpenMOSS-Team/MOSS-TTSD-v1.0
OpenMOSS's 8B spoken-dialogue generation model for expressive, multi-speaker, ultra-long conversations — served via vLLM-Omni through the OpenAI /v1/audio/speech API (24 kHz mono).
Guide
Overview
MOSS-TTSD-v1.0 is the
spoken-dialogue member of OpenMOSS's MOSS-TTS Family, served through
vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates
expressive, multi-speaker, ultra-long dialogues and (per OpenMOSS) leads on
objective metrics while outperforming top closed-source systems in subjective
evaluations. Output is 24 kHz mono.
Dialogue formatting (speaker turns, e.g. [S1] ... [S2] ...) and any
multi-speaker reference conditioning follow the upstream
MOSS-TTSD conventions — consult the
upstream repo for the exact turn/reference schema.
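As a rough illustration of the turn format above, the sketch below joins (speaker, text) pairs into a single `[S1] ... [S2] ...` string. The helper name `format_dialogue` is hypothetical and not part of MOSS-TTSD or vLLM-Omni, and the single-space separator between turns is an assumption; the upstream repo remains the authority on the exact schema.

```python
from typing import Iterable, Tuple


def format_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, text) pairs into '[S1] ... [S2] ...' form.

    Hypothetical helper: the tag style follows the example in this guide,
    not a verified upstream specification.
    """
    parts = []
    for speaker, text in turns:
        if speaker < 1:
            raise ValueError("speaker indices start at 1")
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if not cleaned:
            continue  # skip empty turns instead of emitting a bare tag
        parts.append(f"[S{speaker}] {cleaned}")
    return " ".join(parts)


script = format_dialogue([
    (1, "Hi, how was your day?"),
    (2, "Pretty good, thanks for asking!"),
])
print(script)
```

The resulting string can be passed as the `input` field of a /v1/audio/speech request.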
Prerequisites
- Hardware: a single CUDA GPU comparable to the 8B MOSS-TTS profile (~18 GB talker + ~8 GB codec; e.g. an 80 GB H100 with headroom for long dialogues).
- vLLM-Omni targeting vLLM >= 0.22.
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
MOSS-TTSD shares model_type=moss_tts_delay with the other MossTTSDelay
checkpoints, so pass its deploy config explicitly:
vllm serve OpenMOSS-Team/MOSS-TTSD-v1.0 --omni \
    --deploy-config vllm_omni/deploy/moss_ttsd.yaml --port 8000
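Loading an 8B model plus its codec can take a while, so a client script may want to wait for the server before sending requests. The sketch below polls a readiness URL until it answers with HTTP 200. It assumes the server exposes vLLM's usual `GET /health` endpoint on the port used above, which this guide does not confirm for vLLM-Omni; `wait_until_ready` and `http_ok` are illustrative names.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False  # server not up yet, or not answering cleanly


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (assumed endpoint, not verified for vLLM-Omni):
# wait_until_ready("http://localhost:8000/health")
```

The probe is injectable so the polling loop can be exercised without a running server.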
Client usage
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "OpenMOSS-Team/MOSS-TTSD-v1.0",
        "input": "[S1] Hi, how was your day? [S2] Pretty good, thanks for asking!",
        "voice": "default",
        "response_format": "wav"
    }' --output dialogue.wav
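The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above field for field; the function names `build_speech_request` and `synthesize` are illustrative, and the generous default timeout is an assumption for long dialogues rather than a documented requirement.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8000",
    model: str = "OpenMOSS-Team/MOSS-TTSD-v1.0",
    voice: str = "default",
    response_format: str = "wav",
) -> urllib.request.Request:
    """Build the POST request for /v1/audio/speech (same body as the curl call)."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, timeout_s: float = 600.0, **kwargs) -> int:
    """Send the request, write the audio bytes to `out_path`, return the byte count."""
    req = build_speech_request(text, **kwargs)
    # urlopen raises HTTPError on non-2xx responses before the file is created.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp, open(out_path, "wb") as f:
        data = resp.read()
        f.write(data)
    return len(data)


# Example, with the server from "Launch the server" running:
# synthesize("[S1] Hi, how was your day? [S2] Pretty good, thanks for asking!",
#            "dialogue.wav")
```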
Known limitations
- Output is 24 kHz mono.
- Shares the OpenMOSS-Team/MOSS-Audio-Tokenizer codec (~7 GB, auto-downloaded; override with MOSS_TTS_CODEC_PATH).
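To confirm that a returned file matches the 24 kHz mono format noted above, a quick header check with Python's `wave` module is usually enough. This sketch assumes the server returns a standard PCM WAV with a correct length field; a streamed response with a placeholder length would report a misleading duration. `check_wav` is an illustrative name.

```python
import io
import wave


def check_wav(data: bytes, rate: int = 24000, channels: int = 1) -> float:
    """Validate sample rate and channel count; return the duration in seconds."""
    with wave.open(io.BytesIO(data), "rb") as w:
        if w.getframerate() != rate:
            raise ValueError(f"expected {rate} Hz, got {w.getframerate()} Hz")
        if w.getnchannels() != channels:
            raise ValueError(f"expected {channels} channel(s), got {w.getnchannels()}")
        return w.getnframes() / w.getframerate()


# Example:
# with open("dialogue.wav", "rb") as f:
#     print(f"{check_wav(f.read()):.2f} s of audio")
```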