OpenMOSS-Team/MOSS-TTSD-v1.0

OpenMOSS's 8B spoken-dialogue generation model for expressive, multi-speaker, ultra-long conversations — served via vLLM-Omni through the OpenAI /v1/audio/speech API (24 kHz mono).


dense · 8B · 0 ctx · vLLM 0.22.0+ · vLLM-Omni · nightly · omni

Guide

Overview

MOSS-TTSD-v1.0 is the spoken-dialogue member of OpenMOSS's MOSS-TTS Family, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates expressive, multi-speaker, ultra-long dialogues. According to OpenMOSS, it leads on objective metrics and outperforms top closed-source systems in subjective evaluations. Output is 24 kHz mono.

Dialogue formatting (speaker turns, e.g. [S1] ... [S2] ...) and any multi-speaker reference conditioning follow the upstream MOSS-TTSD conventions — consult the upstream repo for the exact turn/reference schema.
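As a rough illustration of the turn-tag style shown in this guide, the sketch below joins (speaker, text) pairs into a single input string. The helper name format_dialogue is hypothetical, and the [S<n>] tag shape is taken only from the example in this guide; the upstream repo remains the authority on the exact schema.

```python
def format_dialogue(turns):
    """Join (speaker_number, text) pairs into one "[S1] ... [S2] ..." string.

    Assumption: speakers are tagged as [S<n>] with n >= 1, as in this guide's
    example. The real schema is defined upstream and may differ.
    """
    parts = []
    for speaker, text in turns:
        if not isinstance(speaker, int) or speaker < 1:
            raise ValueError(f"speaker must be a positive integer, got {speaker!r}")
        # Collapse newlines and repeated spaces so each turn stays on one line.
        cleaned = " ".join(text.split())
        if not cleaned:
            raise ValueError("turn text must not be empty")
        parts.append(f"[S{speaker}] {cleaned}")
    return " ".join(parts)
```

For example, format_dialogue([(1, "Hi, how was your day?"), (2, "Pretty good, thanks for asking!")]) produces the same input string used in the curl request in this guide.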

Prerequisites

Hardware: a single CUDA GPU with memory comparable to the 8B MOSS-TTS profile (~18 GB for the talker plus ~8 GB for the codec, about 26 GB combined); an 80 GB H100, for example, leaves headroom for long dialogues.
vLLM-Omni targeting vLLM >= 0.22.

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

MOSS-TTSD shares model_type=moss_tts_delay with the other MossTTSDelay checkpoints, so the model type alone does not identify which deploy config applies. Pass the MOSS-TTSD deploy config explicitly:

vllm serve OpenMOSS-Team/MOSS-TTSD-v1.0 --omni \
  --deploy-config vllm_omni/deploy/moss_ttsd.yaml --port 8000
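Startup may take some time, so a client script may want to wait before sending its first request. The following is a minimal sketch, assuming the server exposes the same GET /health endpoint as upstream vLLM's OpenAI-compatible server; that endpoint is an assumption for vLLM-Omni, and wait_until_ready is a hypothetical helper rather than part of either project.

```python
import time
import urllib.request


def wait_until_ready(url="http://localhost:8000/health", timeout_s=600.0, interval_s=2.0):
    """Poll `url` until it answers HTTP 200 or `timeout_s` seconds elapse.

    Assumption: a 200 response from /health means the server can accept
    requests, as with upstream vLLM. Returns True if ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError and HTTPError are OSError subclasses: the server is
            # not listening yet or returned an error status, so retry.
            pass
        time.sleep(interval_s)
    return False
```

If the health path differs in your build, pass another URL (for instance one you have confirmed responds once the model is loaded).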

Client usage

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTSD-v1.0",
    "input": "[S1] Hi, how was your day? [S2] Pretty good, thanks for asking!",
    "voice": "default",
    "response_format": "wav"
  }' --output dialogue.wav
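The same request can be sent from Python. This sketch uses only the standard library and mirrors the curl payload above field for field; the helper names build_speech_request and synthesize are hypothetical, and the generous default timeout is an assumption intended for long dialogues.

```python
import json
import urllib.request

MODEL = "OpenMOSS-Team/MOSS-TTSD-v1.0"


def build_speech_request(text, base_url="http://localhost:8000"):
    """Build the POST request for /v1/audio/speech with the curl example's fields."""
    payload = {
        "model": MODEL,
        "input": text,
        "voice": "default",
        "response_format": "wav",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="dialogue.wav", timeout_s=600):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(text)
    # urlopen raises urllib.error.HTTPError on non-2xx responses.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

With the server running, synthesize("[S1] Hi, how was your day? [S2] Pretty good, thanks for asking!") writes dialogue.wav in the current directory.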

Known limitations

Output is 24 kHz mono only.
The model relies on the shared OpenMOSS-Team/MOSS-Audio-Tokenizer codec (~7 GB), which is downloaded automatically; override with MOSS_TTS_CODEC_PATH.
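To confirm that a saved file matches the 24 kHz mono output format, its header can be inspected with Python's standard wave module. This is a sketch that assumes the server returns PCM-encoded WAV, which is what the wave module reads; check_wav_format is a hypothetical helper name.

```python
import wave


def check_wav_format(path, expected_rate=24000, expected_channels=1):
    """Verify sample rate and channel count; return the duration in seconds.

    Assumption: the file is PCM WAV. Other encodings make wave.open raise
    wave.Error instead of reaching the checks below.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    if rate != expected_rate or channels != expected_channels:
        raise ValueError(
            f"expected {expected_rate} Hz / {expected_channels} channel(s), "
            f"got {rate} Hz / {channels} channel(s)"
        )
    return frames / rate
```

Calling check_wav_format("dialogue.wav") on the file saved by the curl example returns the clip length in seconds, or raises ValueError if the format differs.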