OpenMOSS-Team/MOSS-TTS

Flagship 8B model of the OpenMOSS MOSS-TTS Family — high-fidelity zero-shot voice cloning, ultra-long stable speech, token-level duration and phoneme control, 20-language code-switched synthesis — served via vLLM-Omni through the OpenAI /v1/audio/speech API (24 kHz mono).


dense | 8B | 0 ctx | vLLM 0.22.0+ | vLLM-Omni | nightly | omni

Guide

Overview

MOSS-TTS is the flagship 8B model of OpenMOSS's MOSS-TTS Family, a production-grade TTS foundation model served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It delivers high-fidelity zero-shot voice cloning, ultra-long stable speech (up to ~1 hour), token-level duration control, Pinyin/IPA phoneme control, and 20-language / code-switched synthesis. Output is 24 kHz mono via the shared OpenMOSS-Team/MOSS-Audio-Tokenizer codec (~7 GB, auto-downloaded; override with MOSS_TTS_CODEC_PATH).

The MOSS-TTS Family ships five production-ready models, each with its own recipe: MOSS-TTS (this page, flagship TTS), MOSS-TTS-Realtime (low-latency streaming), MOSS-TTSD (multi-speaker dialogue), MOSS-VoiceGenerator (zero-shot voice design), and MOSS-SoundEffect (sound-effect synthesis).

MOSS-TTS is released in two architectures: the 8B MossTTSDelay (this checkpoint, production-recommended) and a 1.7B MossTTSLocal (OpenMOSS-Team/MOSS-TTS-Local-Transformer) tuned for streaming-oriented systems. A MOSS-TTS-v1.5 point release adds 31-language support and steadier cloning with [pause Xs] markers (same serving path; set language for best results).

Prerequisites

Hardware: a single CUDA GPU. Reference run: 1x H100 80 GB, with ~18 GB for the 8B talker plus ~8 GB for the codec decoder on the same device, ~26 GB combined (gpu_memory_utilization: 0.85 in moss_tts.yaml).
Software: vLLM-Omni targeting vLLM >= 0.22.

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

All MossTTSDelay checkpoints (MOSS-TTS, MOSS-TTSD, MOSS-SoundEffect, MOSS-VoiceGenerator) share model_type=moss_tts_delay, so the deploy config must be passed explicitly to select the right pipeline:

vllm serve OpenMOSS-Team/MOSS-TTS --omni \
  --deploy-config vllm_omni/deploy/moss_tts.yaml --port 8000
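
If a local copy of the OpenMOSS-Team/MOSS-Audio-Tokenizer codec is already available, MOSS_TTS_CODEC_PATH can be set before launching so that the ~7 GB download is skipped. The helper below is a minimal sketch: the function name use_local_codec and the example directory are hypothetical, and it assumes the variable accepts a plain directory path.

```shell
# Hypothetical helper: export MOSS_TTS_CODEC_PATH only when the directory exists.
# $1 = directory holding a local copy of the codec checkpoint (assumed layout).
use_local_codec() {
  if [ -d "$1" ]; then
    export MOSS_TTS_CODEC_PATH="$1"
    return 0
  fi
  # Leave the variable untouched so the codec is auto-downloaded as usual.
  echo "no codec directory at $1; falling back to auto-download" >&2
  return 1
}

# Usage (commented out so that sourcing this snippet has no side effects):
#   use_local_codec "$HOME/models/MOSS-Audio-Tokenizer"
#   vllm serve OpenMOSS-Team/MOSS-TTS --omni \
#     --deploy-config vllm_omni/deploy/moss_tts.yaml --port 8000
```

Checking for the directory first keeps a mistyped path from silently pointing the server at a location that does not exist.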

Client usage

Voice cloning (curl)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTS",
    "input": "Hello, this is a voice cloning test.",
    "voice": "default",
    "ref_audio": "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS/main/assets/audio/zh_1.wav",
    "response_format": "wav"
  }' --output output.wav
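
For more than one request, the same call can be wrapped in small shell functions. This is a sketch under stated assumptions: json_escape, build_payload, synthesize, and the MOSS_TTS_BASE_URL variable are hypothetical names introduced here, the escaper handles only backslashes and double quotes (not newlines or control characters), and the request fields are the ones shown in the curl example above.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print the JSON body for one request.
# $1 = text to synthesize, $2 = URL of the reference audio to clone.
build_payload() {
  printf '{"model":"%s","input":"%s","voice":"default","ref_audio":"%s","response_format":"wav"}' \
    "OpenMOSS-Team/MOSS-TTS" "$(json_escape "$1")" "$(json_escape "$2")"
}

# Send the request and write the returned audio to a file.
# $1 = text, $2 = reference audio URL, $3 = output file.
synthesize() {
  build_payload "$1" "$2" | curl -sS --fail -X POST \
    "${MOSS_TTS_BASE_URL:-http://localhost:8000}/v1/audio/speech" \
    -H "Content-Type: application/json" \
    --data-binary @- --output "$3"
}

# Usage (needs the server from the previous section to be running):
#   synthesize "Hello again." \
#     "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS/main/assets/audio/zh_1.wav" \
#     hello.wav
```

Passing --fail makes curl return a non-zero status on an HTTP error instead of saving the error body as if it were audio.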

Known limitations

Output is 24 kHz mono.
MOSS_TTS_CODEC_PATH overrides the codec checkpoint location if you have a local copy.
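
To confirm that a returned file really is 24 kHz mono, the WAV header can be read directly. The functions below are a rough sketch with hypothetical names. They assume a canonical PCM WAV layout, where the fmt chunk immediately follows the RIFF header (channel count at byte 22, sample rate at byte 24), and a little-endian host, since od prints integers in host byte order.

```shell
# Read an unsigned integer field from a file and print only its digits.
# $1 = file, $2 = byte offset, $3 = field width in bytes (2 or 4).
wav_field() {
  od -An -tu"$3" -j"$2" -N"$3" "$1" | tr -dc '0-9'
}

# Field positions in a canonical WAV header (assumed layout, see above).
wav_channels()    { wav_field "$1" 22 2; }
wav_sample_rate() { wav_field "$1" 24 4; }

# Succeed only if the header reports 24000 Hz and a single channel.
check_moss_wav() {
  [ "$(wav_sample_rate "$1")" = "24000" ] && [ "$(wav_channels "$1")" = "1" ]
}

# Usage:
#   check_moss_wav output.wav && echo "24 kHz mono"
```

If the server ever emits extra chunks before fmt, these fixed offsets would no longer apply and a proper audio tool would be the safer check.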