OpenMOSS-Team/MOSS-TTS-Realtime

OpenMOSS's 1.7B real-time streaming TTS model for low-latency voice agents (TTFB ~180 ms). It performs multi-turn, context-aware incremental synthesis and is served via vLLM-Omni through the OpenAI-compatible /v1/audio/speech API (24 kHz mono output).

dense · 1.7B · 0 ctx · vLLM 0.22.0+ · vLLM-Omni · nightly · omni

Guide

Overview

MOSS-TTS-Realtime is the 1.7B real-time streaming member of OpenMOSS's MOSS-TTS Family, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It is a multi-turn, context-aware model for low-latency voice agents and uses a local depth transformer (no delay warm-up) for incremental synthesis. TTFB (time to first byte) is ~180 ms, and the combined latency of the LLM's first sentence plus TTS TTFB is ~377 ms. Output is 24 kHz mono.
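To check the TTFB figure on your own hardware, a small timing helper is enough. The sketch below is a minimal illustration, not part of vLLM-Omni: `measure_ttfb` is a hypothetical name, and it assumes you hand it a lazy iterator over the response body, so the clock starts before the first chunk is requested.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttfb(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Consume a byte stream and time the arrival of the first non-empty chunk.

    Returns (seconds_to_first_byte, total_bytes). The first value is None
    when the stream yields no data at all.
    """
    start = clock()
    ttfb = None
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttfb is None:
            ttfb = clock() - start
        total += len(chunk)
    return ttfb, total
```

For the number to include network and server time, `chunks` should be a generator that sends the request on its first iteration. Passing an already-buffered list only measures local iteration overhead.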

Prerequisites

Hardware: a single CUDA GPU. Reference run: 1x A10G 24 GB, using ~6 GB for the 1.7B talker plus ~8 GB for the codec decoder (~14 GB total).
vLLM-Omni targeting vLLM >= 0.22.

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

MOSS-TTS-Realtime uses model_type=moss_tts_realtime, so its deploy config (vllm_omni/deploy/moss_tts_realtime.yaml, which sets codec_chunk_frames: 15 for low time to first audio, TTFA) is auto-discovered:

vllm serve OpenMOSS-Team/MOSS-TTS-Realtime --omni --port 8000
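The first launch can take a while because the model and codec weights are downloaded, so a client may want to wait until the server answers before sending requests. The following is a minimal polling sketch using only the standard library. It assumes the server exposes the `/health` route that vLLM's OpenAI-compatible server provides; if your vLLM-Omni build does not, any endpoint that returns HTTP 200 once loading has finished can be substituted. `http_ok` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(
    url: str,
    deadline_s: float = 600.0,
    interval_s: float = 2.0,
    probe=http_ok,
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Poll `url` until `probe` succeeds or `deadline_s` seconds have passed."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready("http://localhost:8000/health")`, with the port matching the `--port` value passed to `vllm serve`.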

Client usage

Streaming voice clone (curl)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-TTS-Realtime",
    "input": "This is a low-latency streaming TTS test.",
    "voice": "default",
    "ref_audio": "https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS/main/assets/audio/zh_1.wav",
    "response_format": "wav",
    "stream": true
  }' --output output.wav
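The same request can be issued from Python. The sketch below mirrors the curl payload field for field and uses only the standard library; the helper names `build_speech_request` and `stream_to_file` are illustrative, and the base URL assumes the server launched above on port 8000.

```python
import json
import urllib.request

MODEL = "OpenMOSS-Team/MOSS-TTS-Realtime"


def build_speech_request(text, ref_audio=None, base_url="http://localhost:8000"):
    """Build a POST request for /v1/audio/speech with the curl example's fields."""
    payload = {
        "model": MODEL,
        "input": text,
        "voice": "default",
        "response_format": "wav",
        "stream": True,
    }
    if ref_audio is not None:
        payload["ref_audio"] = ref_audio  # URL of the reference clip to clone
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_to_file(request, path, chunk_size=4096):
    """Send the request and write the audio to `path` as it arrives.

    Returns the number of bytes written.
    """
    written = 0
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Usage: `stream_to_file(build_speech_request("This is a low-latency streaming TTS test.", ref_audio="https://raw.githubusercontent.com/OpenMOSS/MOSS-TTS/main/assets/audio/zh_1.wav"), "output.wav")`. Writing chunks as they arrive keeps memory use flat; a voice agent would feed them to an audio player instead of a file.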

Known limitations

Output is 24 kHz mono.
Shares the OpenMOSS-Team/MOSS-Audio-Tokenizer codec (~7 GB, auto-downloaded; override with MOSS_TTS_CODEC_PATH).
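Because the output format is fixed, downstream code that expects another sample rate or stereo audio has to resample or upmix. A quick way to confirm what a saved file contains is Python's built-in `wave` module. The sketch below is illustrative (`wav_format` and `is_expected_format` are hypothetical names) and reads only the header fields. A streamed WAV may carry a placeholder length in its header, so frame counts from such a file should not be relied on.

```python
import wave

EXPECTED_RATE_HZ = 24_000  # sample rate stated in this guide
EXPECTED_CHANNELS = 1      # mono


def wav_format(path):
    """Return (sample_rate_hz, channel_count) read from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_expected_format(path):
    """True if the file is 24 kHz mono, as this guide says the server emits."""
    return wav_format(path) == (EXPECTED_RATE_HZ, EXPECTED_CHANNELS)
```

For example, `is_expected_format("output.wav")` can guard a playback pipeline before audio is passed on.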

References

Model card
MOSS-TTS Family · GitHub