OpenMOSS-Team/MOSS-VoiceGenerator

OpenMOSS's 1.7B zero-shot voice-design model. It generates diverse voices and styles directly from a text prompt, with no reference speech, and is served via vLLM-Omni through the OpenAI-compatible /v1/audio/speech API (24 kHz mono output).


dense · 1.7B · 0 ctx · vLLM 0.22.0+ · vLLM-Omni · nightly · omni


Overview

MOSS-VoiceGenerator is the voice-design member of OpenMOSS's MOSS-TTS Family, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates diverse voices and styles directly from a text prompt, with no reference speech, unifying voice design, style control, and synthesis. It can run standalone or as a design layer that produces a voice for the downstream TTS members of the family. Output is 24 kHz mono.

Prerequisites

Hardware: a single CUDA GPU with roughly 14 GB of memory: ~6 GB for the 1.7B talker plus ~8 GB for the codec decoder (comparable to MOSS-TTS-Realtime).
vLLM-Omni targeting vLLM >= 0.22.
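
As a rough pre-flight check, the memory figures above can be compared against what the local GPUs report. The sketch below is illustrative only: the helper names are hypothetical, it treats the ~6 GB + ~8 GB figures as GiB, and it looks at total rather than free memory, so a GPU that passes may still be too busy to host the model.

```python
import subprocess

# Rough budget from the figures above: ~6 GB talker + ~8 GB codec decoder.
# Assumption: GB is treated as GiB here, which is close enough for a sanity check.
REQUIRED_GIB = 6 + 8


def parse_total_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]


def gpus_with_enough_memory(
    totals_mib: list[int], required_gib: float = REQUIRED_GIB
) -> list[int]:
    """Return the indices of GPUs whose total memory meets the budget."""
    return [i for i, mib in enumerate(totals_mib) if mib / 1024 >= required_gib]


def query_total_memory_mib() -> list[int]:
    """Ask nvidia-smi for total memory per GPU (needs the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_total_memory_mib(out)
```

On a machine with an NVIDIA driver installed, `gpus_with_enough_memory(query_total_memory_mib())` lists the candidate GPU indices.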

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

MOSS-VoiceGenerator shares its model_type (moss_tts_delay) with other models, so the right deploy config cannot be picked from the model type alone. Pass it explicitly:

vllm serve OpenMOSS-Team/MOSS-VoiceGenerator --omni \
  --deploy-config vllm_omni/deploy/moss_voice_generator.yaml --port 8000
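
Loading the model takes a while, so a client script may want to wait until the server answers before sending requests. The following is a minimal sketch, assuming the vLLM-Omni server exposes the same GET /health endpoint as the standard vLLM OpenAI-compatible server; the function names and retry defaults are illustrative.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 60,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False
```

With the launch command above, `wait_until_ready("http://localhost:8000/health")` would block for up to about five minutes and return whether the server came up.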

Client usage

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-VoiceGenerator",
    "input": "Hello, this voice was designed entirely from a text prompt.",
    "instructions": "A warm, friendly young female voice with a calm pace.",
    "response_format": "wav"
  }' --output designed.wav
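
The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above field for field; the function names, the default base URL, and the timeout are illustrative choices, not part of the server's API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches --port 8000 in the launch command
MODEL = "OpenMOSS-Team/MOSS-VoiceGenerator"


def build_speech_request(
    text: str, instructions: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {
        "model": MODEL,
        "input": text,                 # the text to be spoken
        "instructions": instructions,  # the voice description
        "response_format": "wav",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(
    text: str, instructions: str, out_path: str, timeout: float = 300.0
) -> int:
    """Send the request, write the audio to out_path, return the byte count."""
    req = build_speech_request(text, instructions)
    # urlopen raises urllib.error.HTTPError on non-2xx responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

For example, `synthesize("Hello, this voice was designed entirely from a text prompt.", "A warm, friendly young female voice with a calm pace.", "designed.wav")` reproduces the curl example once the server is running.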

Known limitations

Output is 24 kHz mono.
No reference speech is used: voice identity comes entirely from the text prompt (the instructions field). The exact prompt schema follows the upstream model card.
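
To confirm that a generated file matches the stated output format, the WAV header can be inspected with Python's standard wave module. This is a sketch under the assumption that the server returns a PCM WAV file with a complete header; if the response is streamed with a placeholder frame count, the reported duration will not be meaningful. The function names are illustrative.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read channel count, sample rate, and duration from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }


def is_expected_format(info: dict) -> bool:
    """Check for the documented output format: 24 kHz, mono."""
    return info["channels"] == 1 and info["sample_rate"] == 24000
```

For a file on disk, read it in binary mode first, for example `describe_wav(open("designed.wav", "rb").read())`.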

References

Model card
MOSS-TTS Family · GitHub