OpenMOSS-Team/MOSS-VoiceGenerator
OpenMOSS's 1.7B zero-shot voice-design model — generate diverse voices and styles directly from a text prompt with no reference speech — served via vLLM-Omni through the OpenAI /v1/audio/speech API (24 kHz mono).
Overview
MOSS-VoiceGenerator is the voice-design member of OpenMOSS's MOSS-TTS Family, served through
vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates
diverse voices and styles directly from a text prompt, with no reference
speech, unifying voice design, style control, and synthesis. It can run
standalone or as a design layer that produces a voice for the downstream TTS
members of the family. Output is 24 kHz mono.
Prerequisites
- Hardware: a single CUDA GPU with roughly 14 GB of free memory: ~6 GB for the 1.7B talker plus ~8 GB for the codec decoder (comparable to MOSS-TTS-Realtime).
- vLLM-Omni targeting vLLM >= 0.22.
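As a pre-flight check, the memory estimates above can be compared against what the GPU reports. The following is a minimal Python sketch, not part of vLLM-Omni: the helper names are hypothetical, the budget is the sum of the two estimates with no extra headroom, and it treats the guide's GB figures as GiB for simplicity.

```python
import subprocess

# Rough budget from the prerequisites: ~6 GB talker + ~8 GB codec decoder.
# These are the guide's estimates, not measured values.
REQUIRED_GIB = 6 + 8


def parse_free_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one free-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_with_enough_memory(free_mib: list[int], required_gib: float = REQUIRED_GIB) -> list[int]:
    """Return the indices of GPUs whose free memory meets the budget."""
    return [i for i, mib in enumerate(free_mib) if mib / 1024 >= required_gib]


def query_free_mib() -> list[int]:
    """Ask nvidia-smi for free memory per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mib(out)

# Example (on a machine with an NVIDIA GPU):
#   print(gpus_with_enough_memory(query_free_mib()))
```

An empty result suggests freeing GPU memory or choosing a larger card before launching the server.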
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
MOSS-VoiceGenerator shares its model_type (moss_tts_delay) with other models, so
the right deploy config cannot be selected from the model type alone. Pass it
explicitly:
vllm serve OpenMOSS-Team/MOSS-VoiceGenerator --omni \
  --deploy-config vllm_omni/deploy/moss_voice_generator.yaml --port 8000
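Model loading can take a while, so a client script may want to wait until the server answers before sending requests. The sketch below polls a URL until it returns HTTP 200. It assumes the server exposes the /health route that vLLM's OpenAI-compatible server provides; this guide does not confirm that for vLLM-Omni, so substitute any cheap GET route if it is missing. The function names and retry settings are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False if all attempts fail."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False

# Example, assuming the /health route exists on this server:
#   wait_until_ready("http://localhost:8000/health")
```

The probe and sleep functions are parameters so the polling logic can be exercised without a running server.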
Client usage
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-VoiceGenerator",
    "input": "Hello, this voice was designed entirely from a text prompt.",
    "instructions": "A warm, friendly young female voice with a calm pace.",
    "response_format": "wav"
  }' --output designed.wav
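The same request can be sent from Python. This is a minimal sketch using only the standard library; it mirrors the curl call above field for field, and the function names are hypothetical rather than part of any client library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches --port 8000 in the launch command


def build_speech_request(text, instructions, base_url=BASE_URL):
    """Build the POST request for /v1/audio/speech with the curl example's fields."""
    payload = {
        "model": "OpenMOSS-Team/MOSS-VoiceGenerator",
        "input": text,                 # the text to speak
        "instructions": instructions,  # the voice description prompt
        "response_format": "wav",
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, instructions, out_path="designed.wav"):
    """Send the request and write the returned audio bytes to `out_path`."""
    req = build_speech_request(text, instructions)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path

# Example (requires the server to be running):
#   synthesize("Hello, this voice was designed entirely from a text prompt.",
#              "A warm, friendly young female voice with a calm pace.")
```

Keeping request construction separate from the network call makes it easy to vary the instructions text and compare the resulting voices.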
Known limitations
- Output is 24 kHz mono.
- No reference speech is used; voice identity comes entirely from the text prompt
supplied in the instructions field. For the exact prompt schema, see the upstream model card.
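To confirm that a generated file matches the stated output format, the WAV header can be inspected with Python's standard wave module. This sketch assumes the server returns a standard PCM WAV file, which the wave module can read; other encodings would raise wave.Error. The helper names are hypothetical.

```python
import wave


def describe_wav(path):
    """Read sample rate, channel count, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {"sample_rate": rate, "channels": channels, "duration_s": frames / rate}


def is_24khz_mono(path):
    """Return True if the file is 24 kHz, single channel."""
    info = describe_wav(path)
    return info["sample_rate"] == 24000 and info["channels"] == 1

# Example, after running the client request:
#   print(describe_wav("designed.wav"))
```

If downstream tooling expects a different rate or channel layout, resample after generation.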