OpenMOSS-Team/MOSS-SoundEffect
OpenMOSS's 8B sound-effect generation model — environmental, urban, biological, human-action and musical sounds with controllable duration, no reference audio — served via vLLM-Omni through the OpenAI /v1/audio/speech API (24 kHz mono).
Guide
Overview
MOSS-SoundEffect is the
sound-effect generation member of OpenMOSS's MOSS-TTS Family, served through
vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates
audio for natural environments, urban scenes, biological sounds, human actions,
and musical fragments — suitable for film, games, and interactive experiences.
Output is 24 kHz mono.
Unlike the speech members of the family, MOSS-SoundEffect takes no reference
audio: the input field is a text description of the sound (mapped to the
upstream ambient_sound parameter). Output is generated at roughly 12.5 tokens
per second of audio, so clip length scales with the number of generated tokens:
longer descriptions (or a higher token budget) produce longer audio.
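The relationship between clip length and token count can be sketched as simple arithmetic. This is a minimal illustration, assuming the ~12.5 figure is a constant rate of tokens per second of generated audio; the function names are hypothetical and not part of vLLM-Omni or MOSS-SoundEffect, and the request field that carries a token budget is not specified here, so the sketch stops at the conversion.

```python
import math

# Approximate rate quoted in this guide (assumed constant for the sketch).
TOKENS_PER_SECOND = 12.5


def tokens_for_duration(seconds: float) -> int:
    """Estimate the token budget needed for a clip of the given length."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Round up so the budget never falls short of the target length.
    return math.ceil(seconds * TOKENS_PER_SECOND)


def duration_for_tokens(tokens: int) -> float:
    """Estimate the clip length in seconds produced by a token budget."""
    return tokens / TOKENS_PER_SECOND


print(tokens_for_duration(10))   # 125
print(duration_for_tokens(250))  # 20.0
```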
Prerequisites
- Hardware: a single CUDA GPU with memory comparable to the 8B MOSS-TTS profile (~18 GB for the talker plus ~8 GB for the codec, roughly 26 GB in total).
- vLLM-Omni targeting vLLM >= 0.22.
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
MOSS-SoundEffect shares model_type=moss_tts_delay with other members of the
MOSS-TTS Family, so pass its own deploy config explicitly:
vllm serve OpenMOSS-Team/MOSS-SoundEffect --omni \
  --deploy-config vllm_omni/deploy/moss_sound_effect.yaml --port 8000
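Loading an 8B model can take a while, so scripts that start the server and then send requests may want to wait until it is reachable. The following is a minimal sketch, assuming the server listens on localhost:8000 as in the command above; wait_for_port is a hypothetical helper, and an open TCP port is only a rough readiness signal, not a guarantee that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait and retry.
            time.sleep(interval)
    return False
```

Calling wait_for_port() before the first request avoids connection-refused errors during startup.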
Client usage
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-SoundEffect",
    "input": "Thunder rumbling, rain pattering on a tin roof.",
    "response_format": "wav"
  }' --output thunder.wav
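The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server from the launch command is running on localhost:8000; build_request and generate are illustrative names, and the payload carries only the three fields shown in the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches --port 8000 in the launch command
MODEL = "OpenMOSS-Team/MOSS-SoundEffect"


def build_request(description: str,
                  response_format: str = "wav") -> urllib.request.Request:
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {
        "model": MODEL,
        "input": description,  # text description of the sound
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(description: str, out_path: str, timeout: float = 300.0) -> int:
    """Send the request, write the returned audio to out_path, return its size."""
    with urllib.request.urlopen(build_request(description), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)


# Example (requires a running server):
# generate("Thunder rumbling, rain pattering on a tin roof.", "thunder.wav")
```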
Known limitations
- Output is 24 kHz mono.
- No ref_audio is accepted; input is a text description of the sound.
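A downloaded clip can be checked against the 24 kHz mono format with Python's standard wave module. This is a sketch that assumes the server returns a plain PCM WAV file with a complete header; wav_info and is_24k_mono are hypothetical helper names.

```python
import io
import wave


def wav_info(data: bytes) -> tuple[int, int, float]:
    """Return (sample_rate, channels, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        frames = w.getnframes()
    return rate, channels, frames / rate


def is_24k_mono(data: bytes) -> bool:
    """Check that the clip matches the documented 24 kHz mono output."""
    rate, channels, _ = wav_info(data)
    return rate == 24000 and channels == 1


# Example, given a file saved by the client:
# with open("thunder.wav", "rb") as f:
#     print(wav_info(f.read()))
```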