OpenMOSS-Team/MOSS-SoundEffect

OpenMOSS's 8B sound-effect generation model — environmental, urban, biological, human-action and musical sounds with controllable duration, no reference audio — served via vLLM-Omni through the OpenAI /v1/audio/speech API (24 kHz mono).

Overview

MOSS-SoundEffect is the sound-effect generation member of OpenMOSS's MOSS-TTS Family, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates audio for natural environments, urban scenes, biological sounds, human actions, and musical fragments — suitable for film, games, and interactive experiences. Output is 24 kHz mono.

Unlike the speech members of the family, MOSS-SoundEffect takes no reference audio: the input field is a text description of the sound (mapped to the upstream ambient_sound parameter). Audio is produced at roughly 12.5 tokens per second of output, so clip length scales with the number of generated tokens; longer descriptions (or a higher token budget) produce longer audio.
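To get a feel for the numbers, the sketch below converts a target clip length into an approximate token count. It assumes the ~12.5 tokens/second figure is the number of audio tokens per second of generated sound. The helper name tokens_for_seconds is made up for illustration, and the sketch does not cover how a token budget is actually passed to the server.

```shell
# Assumption: ~12.5 audio tokens correspond to one second of output audio.
# tokens_for_seconds is a hypothetical helper, not part of vLLM-Omni.
tokens_for_seconds() {
  # Integer math: seconds * 12.5, rounded up to a whole token.
  echo $(( ($1 * 125 + 9) / 10 ))
}

tokens_for_seconds 10   # estimate for a 10 s clip
tokens_for_seconds 3    # estimate for a 3 s clip
```

Under that assumption, a 10-second clip corresponds to about 125 tokens and a 3-second clip to about 38.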

Prerequisites

  • Hardware: a single CUDA GPU with enough memory for the 8B MOSS-TTS profile (~18 GB talker + ~8 GB codec, roughly 26 GB in total).
  • Software: a vLLM-Omni build that targets vLLM >= 0.22.
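A quick way to check the hardware requirement is to compare the GPU's total memory with the sum of the two figures above. This is only a rough sketch: the 26 GB threshold is just ~18 GB + ~8 GB added together and ignores any extra runtime overhead, has_enough_memory is a hypothetical helper, and only the first GPU reported by nvidia-smi is inspected.

```shell
# Rough requirement from the profile above: ~18 GB talker + ~8 GB codec.
required_mib=$(( (18 + 8) * 1024 ))

# has_enough_memory MIB: hypothetical helper, succeeds if MIB >= required_mib.
has_enough_memory() {
  [ "$1" -ge "$required_mib" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Total memory of the first GPU, in MiB, without header or units.
  total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
  if has_enough_memory "$total_mib"; then
    echo "GPU 0: ${total_mib} MiB, looks sufficient"
  else
    echo "GPU 0: ${total_mib} MiB, likely too small"
  fi
else
  echo "nvidia-smi not found; cannot check GPU memory"
fi
```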

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
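After installing, it can be worth confirming that the vLLM in the environment meets the >= 0.22 requirement. The sketch below is one way to do that. version_ge is a hypothetical helper that relies on GNU sort -V, and the comparison is approximate for development or pre-release version strings.

```shell
# version_ge A B: succeeds when version A >= version B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Read the installed vLLM version; stays empty if vLLM is not importable.
installed=$(python -c "import vllm; print(vllm.__version__)" 2>/dev/null || true)

if [ -z "$installed" ]; then
  echo "vLLM is not importable in this environment"
elif version_ge "$installed" "0.22"; then
  echo "vLLM $installed satisfies >= 0.22"
else
  echo "vLLM $installed is older than 0.22"
fi
```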

Launch the server

MOSS-SoundEffect shares model_type=moss_tts_delay with other members of the MOSS-TTS Family, so the model type alone does not identify it. Pass its deploy config explicitly:

vllm serve OpenMOSS-Team/MOSS-SoundEffect --omni \
  --deploy-config vllm_omni/deploy/moss_sound_effect.yaml --port 8000
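Loading an 8B model can take a while, so scripts that send requests right after launch usually need to wait for the server. The sketch below polls an HTTP endpoint until it responds. It assumes vLLM-Omni exposes the same GET /health route as the stock vLLM OpenAI-compatible server. wait_for_server is a hypothetical helper, and its retry count and delay are arbitrary defaults.

```shell
# wait_for_server URL [TRIES] [DELAY]: poll URL until it answers with HTTP 2xx.
# Hypothetical helper; returns 0 once the URL responds, 1 after TRIES failures.
wait_for_server() {
  url=$1; tries=${2:-60}; delay=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # --fail makes curl exit non-zero on HTTP errors as well as refused connections.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the server from the command above on port 8000, /health assumed):
# wait_for_server http://localhost:8000/health && echo "server is ready"
```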

Client usage

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenMOSS-Team/MOSS-SoundEffect",
    "input": "Thunder rumbling, rain pattering on a tin roof.",
    "response_format": "wav"
  }' --output thunder.wav
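Because curl writes whatever the server returns to thunder.wav, a failed request can leave a JSON error body in the file. The sketch below is a minimal sanity check that reads the channel count and sample rate straight from the WAV header, which should be 1 and 24000 for this model. It assumes the canonical layout with the fmt chunk directly after the RIFF header and a little-endian host. The helper names are made up for illustration.

```shell
# Hypothetical helpers that inspect a canonical RIFF/WAV header with od.
# Assumes a little-endian host and the fmt chunk starting at byte offset 12.
is_riff_wav()     { [ "$(head -c 4 "$1")" = "RIFF" ]; }
wav_channels()    { od -An -t u2 -j 22 -N 2 "$1" | tr -d ' '; }   # bytes 22-23
wav_sample_rate() { od -An -t u4 -j 24 -N 4 "$1" | tr -d ' '; }   # bytes 24-27

f=thunder.wav
if [ -f "$f" ] && is_riff_wav "$f"; then
  echo "channels=$(wav_channels "$f") sample_rate=$(wav_sample_rate "$f")"
else
  echo "$f is missing or not a RIFF/WAV file (it may hold a JSON error body)"
fi
```

Adding --fail to the curl command is another option: curl then exits with a non-zero status on an HTTP error.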

Known limitations

  • Output is 24 kHz mono.
  • No ref_audio is accepted; input is a text description of the sound.

References