inclusionAI/Ming-omni-tts-0.5B
inclusionAI's dense 0.5B-LM text-to-speech model served via vLLM-Omni with style, dialect, voice-cloning and multi-speaker controls, through the OpenAI /v1/audio/speech API (44.1 kHz mono).
Overview
Ming-omni-tts-0.5B is
inclusionAI's dense text-to-speech model served through vLLM-Omni with the
OpenAI-compatible /v1/audio/speech API. It exposes style, dialect, voice
cloning, and multi-speaker controls and outputs 44.1 kHz mono audio. The
0.5B refers to the LM backbone; the full checkpoint is ~1.6B parameters
including the audio encoder and vocoder.
Prerequisites
- Hardware: a single GPU. The reference end-to-end run was on AMD
  gfx942 (ROCm 7.x) using the vllm/vllm-omni-rocm:v0.22.0 image; the
  NVIDIA CUDA path uses the same vllm serve command shown below.
- vLLM-Omni targeting vLLM >= 0.22.
- Ming's model_type falls through deploy-config auto-detection, so the
  launch command passes --deploy-config vllm_omni/deploy/ming_tts.yaml
  explicitly (the file ships in the vLLM-Omni repo/package).
- The tested environment uses --enforce-eager.
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
Launch the server
NVIDIA (CUDA)
vllm serve inclusionAI/Ming-omni-tts-0.5B \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--omni --enforce-eager --port 8000
AMD (ROCm) — reference-tested via the official image
docker run --rm \
--group-add=video --ipc=host --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd --device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v "$PWD":/app/vllm-omni -w /app/vllm-omni \
-e VLLM_ROCM_USE_AITER=0 \
-p 8000:8000 \
vllm/vllm-omni-rocm:v0.22.0 \
--model inclusionAI/Ming-omni-tts-0.5B \
--deploy-config vllm_omni/deploy/ming_tts.yaml \
--omni --port 8000 --enforce-eager
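Model loading can take a while after either launch command, so a script that drives the server may want to wait before sending requests. The following is a minimal sketch that polls the TCP port with the Python standard library. The function name wait_for_port and its default values are hypothetical, and an open port is only a proxy for readiness: the source does not say at which point the server starts accepting speech requests.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout_s=600.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it only checks that something is listening on the
    port, not that the model has finished loading.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

Calling wait_for_port(port=8000) before the first client request avoids connection-refused errors while the server is still starting.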
Client usage
Dialect control with a reference clip (e.g. Cantonese / 广粤话). Passing
--ref-audio corresponds to use_spk_emb=True in the upstream implementation;
do not pass --ref-text in the dialect case:
python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
--text "我觉得社会企业同个人都有责任" \
--instruction-json '{"方言":"广粤话"}' \
--ref-audio /path/to/yue_prompt.wav \
--max-new-tokens 200 \
--output dialect.wav
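For plain synthesis without the example client, the endpoint can also be called directly over HTTP. The sketch below uses only the standard library and the model and input fields defined by the OpenAI speech API. It assumes the server is on localhost:8000 and accepts response_format "wav"; neither is confirmed by this guide. The OpenAI API also defines a voice field, and whether this server requires it is not stated here. The dialect and reference-audio controls are left out because their request fields are not documented in this guide. The names build_speech_request and synthesize are hypothetical.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="inclusionAI/Ming-omni-tts-0.5B"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": model,
        "input": text,             # field name from the OpenAI speech API
        "response_format": "wav",  # assumption: the server accepts "wav"
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, timeout_s=120):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, synthesize("Hello from Ming.", "hello.wav") would write the response body to hello.wav.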
Known limitations
- Output is 44.1 kHz mono.
- The reference validation in the source recipe was on AMD
  gfx942 (ROCm); the CUDA command above is the equivalent path.
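The 44.1 kHz mono property can be checked on a generated file such as dialect.wav. This is a minimal sketch using Python's standard wave module. It assumes the output is a PCM WAV file, which the source does not confirm; wave.open raises wave.Error for encodings it cannot read. The helper names describe_wav and is_44k_mono are hypothetical.

```python
import wave


def describe_wav(path):
    """Read basic header fields from a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        return {
            "sample_rate": sample_rate,
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / sample_rate,
        }


def is_44k_mono(path):
    """Return True if the file is 44.1 kHz, single-channel audio."""
    info = describe_wav(path)
    return info["sample_rate"] == 44100 and info["channels"] == 1
```

For example, is_44k_mono("dialect.wav") should return True for output that matches the format described above.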