inclusionAI/Ming-omni-tts-0.5B

inclusionAI's dense 0.5B-LM text-to-speech model served via vLLM-Omni with style, dialect, voice-cloning and multi-speaker controls, through the OpenAI /v1/audio/speech API (44.1 kHz mono).

Overview

Ming-omni-tts-0.5B is inclusionAI's dense text-to-speech model. vLLM-Omni serves it behind the OpenAI-compatible /v1/audio/speech API, where it exposes style, dialect, voice-cloning, and multi-speaker controls and returns 44.1 kHz mono audio. The "0.5B" in the name refers to the LM backbone only; the full checkpoint is ~1.6B parameters once the audio encoder and vocoder are included.

Prerequisites

  • Hardware: a single GPU. The reference end-to-end run was on AMD gfx942 (ROCm 7.x) using the vllm/vllm-omni-rocm:v0.22.0 image; the NVIDIA CUDA path uses the same vllm serve command shown below.
  • Software: a vLLM-Omni build targeting vLLM >= 0.22.
  • Deploy config: Ming's model_type is not matched by deploy-config auto-detection, so the launch command passes --deploy-config vllm_omni/deploy/ming_tts.yaml explicitly (the file ships in the vLLM-Omni repo/package).
  • Execution mode: the tested environment uses --enforce-eager.

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

NVIDIA (CUDA)

vllm serve inclusionAI/Ming-omni-tts-0.5B \
  --deploy-config vllm_omni/deploy/ming_tts.yaml \
  --omni --enforce-eager --port 8000

AMD (ROCm), reference-tested via the official image

docker run --rm \
  --group-add=video --ipc=host --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PWD":/app/vllm-omni -w /app/vllm-omni \
  -e VLLM_ROCM_USE_AITER=0 \
  -p 8000:8000 \
  vllm/vllm-omni-rocm:v0.22.0 \
  --model inclusionAI/Ming-omni-tts-0.5B \
  --deploy-config vllm_omni/deploy/ming_tts.yaml \
  --omni --port 8000 --enforce-eager
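
With either launch command, the server only answers requests once the model has loaded, so a client script may want to wait for it first. The following is a minimal sketch of such a readiness poll, not part of the recipe. It assumes the server exposes the OpenAI-style GET /v1/models route on the port used above; the helper name wait_until_ready and its parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=300.0, interval_s=2.0,
                     opener=urllib.request.urlopen):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    `opener` is injectable so the loop can be exercised without a server.
    Returns True once a 200 is seen, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an HTTP error status: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example (assumes the server from the launch command is on localhost:8000
# and serves the OpenAI-style model listing route):
# wait_until_ready("http://localhost:8000/v1/models")
```

The opener argument exists only to keep the retry loop testable; in normal use the default urllib.request.urlopen is enough.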

Client usage

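Dialect control with a reference clip (for example Cantonese, 广粤话). In the instruction JSON, the key 方言 means "dialect". Passing --ref-audio corresponds to upstream's use_spk_emb=True; do not add --ref-text in the dialect case. The script path is relative to a vLLM-Omni repository checkout: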

python examples/online_serving/text_to_speech/ming_tts/openai_speech_client.py \
  --text "我觉得社会企业同个人都有责任" \
  --instruction-json '{"方言":"广粤话"}' \
  --ref-audio /path/to/yue_prompt.wav \
  --max-new-tokens 200 \
  --output dialect.wav
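
For callers that cannot use the example script, a plain request to the endpoint can be sketched with the Python standard library. This is an illustration under stated assumptions, not the recipe's client: it sends only the generic OpenAI speech fields (model, input, response_format) and assumes the server accepts them as-is and can return WAV. The request fields behind --instruction-json and --ref-audio are not shown in this guide, so they are deliberately left out; the function names are hypothetical.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="inclusionAI/Ming-omni-tts-0.5B",
                         response_format="wav"):
    """Build a POST request for the OpenAI-style /v1/audio/speech endpoint.

    Only generic OpenAI fields are sent. Ming-specific controls (dialect,
    reference audio) are NOT included because their field names are not
    given in this guide.
    """
    payload = {"model": model, "input": text, "response_format": response_format}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to `out_path`."""
    req = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)  # number of bytes written


# Example (requires a running server on localhost:8000):
# synthesize("我觉得社会企业同个人都有责任", "plain.wav")
```

Separating request construction from the network call keeps the payload easy to inspect before anything is sent.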

Known limitations

  • Output is 44.1 kHz mono.
  • The reference validation in the source recipe was on AMD gfx942 (ROCm); the CUDA command above is the equivalent path.
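
The stated output format can be checked on a generated file. The sketch below reads the WAV header with Python's standard wave module and compares it against 44.1 kHz mono. It assumes the file is PCM-encoded WAV, which is the only kind the wave module reads, and that the header carries a valid frame count; the helper names are hypothetical.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        # Duration is derived from the frame count stored in the header.
        duration = wf.getnframes() / float(rate)
    return rate, channels, duration


def is_44k_mono(path):
    """True if the file is 44.1 kHz, single-channel audio."""
    rate, channels, _ = describe_wav(path)
    return rate == 44100 and channels == 1


# Example, using the output file name from the client command:
# print(describe_wav("dialect.wav"), is_44k_mono("dialect.wav"))
```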

References