bosonai/higgs-audio-v3-tts-4b

Boson AI's ~4B Qwen3-backbone text-to-speech model served via vLLM-Omni — 24 kHz speech, 100+ languages, zero-shot voice cloning with inline emotion/style/prosody control tokens — through the OpenAI /v1/audio/speech API.


dense · 4B · 0 ctx · vLLM 0.22.0+ · vLLM-Omni · nightly · omni

Guide

Overview

Higgs Audio V3 TTS is Boson AI's multilingual text-to-speech model served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It generates 24 kHz speech, supports zero-shot voice cloning from a reference clip, and handles 100+ languages with inline control tokens for emotion, style, and prosody. The architecture is a ~4B Qwen3 backbone with a fused multi-codebook embedding/head (8 codebooks × 1026 vocab, MusicGen-style delay pattern).
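To make the "MusicGen-style delay pattern" concrete, here is a minimal sketch of the general technique: codebook k is shifted right by k steps, so each decoding step emits one token per codebook while higher codebooks lag behind lower ones. The PAD placeholder and both function names are hypothetical; how Higgs Audio V3 actually fills the shifted positions (for example with the extra ids in its 1026-entry vocabulary) is not stated here and is not assumed.

```python
from typing import List

PAD = -1  # hypothetical placeholder id, not the model's real padding token


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps: codes[k][t] moves to delayed[k][t + k]."""
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        # k pads in front, then the codes, then pads up to the common length.
        tail = total_steps - num_frames - k
        delayed.append([pad] * k + list(row) + [pad] * tail)
    return delayed


def revert_delay_pattern(delayed: List[List[int]], num_frames: int) -> List[List[int]]:
    """Undo the shift so that column t holds all codebooks of audio frame t again."""
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy example with 3 codebooks and 4 frames (the model itself uses 8 codebooks).
codes = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

With 8 codebooks and T audio frames, this layout needs T + 7 decoding steps, and the codec decoder only sees aligned frames after the shift is reverted.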

Prerequisites

Hardware: a single CUDA GPU. Reference run on 1x H100 80 GB: Stage 0 (talker, ~4B) uses ~60% of GPU memory and Stage 1 (codec decoder) uses ~25%.
vLLM-Omni targeting vLLM >= 0.22.
--trust-remote-code is required.

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git

Launch the server

The deploy config (vllm_omni/deploy/higgs_multimodal_qwen3.yaml) is auto-discovered from the model's model_type (higgs_multimodal_qwen3), so the launch command does not need to name it:

vllm serve bosonai/higgs-audio-v3-tts-4b --omni --trust-remote-code \
  --host 0.0.0.0 --port 8000

Client usage

Basic TTS (curl)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "bosonai/higgs-audio-v3-tts-4b", "input": "Hello, how are you?"}' \
  --output hello.wav
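The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: same endpoint, same two JSON fields. The function names build_speech_request and synthesize are illustrative, and the default base URL assumes the server was launched on localhost:8000 as shown earlier.

```python
import json
import urllib.request

DEFAULT_MODEL = "bosonai/higgs-audio-v3-tts-4b"


def build_speech_request(text, base_url="http://localhost:8000", model=DEFAULT_MODEL):
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With the server running, synthesize("Hello, how are you?", "hello.wav") should produce the same file as the curl command.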

Voice clone (Python client)

python examples/online_serving/text_to_speech/higgs_audio_v3/batch_speech_client.py \
  --base-url http://localhost:8000 \
  --model bosonai/higgs-audio-v3-tts-4b \
  --ref-audio path/to/reference.wav \
  --ref-text "Transcript of the reference." \
  --prompts "Text to clone."

Offline batch inference

python examples/offline_inference/text_to_speech/higgs_audio_v3/end2end.py \
  --texts "Hello world." "The quick brown fox jumps over the lazy dog." \
  --output-dir results/higgs_v3_wavs

For voice cloning, ref_audio accepts WAV/FLAC/MP3; ref_text (the transcript of the reference clip) is optional but improves fidelity.
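As a rough illustration of how ref_audio and ref_text might be attached to a /v1/audio/speech request body, the sketch below checks the clip's extension against the three accepted formats and embeds the audio as a base64 data URL. Passing ref_audio and ref_text as JSON fields, and encoding the clip as a data URL, are assumptions made for this sketch; the batch_speech_client.py example is the place to confirm the encoding the server actually expects. The helper name build_clone_payload is hypothetical.

```python
import base64
from pathlib import Path

# Formats the guide lists as accepted for ref_audio.
_MIME_BY_SUFFIX = {".wav": "audio/wav", ".flac": "audio/flac", ".mp3": "audio/mpeg"}


def build_clone_payload(text, ref_audio_path, ref_text=None,
                        model="bosonai/higgs-audio-v3-tts-4b"):
    """Build a request body dict for voice cloning (field layout is an assumption)."""
    path = Path(ref_audio_path)
    mime = _MIME_BY_SUFFIX.get(path.suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported reference format: {path.suffix!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    payload = {
        "model": model,
        "input": text,
        # Assumed encoding: the reference clip inlined as a base64 data URL.
        "ref_audio": f"data:{mime};base64,{encoded}",
    }
    if ref_text:
        # Optional transcript of the reference clip.
        payload["ref_text"] = ref_text
    return payload
```

Leaving ref_text out keeps the body minimal; including the transcript is the option the guide says improves fidelity.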

Known limitations

Output is 24 kHz mono WAV.
The deploy config runs both stages with enforce_eager: true (max_num_seqs=16). CUDA graph capture is available for Stage 0 and lowers RTF (real-time factor) by about 13% at batch=1, but it does not help at batch>1 because the model-owned sampler runs outside the graph. Stage 1 (code2wav) must stay eager because @torch.inference_mode is incompatible with graph capture.
Async (chunk-based) streaming is not yet implemented; the pipeline is sync-only.
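To check an output file against the stated format and to measure RTF for your own runs, a small sketch using Python's standard wave module follows. It assumes the response is a PCM WAV file with a complete header; describe_wav and real_time_factor are illustrative names, and RTF is taken in its usual sense of generation time divided by audio duration (lower is faster).

```python
import wave


def describe_wav(path):
    """Return the sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "sample_rate": sample_rate,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / sample_rate,
        }


def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return generation_seconds / audio_seconds


def is_expected_format(info):
    """True when the file matches the 24 kHz mono output described above."""
    return info["sample_rate"] == 24000 and info["channels"] == 1
```

Timing a request with time.perf_counter() and passing the elapsed seconds together with describe_wav(...)["duration_s"] gives the RTF of that request. A 13% reduction corresponds to multiplying the eager RTF by about 0.87.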