openbmb/VoxCPM2

OpenBMB's 2B native-AR text-to-speech model served via vLLM-Omni — 48 kHz mono, 30+ languages, zero-shot synthesis and reference-audio voice cloning — through the OpenAI /v1/audio/speech API.


dense · 2B · 0 ctx · vLLM 0.21.0+ · vLLM-Omni · nightly · omni


Overview

VoxCPM2 is OpenBMB's 2B-parameter native-AR text-to-speech model, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It emits 48 kHz mono audio, supports 30+ languages, and runs both zero-shot synthesis and reference-audio voice cloning. The pipeline is single-stage (MiniCPM4 base LM → FSQ → MiniCPM4 residual LM → LocDiT CFM solver → AudioVAE), so a single 24 GB consumer GPU is sufficient.

Prerequisites

Hardware: one CUDA GPU with ≥ 24 GB VRAM. Reference run: 1x RTX 4090 24 GB — ~4.9 GiB weights + ~15.2 GiB KV cache + ~2 GiB talker buffers ≈ 22 GiB / 24 GiB with the default voxcpm2.yaml (gpu_memory_utilization: 0.9, max_num_seqs: 4, enforce_eager: true).
vLLM-Omni targeting vLLM >= 0.21.
voxcpm >= 2.0, soundfile, httpx, and ninja on PATH.
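
The memory figures quoted for the reference run can be checked with a few lines of arithmetic. This is only a sketch of that sum: the variable names are illustrative, and the numbers are the approximate ones from the RTX 4090 run above, not measurements of any other GPU.

```python
# Approximate figures from the RTX 4090 reference run (GiB).
weights_gib = 4.9
kv_cache_gib = 15.2
talker_buffers_gib = 2.0

# Total footprint with the default voxcpm2.yaml settings.
total_gib = weights_gib + kv_cache_gib + talker_buffers_gib

# Treats the card's 24 GB as 24 GiB, as the text above does.
headroom_gib = 24.0 - total_gib

print(f"estimated footprint: {total_gib:.1f} GiB, headroom: {headroom_gib:.1f} GiB")
```

The KV cache is by far the largest term in this sum.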

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
uv pip install voxcpm soundfile httpx ninja

Launch the server

The deploy config (vllm_omni/deploy/voxcpm2.yaml) is auto-discovered from model_type=voxcpm2 (no --trust-remote-code needed):

vllm serve openbmb/VoxCPM2 --omni --host 0.0.0.0 --port 8000

On a shared GPU, pass --gpu-memory-utilization 0.75 (or lower) if startup fails the free-memory check. Cold start is ~60 s (subprocess fork + vLLM init + model load + flashinfer JIT + torch.compile + the talker's own CUDA-Graph capture over the CFM solver / AudioVAE).
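
Because cold start takes on the order of a minute, a client script may want to wait for the server before sending its first request. The sketch below is one way to do that with the standard library. It assumes the server answers GET /health with HTTP 200 once it is ready, as the stock vLLM OpenAI-compatible server does; whether vLLM-Omni exposes the same route is not confirmed by this guide. The helper names http_probe and wait_until_ready are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if url answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, deadline_s=180.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() every interval_s seconds until it returns True or the deadline passes."""
    start = clock()
    while clock() - start < deadline_s:
        if probe():
            return True
        sleep(interval_s)
    return False
```

A typical call would be wait_until_ready(lambda: http_probe("http://127.0.0.1:8000/health")), with the URL adjusted if the health route differs. The default deadline of 180 s is an arbitrary margin over the ~60 s cold start.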

Client usage

Zero-shot synthesis (curl, WAV)

curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "openbmb/VoxCPM2", "input": "Hello, this is VoxCPM2.", "voice": "default", "response_format": "wav"}' \
  --output output.wav
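
The same request can be issued from Python. The sketch below mirrors the curl call field for field and uses only the standard library, so it runs without extra packages. The function names build_speech_request and synthesize are illustrative and not part of vLLM-Omni, and the 300 s timeout is an arbitrary choice.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8000"  # host and port from the launch command


def build_speech_request(text, api_base=API_BASE, response_format="wav"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": "openbmb/VoxCPM2",
        "input": text,
        "voice": "default",
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{api_base}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", timeout=300.0):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Calling synthesize("Hello, this is VoxCPM2.") against a running server should leave a WAV file at output.wav, the same result as the curl command.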

Voice cloning from a reference WAV (Python client)

python examples/online_serving/text_to_speech/voxcpm2/openai_speech_client.py \
  --text "This should sound like the reference speaker." \
  --ref-audio /path/to/reference.wav \
  --api-base http://127.0.0.1:8000 \
  --output cloned.wav

--ref-audio accepts local paths (auto-base64), HTTP(S) URLs, or data:audio/wav;base64,... URIs. The OpenAI voice field is required by the schema but ignored unless it names an uploaded voice — cloning is driven entirely by ref_audio.
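
The three accepted input forms can be normalized with a small helper. This is a sketch of the behavior described above, not the code of openai_speech_client.py. It assumes ref_audio is sent as a top-level string field of the JSON body, and the names to_ref_audio and build_clone_payload are hypothetical.

```python
import base64
from pathlib import Path


def to_ref_audio(value):
    """Return a string for the ref_audio field.

    HTTP(S) URLs and data URIs pass through unchanged; anything else is
    treated as a local WAV path and base64-encoded into a data URI.
    """
    if value.startswith(("http://", "https://", "data:")):
        return value
    raw = Path(value).read_bytes()
    return "data:audio/wav;base64," + base64.b64encode(raw).decode("ascii")


def build_clone_payload(text, ref_audio):
    """Build the JSON body for a voice-cloning request."""
    return {
        "model": "openbmb/VoxCPM2",
        "input": text,
        "voice": "default",  # required by the schema; cloning is driven by ref_audio
        "ref_audio": to_ref_audio(ref_audio),  # assumed top-level field
        "response_format": "wav",
    }
```

The resulting dictionary is POSTed as JSON to /v1/audio/speech, just like the zero-shot curl request.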

Streaming PCM (48 kHz s16le)

curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "openbmb/VoxCPM2", "input": "Streaming PCM test.", "voice": "default", "stream": true, "response_format": "pcm"}' \
  --no-buffer | play -t raw -r 48000 -e signed -b 16 -c 1 -

The player sample rate is 48 kHz (-r 48000), not 24 kHz.
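
Because the streamed body is headerless PCM, a client that wants a playable file has to add the WAV header itself. The sketch below does this with the standard-library wave module, assuming the stream has already been collected into a bytes object. The function names are illustrative.

```python
import io
import wave

SAMPLE_RATE = 48_000  # VoxCPM2 output rate
SAMPLE_WIDTH = 2      # s16le: 2 bytes per sample
CHANNELS = 1          # mono


def pcm_to_wav_bytes(pcm):
    """Wrap raw 48 kHz s16le mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(pcm):
    """Duration of a raw PCM buffer at the rate and width above."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

Writing the header with a 24 kHz rate instead would make the clip play at half speed, which is the same mistake as passing the wrong -r value to the player.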

Performance

After warmup, the steady-state real-time factor (RTF, synthesis time divided by audio duration) is ~0.12, or ~8x real-time on a single RTX 4090. The first 1–2 requests pay a one-time warmup cost (torch.compile + CUDA-Graph capture), so keep a server warm for interactive use. Voice cloning adds only about 0.017 RTF over zero-shot, the cost of the per-request AudioVAE.encode over the reference clip.
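
The relation between RTF and the "x real-time" figure is a simple reciprocal. A short sketch follows, using the numbers quoted above; the function names are illustrative, and the 10-second clip is an example value rather than a measurement.

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: time spent synthesizing per second of audio produced."""
    return synthesis_seconds / audio_seconds


def realtime_speedup(rtf_value):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf_value


zero_shot_rtf = 0.12
cloning_rtf = zero_shot_rtf + 0.017  # cloning overhead quoted above

print(f"zero-shot: {realtime_speedup(zero_shot_rtf):.1f}x real-time")
print(f"cloning:   {realtime_speedup(cloning_rtf):.1f}x real-time")
print(f"10 s clip: {10.0 * zero_shot_rtf:.1f} s of synthesis at RTF 0.12")
```

By this arithmetic, cloning lowers throughput from about 8.3x to about 7.3x real-time.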

Known limitations

Output is 48 kHz mono.
The default config sets enforce_eager: true; vLLM engine-level CUDA graphs are off by design (the talker captures its own CFM/AudioVAE graph instead).