fishaudio/s2-pro
Fish Audio's dual-AR text-to-speech model served via vLLM-Omni, producing 44.1 kHz mono audio with optional voice cloning, through the OpenAI /v1/audio/speech API.
Overview
Fish Speech S2 Pro is Fish Audio's
dual-AR text-to-speech model, served through vLLM-Omni with the
OpenAI-compatible /v1/audio/speech API. It produces 44.1 kHz mono audio
and supports zero-shot voice cloning from a reference clip and its transcript.
Prerequisites
- Hardware: a single CUDA GPU. Reference run: 1x A800 80 GB; the model loads at ~48.3 GiB and peaks at ~48.9 GiB during inference. A two-GPU profile (fish_qwen3_omni_2gpu.yaml) is available for higher concurrency.
- vLLM-Omni targeting vLLM >= 0.19.
- fish-speech for the DAC codec. It depends on pyaudio, which needs PortAudio system libraries:
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y libportaudio2 portaudio19-dev
Installation
uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
uv pip install fish-speech
Kvcache attention fast path
S2 Pro uses a Triton decode-only kvcache-attention fast path by default on
CUDA builds. Toggle it with VLLM_OMNI_FISH_KVCACHE_ATTN:
# Verify availability
python -c "from vllm_omni.attention import fish_kvcache_attn; print(fish_kvcache_attn.is_available()); print(fish_kvcache_attn.load_error())"
export VLLM_OMNI_FISH_KVCACHE_ATTN=0 # disable the fast path
# VLLM_OMNI_FISH_KVCACHE_ATTN=required # fail fast if it can't load
Launch the server
The deploy config (vllm_omni/deploy/fish_qwen3_omni.yaml) is auto-discovered
from model_type=fish_qwen3_omni:
# Single GPU
vllm serve fishaudio/s2-pro --omni --port 8000
# Two GPUs (Stage0 Slow/Fast AR on GPU 0, Stage1 DAC decoder on GPU 1)
CUDA_VISIBLE_DEVICES=0,1 vllm serve fishaudio/s2-pro --omni --port 8000 \
--deploy-config vllm_omni/deploy/fish_qwen3_omni_2gpu.yaml
Client usage
Basic TTS
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "fishaudio/s2-pro", "input": "Hello, how are you?", "voice": "default", "response_format": "wav"}' \
--output output.wav
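The same basic request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_speech_request` and `synthesize` and the 300-second timeout are illustrative assumptions, while the URL, headers, and JSON fields mirror the curl command above.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="fishaudio/s2-pro", voice="default",
                         response_format="wav"):
    """Build a POST request for /v1/audio/speech (hypothetical helper)."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", timeout=300):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(text)
    # The timeout is an assumed value; tune it for long inputs.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the server running, `synthesize("Hello, how are you?")` should write `output.wav` in the current directory. Building the request separately from sending it keeps the payload easy to inspect without a live server.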
Voice cloning (reference audio + transcript)
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "fishaudio/s2-pro",
"input": "Hello, this is a cloned voice.",
"voice": "default",
"ref_audio": "https://example.com/reference.wav",
"ref_text": "Transcript of the reference audio."
}' --output cloned.wav
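For scripted use, the cloning payload can be assembled and checked before it is sent. This is a sketch under stated assumptions: `build_clone_payload` and `encode_body` are hypothetical helpers, and the client-side check only reflects the requirement that `ref_audio` and `ref_text` are supplied together. How the server itself responds to a missing field is not covered here.

```python
import json


def build_clone_payload(text, ref_audio, ref_text,
                        model="fishaudio/s2-pro", voice="default"):
    """Return the JSON body for a voice-cloning request (hypothetical helper)."""
    # Client-side guard: cloning needs the reference clip and its transcript.
    if not ref_audio or not ref_text:
        raise ValueError("voice cloning needs both ref_audio and ref_text")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "ref_audio": ref_audio,
        "ref_text": ref_text,
    }


def encode_body(payload):
    """Serialize the payload as UTF-8 JSON bytes for an HTTP POST body."""
    return json.dumps(payload).encode("utf-8")
```

The encoded bytes can be posted to `/v1/audio/speech` with a `Content-Type: application/json` header, exactly as in the curl command above.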
Known limitations
- Output is 44.1 kHz mono WAV.
- Voice cloning requires both ref_audio and ref_text.
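To confirm that a returned file matches the stated output format, the WAV header can be inspected with Python's standard `wave` module. This is a minimal sketch: `describe_wav` and `is_expected_format` are hypothetical helper names, and it assumes the response is a complete PCM WAV file. The `wave` module raises `wave.Error` on encodings it does not support, and a streamed response may carry a placeholder frame count, so treat the reported duration as indicative only.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read sample rate, channel count, and duration from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as w:
        rate = w.getframerate()
        return {
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration_s": w.getnframes() / rate,
        }


def is_expected_format(data: bytes, sample_rate=44100, channels=1) -> bool:
    """Check the header against 44.1 kHz mono."""
    info = describe_wav(data)
    return info["sample_rate"] == sample_rate and info["channels"] == channels
```

For example, `is_expected_format(open("output.wav", "rb").read())` checks a file saved by one of the requests above.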
References
- Model card
- vLLM-Omni Fish Speech example
- Benchmarks: vllm-project/vllm-omni #2515 (H100x2), #3323 (H20x2)