fishaudio/s2-pro

Fish Audio's dual-AR text-to-speech model served via vLLM-Omni, producing 44.1 kHz mono audio with optional voice cloning, through the OpenAI /v1/audio/speech API.


dense | 4.6B | 0 ctx | vLLM 0.19.0+ | vLLM-Omni | nightly | omni


Overview

Fish Speech S2 Pro is Fish Audio's dual-AR text-to-speech model, served through vLLM-Omni with the OpenAI-compatible /v1/audio/speech API. It produces 44.1 kHz mono audio and supports zero-shot voice cloning from a reference clip and its transcript.

Prerequisites

Hardware: a single CUDA GPU. Reference run: 1x A800 80 GB, where the model loads at ~48.3 GiB and peaks at ~48.9 GiB during inference. A two-GPU profile (fish_qwen3_omni_2gpu.yaml) is available for higher concurrency.
Software: vLLM-Omni targeting vLLM >= 0.19.

fish-speech, which provides the DAC codec. It depends on pyaudio, which needs the PortAudio system libraries:

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y libportaudio2 portaudio19-dev
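
Before installing fish-speech, it can help to confirm that Python can see the PortAudio shared library. The following is a minimal sketch using only the standard library. The helper name portaudio_library is illustrative, and a None result only means the loader search did not find the library, not that pyaudio will necessarily fail to build.

```python
from ctypes.util import find_library


def portaudio_library():
    """Return the PortAudio library name found by the loader search, or None."""
    return find_library("portaudio")


if __name__ == "__main__":
    lib = portaudio_library()
    if lib:
        print(f"PortAudio found: {lib}")
    else:
        print("PortAudio not found; install libportaudio2 and portaudio19-dev")
```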

Installation

uv venv && source .venv/bin/activate
uv pip install git+https://github.com/vllm-project/vllm-omni.git
uv pip install fish-speech

KV-cache attention fast path

S2 Pro uses a Triton decode-only KV-cache attention fast path by default on CUDA builds. Toggle it with VLLM_OMNI_FISH_KVCACHE_ATTN:

# Verify availability
python -c "from vllm_omni.attention import fish_kvcache_attn; print(fish_kvcache_attn.is_available()); print(fish_kvcache_attn.load_error())"

export VLLM_OMNI_FISH_KVCACHE_ATTN=0          # disable the fast path
# VLLM_OMNI_FISH_KVCACHE_ATTN=required        # fail fast if it can't load
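
A launcher script may want to log which mode the toggle is in before starting the server. The sketch below classifies the variable using only the values shown above (unset, 0, required). The function name fast_path_mode and the "unrecognized" label are assumptions for illustration; this guide does not say how vLLM-Omni treats other values.

```python
import os


def fast_path_mode(environ=None):
    """Classify VLLM_OMNI_FISH_KVCACHE_ATTN using only the documented values."""
    env = os.environ if environ is None else environ
    value = env.get("VLLM_OMNI_FISH_KVCACHE_ATTN")
    if value is None:
        return "default"       # unset: fast path is on by default on CUDA builds
    if value == "0":
        return "disabled"      # fast path explicitly turned off
    if value == "required":
        return "required"      # server should fail fast if the kernel cannot load
    return "unrecognized"      # not covered by this guide


if __name__ == "__main__":
    print(f"kvcache-attention fast path mode: {fast_path_mode()}")
```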

Launch the server

The deploy config (vllm_omni/deploy/fish_qwen3_omni.yaml) is auto-discovered from model_type=fish_qwen3_omni:

# Single GPU
vllm serve fishaudio/s2-pro --omni --port 8000

# Two GPUs (Stage0 Slow/Fast AR on GPU 0, Stage1 DAC decoder on GPU 1)
CUDA_VISIBLE_DEVICES=0,1 vllm serve fishaudio/s2-pro --omni --port 8000 \
  --deploy-config vllm_omni/deploy/fish_qwen3_omni_2gpu.yaml
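
When the server is started from a script rather than a shell, the same two invocations can be assembled as argument lists. This is a sketch only: build_serve_command is a hypothetical helper, and it uses just the flags shown above (--omni, --port, --deploy-config).

```python
import shlex


def build_serve_command(model="fishaudio/s2-pro", port=8000, deploy_config=None):
    """Assemble the `vllm serve` argument list for the single- or two-GPU profile."""
    cmd = ["vllm", "serve", model, "--omni", "--port", str(port)]
    if deploy_config is not None:
        # Only passed for non-default profiles such as the two-GPU one.
        cmd += ["--deploy-config", deploy_config]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_serve_command()))
    print(shlex.join(build_serve_command(
        deploy_config="vllm_omni/deploy/fish_qwen3_omni_2gpu.yaml")))
```

The returned list can be handed to subprocess.Popen; for the two-GPU profile, set CUDA_VISIBLE_DEVICES=0,1 in the env argument, as in the shell example.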

Client usage

Basic TTS

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "fishaudio/s2-pro", "input": "Hello, how are you?", "voice": "default", "response_format": "wav"}' \
  --output output.wav
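
The same request can be sent from Python with only the standard library. This is a minimal sketch mirroring the curl call above; the helper names build_speech_request and synthesize, and the 300-second timeout, are assumptions chosen for illustration.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="fishaudio/s2-pro", voice="default",
                         response_format="wav"):
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="output.wav", timeout=300):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


if __name__ == "__main__":
    print(synthesize("Hello, how are you?"))
```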

Voice cloning (reference audio + transcript)

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fishaudio/s2-pro",
    "input": "Hello, this is a cloned voice.",
    "voice": "default",
    "ref_audio": "https://example.com/reference.wav",
    "ref_text": "Transcript of the reference audio."
  }' --output cloned.wav
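
Because cloning needs the reference clip and its transcript together, a client can reject half-specified requests before they reach the server. The sketch below builds the JSON body for the request above; build_clone_payload is a hypothetical helper, and the client-side check is an illustration rather than the server's own validation.

```python
import json


def build_clone_payload(text, ref_audio=None, ref_text=None,
                        model="fishaudio/s2-pro", voice="default"):
    """Build the /v1/audio/speech body, requiring ref_audio and ref_text together."""
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("voice cloning requires both ref_audio and ref_text")
    payload = {"model": model, "input": text, "voice": voice}
    if ref_audio is not None:
        payload["ref_audio"] = ref_audio
        payload["ref_text"] = ref_text
    return payload


if __name__ == "__main__":
    body = build_clone_payload(
        "Hello, this is a cloned voice.",
        ref_audio="https://example.com/reference.wav",
        ref_text="Transcript of the reference audio.",
    )
    print(json.dumps(body, indent=2))
```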

Known limitations

Output is 44.1 kHz mono WAV.
Voice cloning requires both ref_audio and ref_text.
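
To confirm that a saved file matches the stated format, the header can be inspected with Python's wave module. This sketch assumes the server returns PCM-encoded WAV, since the wave module cannot read every WAV encoding; check_wav is an illustrative name.

```python
import wave


def check_wav(path, expected_rate=44100, expected_channels=1):
    """Verify sample rate and channel count; return the duration in seconds."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    if rate != expected_rate or channels != expected_channels:
        raise ValueError(
            f"expected {expected_rate} Hz / {expected_channels} ch, "
            f"got {rate} Hz / {channels} ch"
        )
    return frames / rate


if __name__ == "__main__":
    print(f"output.wav: {check_wav('output.wav'):.2f} s of 44.1 kHz mono audio")
```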

References

Model card
vLLM-Omni Fish Speech example
Benchmarks: vllm-project/vllm-omni #2515 (H100x2), #3323 (H20x2)