inclusionAI/Ring-1T-FP8
Ring-1T (BailingMoeV2) FP8 model (~1T total params) for 8xH200 or 8xMI300X deployment
Overview
Ring-1T-FP8 is inclusionAI's BailingMoeV2 FP8 model (~1T total parameters). This recipe covers pure tensor-parallel deployment across 8 GPUs on NVIDIA H200 or AMD MI300-series (MI300X/MI325X/MI355X) hardware.
Prerequisites
- Hardware: 8x H200 or 8x MI300X/MI325X/MI355X
- vLLM >= 0.11.0
Install vLLM (CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
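After installing (on either platform), a quick import check confirms the prerequisite version before attempting a launch. This is a minimal sanity-check sketch; it assumes the `packaging` library is available in the environment, which is typical alongside pip:

```python
# Verify the installed vLLM meets the >= 0.11.0 prerequisite above.
import vllm
from packaging.version import Version  # assumption: packaging is installed

assert Version(vllm.__version__) >= Version("0.11.0"), vllm.__version__
print("vLLM", vllm.__version__, "OK")
```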
Launch commands
8x H200 (FP8 KV cache):
vllm serve inclusionAI/Ring-1T-FP8 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.97 \
--max-num-seqs 32 \
--kv-cache-dtype fp8 \
--compilation-config '{"use_inductor": false}' \
--served-model-name Ring-1T-FP8
8x MI300X/MI325X/MI355X:
export VLLM_ROCM_USE_AITER=1
vllm serve inclusionAI/Ring-1T-FP8 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 32 \
--kv-cache-dtype fp8 \
--served-model-name Ring-1T-FP8
Tuning flags:
- --max-model-len=65536 works well for most scenarios.
- --max-num-batched-tokens=32768 for prompt-heavy workloads; 16384 or 8192 for lower latency.
- Reduce --gpu-memory-utilization below 0.97 if you hit OOM.
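To see why smaller token budgets favor latency: with chunked prefill, a long prompt is processed over roughly ceil(prompt_len / max_num_batched_tokens) engine steps, so a smaller budget means more steps per prompt but shorter stalls for decoding requests scheduled in between. A back-of-envelope sketch (the 60,000-token prompt is a hypothetical example near the 65536 context limit):

```python
# Rough count of prefill steps for one prompt under a given token budget.
# Smaller budgets -> more steps (lower throughput) but shorter per-step
# stalls for concurrent decode requests (lower latency).
import math

def prefill_steps(prompt_len: int, max_num_batched_tokens: int) -> int:
    return math.ceil(prompt_len / max_num_batched_tokens)

for budget in (32768, 16384, 8192):
    print(budget, "->", prefill_steps(60_000, budget), "steps")
```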
Client Usage
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Ring-1T-FP8",
"messages": [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
}'
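The server exposes an OpenAI-compatible API, so the official openai Python SDK also works as a client. A minimal sketch, assuming the openai package is installed and the server above is running on localhost:8000:

```python
# Minimal OpenAI-SDK client for the vLLM server launched above.
# Assumptions: `pip install openai`; server listening on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ring-1T-FP8",  # must match --served-model-name
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
)
print(resp.choices[0].message.content)
```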