
inclusionAI/Ring-1T-FP8

Ring-1T (BailingMoeV2) FP8 model (~1T total params) for 8xH200 or 8xMI300X deployment

MoE · 1T total / 50B active params · 65,536 context length · vLLM 0.11.0+ · text

Overview

Ring-1T-FP8 is inclusionAI's BailingMoeV2 FP8 model (~1T total parameters). This recipe covers pure tensor-parallel deployment across 8 GPUs on NVIDIA H200 or AMD MI300X-series (MI300X/MI325X/MI355X) hardware.

Prerequisites

  • Hardware: 8x H200 or 8x MI300X/MI325X/MI355X
  • vLLM >= 0.11.0

Install vLLM (CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Install vLLM (AMD ROCm)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
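After either install, it can be worth confirming the environment meets the vLLM >= 0.11.0 requirement before launching. A minimal sketch assuming a POSIX shell; the hard-coded `installed` value stands in for the real query (`python -c 'import vllm; print(vllm.__version__)'`) so the comparison logic is shown without a live install:

```shell
# Version gate sketch: compare the installed vLLM version against the minimum.
# "installed" is hard-coded for illustration; in practice populate it with:
#   installed="$(python -c 'import vllm; print(vllm.__version__)')"
installed="0.11.0"
minimum="0.11.0"
# sort -V orders version strings; if the minimum sorts first (or equal), the install is new enough.
if [ "$(printf '%s\n' "$minimum" "$installed" | sort -V | head -n1)" = "$minimum" ]; then
  echo "vLLM $installed satisfies >= $minimum"
else
  echo "vLLM $installed is too old; need >= $minimum" >&2
fi
```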

Launch commands

8x H200 (FP8 KV cache):

vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --compilation-config '{"use_inductor": false}' \
  --served-model-name Ring-1T-FP8

8x MI300X/MI325X/MI355X:

export VLLM_ROCM_USE_AITER=1
vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --served-model-name Ring-1T-FP8

Tuning flags:

  • --max-model-len=65536 works well for most scenarios.
  • --max-num-batched-tokens=32768 for prompt-heavy workloads; 16384 or 8192 for lower latency.
  • Reduce --gpu-memory-utilization below 0.97 if you hit OOM.
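Putting these flags together, a fuller H200 launch might look like the sketch below. The specific values (0.95 memory utilization, 32768 batched tokens) are illustrative picks from the list above, not benchmarked settings; adjust per workload.

```shell
# Sketch: H200 launch with the tuning flags above applied.
# 0.95 backs off from 0.97 to leave OOM headroom; 32768 batched tokens suits prompt-heavy traffic.
vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 65536 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --compilation-config '{"use_inductor": false}' \
  --served-model-name Ring-1T-FP8
```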

Client Usage

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Ring-1T-FP8",
    "messages": [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
  }'
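The server answers in the OpenAI chat-completions schema. A minimal sketch for pulling the assistant text out of the response; the JSON here is a canned sample in that shape (an assumption for illustration), not real server output:

```shell
# Extract the assistant message from a chat-completions response.
# "response" is a canned sample payload; in practice capture it with:
#   response="$(curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '...')"
response='{"choices":[{"message":{"role":"assistant","content":"9.8 is greater than 9.11."}}]}'
printf '%s' "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```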

References