inclusionAI/Ring-2.6-1T

Ring-2.6-1T (BailingMoeV2_5) FP8 thinking model with 1T total / 50B active params, hybrid linear + MLA attention, 128K context
Overview

Ring-2.6-1T is inclusionAI's BailingMoeV2_5 FP8 trillion-scale thinking model (1T total / 50B active parameters) with hybrid linear + MLA attention and a 128K context window. It is the reasoning-focused counterpart to Ling-2.6-1T in the Ring 2.6 series and emits explicit <think>...</think> traces; the chat template accepts a reasoning_effort field (default high).

The BailingMoeV2.5 MLA RoPE fix (vllm-project/vllm#41185) landed in vLLM 0.20.1 and is load-bearing for this architecture; pin v0.20.2 or newer.
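To verify the pin at runtime, a minimal sketch (not part of vLLM itself; it assumes plain X.Y.Z version strings and ignores pre-release suffixes):

```python
# Sanity-check the installed vLLM against the >= 0.20.2 pin.
from importlib.metadata import PackageNotFoundError, version

def version_tuple(v: str) -> tuple[int, ...]:
    # "0.20.2rc1" -> (0, 20, 2): keep only the leading digits of each component.
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits or "0"))
    return tuple(parts)

try:
    installed = version("vllm")
    assert version_tuple(installed) >= (0, 20, 2), f"vLLM {installed} is older than 0.20.2"
except PackageNotFoundError:
    pass  # vLLM not installed in this environment
```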

Deployment Configurations

Docker (AMD MI300X / MI325X / MI355X, TP=8)

TP=8 fits the model-derived 128K context on an MI300X-class node; MI325X and MI355X have larger per-GPU HBM and more headroom.

docker run --rm -it \
  --cap-add=SYS_PTRACE \
  --ipc=host \
  --privileged=true \
  --shm-size=128GB \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm/vllm-openai-rocm:v0.20.2 \
    inclusionAI/Ring-2.6-1T \
    --tensor-parallel-size 8 \
    --trust-remote-code

Pip (NVIDIA B300, TP=8)

uv venv && source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

vllm serve inclusionAI/Ring-2.6-1T \
  --trust-remote-code \
  --tensor-parallel-size 8
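Either launch spends a long time loading 1T of FP8 weights, so it helps to gate client traffic on server readiness. A small polling helper (hypothetical, not a vLLM API; assumes the default OpenAI-compatible endpoint on port 8000):

```python
# Poll the server's /v1/models endpoint until it answers, or give up after
# the deadline. Assumes the default http://localhost:8000 from the commands above.
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 1800.0, interval_s: float = 5.0) -> bool:
    """Return True once GET {base_url}/v1/models succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server still loading weights; retry after a short sleep
        time.sleep(interval_s)
    return False
```

For example, call wait_for_server("http://localhost:8000") before creating the client below.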

Client Usage

Text Generation

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="inclusionAI/Ring-2.6-1T",
    messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    max_tokens=4096,
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)
print(response.choices[0].message.content)

Set reasoning_effort to low / medium / high to control the depth of thinking. The model wraps its reasoning trace in <think>...</think> before emitting the final answer.
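If you need the trace and the final answer separately, a minimal parser for the <think>...</think> format described above (a sketch, not a vLLM or OpenAI API):

```python
# Split a completion into (reasoning, answer), assuming at most one
# <think>...</think> block precedes the final answer.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" if no <think> block exists."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()
```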

References