inclusionAI/Ring-2.6-1T
Ring-2.6-1T (BailingMoeV2_5) FP8 thinking model with 1T total / 50B active params, hybrid linear + MLA attention, 128K context
Overview
Ring-2.6-1T is inclusionAI's
BailingMoeV2_5 FP8 trillion-scale thinking model (1T total / 50B active
parameters) with hybrid linear + MLA attention and a 128K context window.
It is the reasoning-focused counterpart to Ling-2.6-1T
in the Ring 2.6 series and emits explicit <think>...</think> traces; the
chat template accepts a reasoning_effort field (default high).
vLLM v0.20.2 shipped the BailingMoeV2.5 MLA RoPE fix (vllm-project/vllm#41185) that is load-bearing for this architecture, so pin v0.20.2 or newer.
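Before serving, it can be worth confirming that the environment actually includes the fixed build. The check below is a minimal sketch; it assumes vLLM is importable in the serving environment and uses its packaging dependency, with the 0.20.2 floor mirroring the pin above.
from packaging.version import Version
import vllm

# Ring-2.6-1T relies on the BailingMoeV2.5 MLA RoPE fix; refuse older builds.
assert Version(vllm.__version__) >= Version("0.20.2"), (
    f"vLLM {vllm.__version__} is too old; install v0.20.2 or newer"
)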
Deployment Configurations
Docker (AMD MI300X / MI325X / MI355X, TP=8)
TP=8 fits the FP8 weights plus the model's full 128K context on an MI300X-class node; MI325X and MI355X have larger per-GPU HBM and correspondingly more KV-cache headroom.
docker run --rm -it \
--cap-add=SYS_PTRACE \
--ipc=host \
--privileged=true \
--shm-size=128GB \
--network=host \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
-e VLLM_ROCM_USE_AITER=1 \
vllm/vllm-openai-rocm:v0.20.2 \
inclusionAI/Ring-2.6-1T \
--tensor-parallel-size 8 \
--trust-remote-code
Pip (NVIDIA B300, TP=8)
uv venv && source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
vllm serve inclusionAI/Ring-2.6-1T \
--trust-remote-code \
--tensor-parallel-size 8
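Either deployment exposes an OpenAI-compatible API, by default on port 8000. A short readiness probe confirms the model has finished loading before traffic is sent; this sketch assumes the default host/port and no API key.
from openai import OpenAI

# Listing models only succeeds once the server is up and weights are loaded.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])  # expect ['inclusionAI/Ring-2.6-1T']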
Client Usage
Text Generation
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="inclusionAI/Ring-2.6-1T",
messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
max_tokens=4096,
temperature=0.6,
extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)
print(response.choices[0].message.content)
Set reasoning_effort to low / medium / high to control the depth of
thinking. The model wraps its reasoning trace in <think>...</think> before
emitting the final answer.
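If the application only needs the final answer, the trace can be split off client-side. The helper below is a minimal sketch that continues from the response object in the example above; the split_think name and regex are illustrative, not part of the model or the vLLM API.
import re

def split_think(text: str) -> tuple[str, str]:
    # Separate the <think>...</think> trace from the final answer text.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text
    return match.group(1).strip(), text[match.end():].lstrip()

thinking, answer = split_think(response.choices[0].message.content)
print(answer)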