ByteDance-Seed/Seed-OSS-36B-Instruct
ByteDance Seed-OSS 36B dense model with unique 'thinking budget' control and 512K context support
Overview
Seed-OSS-36B is a dense language model from ByteDance Seed with a unique 'thinking budget' feature for controlled reasoning and up to 512K context. It can be served with tensor parallelism (low latency) or data parallelism (high throughput).
Prerequisites
- Hardware: 8x GPUs recommended (TP=8); also runs on consumer hardware such as an RTX 3090
- vLLM >= 0.11.0 (support may require main branch)
- Latest transformers for compatibility
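The version floors above can be sanity-checked programmatically. A minimal sketch; `meets_minimum` is an illustrative helper, not part of vLLM or transformers:

```python
# Minimal sketch: compare installed package versions against the prerequisites above.
# meets_minimum is an illustrative helper, not part of vLLM or transformers.
from importlib.metadata import PackageNotFoundError, version

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted versions numerically, ignoring local/dev suffixes."""
    def parse(v: str) -> tuple:
        return tuple(int(x) for x in v.split("+")[0].split(".") if x.isdigit())[:3]
    return parse(installed) >= parse(minimum)

for pkg, floor in [("vllm", "0.11.0")]:
    try:
        print(pkg, version(pkg), "OK" if meets_minimum(version(pkg), floor) else "too old")
    except PackageNotFoundError:
        print(pkg, "not installed")
```

Note that dev builds from the main branch (e.g. `0.11.1.dev0+g...`) also pass the check, which matters here since support may require main.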
Install vLLM (NVIDIA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/huggingface/transformers.git@56d68c6706ee052b445e1e476056ed92ac5eb383
Install vLLM (AMD ROCm)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Launch commands
NVIDIA:
vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
--host localhost --port 8000 \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser seed_oss
AMD:
export VLLM_ROCM_USE_AITER=1
vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
--tensor-parallel-size 8 \
--enable-auto-tool-choice --tool-call-parser seed_oss \
--trust-remote-code
Tuning:
- --max-model-len=65536 works well; the maximum is 512K.
- --max-num-batched-tokens=32768 for prompt-heavy workloads; reduce to 8K–16K for lower latency.
- --gpu-memory-utilization=0.95 to maximize KV cache space.
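To see why --gpu-memory-utilization and --max-model-len interact, a back-of-envelope KV-cache estimate helps. The model-shape numbers below (layers, KV heads, head dim) are placeholder assumptions for illustration, not the published Seed-OSS-36B architecture:

```python
# Back-of-envelope KV-cache sizing. The model-shape numbers (layers, KV heads,
# head dim) are placeholder assumptions, NOT the published Seed-OSS-36B config.
def kv_cache_gib(num_tokens: int,
                 num_layers: int = 64,
                 num_kv_heads: int = 8,
                 head_dim: int = 128,
                 dtype_bytes: int = 2) -> float:
    # Factor of 2 covers keys and values; dtype_bytes=2 assumes an FP16/BF16 cache.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * num_tokens / 2**30

print(f"{kv_cache_gib(65536):.1f} GiB per 64K-token sequence (assumed shape)")
```

Under these assumed numbers a single 64K-token sequence already occupies double-digit GiB of cache, which is why raising --gpu-memory-utilization directly increases how many concurrent long sequences fit.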
Thinking Budget
Control the model's chain-of-thought length via chat_template_kwargs. Recommended
values are multiples of 512 (512, 1K, 2K, 4K, 8K, 16K). Use 0 for direct answers.
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Janet's ducks lay 16 eggs per day..."},
],
extra_body={"chat_template_kwargs": {"thinking_budget": 512}},
)
print(response.choices[0].message.content)
curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ByteDance-Seed/Seed-OSS-36B-Instruct",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"chat_template_kwargs": {"thinking_budget": 512}
}'
The model emits <seed:think> blocks with <seed:cot_budget_reflect> markers that
report token usage against the budget.
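Clients that only want the final answer can strip the reasoning block themselves. A minimal sketch; `split_reasoning` is an illustrative helper, not part of the OpenAI SDK:

```python
import re

# Separate <seed:think>...</seed:think> reasoning from the final answer.
# split_reasoning is an illustrative helper, not part of the OpenAI SDK.
THINK_RE = re.compile(r"<seed:think>(.*?)</seed:think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[list[str], str]:
    thoughts = THINK_RE.findall(text)          # captured reasoning segments
    answer = THINK_RE.sub("", text).strip()    # text with reasoning removed
    return thoughts, answer
```

For example, `split_reasoning(response.choices[0].message.content)` returns the list of reasoning segments alongside the cleaned answer text.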
Benchmarking
vllm bench serve \
--backend vllm --model ByteDance-Seed/Seed-OSS-36B-Instruct \
--endpoint /v1/completions --host localhost --port 8000 \
--dataset-name random --random-input-len 800 --random-output-len 100 \
--request-rate 2 --num-prompts 100