openai/gpt-oss-120b
OpenAI's gpt-oss family (20B / 120B) with MXFP4 MoE, attention-sinks, built-in tools via Responses API
Overview
gpt-oss-20b and gpt-oss-120b are open-weight reasoning models from OpenAI. vLLM
supports NVIDIA H100/H200/B200, AMD MI300X/MI325X/MI355X, and Radeon AI PRO R9700,
with ongoing work for Ampere/Ada/RTX 5090.
Optimizations:
- Flexible parallelism (TP 2/4/8)
- Attention kernels for attention-sinks and sliding-window shapes
- Asynchronous scheduling for CPU/GPU overlap
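As a rough illustration of why attention sinks need dedicated kernel support: each head carries an extra learned "sink" logit that competes in the softmax but contributes no value vector, so the weights over real keys sum to less than one. A simplified single-head numpy sketch (sink value and scores are made-up numbers):

```python
import numpy as np

def sink_softmax(scores, sink_logit):
    # Append the learned sink logit, softmax, then drop the sink slot:
    # the sink absorbs probability mass without contributing a value vector.
    logits = np.concatenate([scores, [sink_logit]])
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    return p[:-1]  # weights over the real keys; they sum to < 1

scores = np.array([2.0, 1.0, 0.5])
w = sink_softmax(scores, sink_logit=1.5)
print(w.sum() < 1.0)  # True: the sink kept part of the mass
```

Standard fused-attention kernels assume the weights normalize to exactly one, which is why vLLM ships modified kernels for these shapes.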
Prerequisites
- Hardware: NVIDIA H100/H200/B200 (or A100 80GB for single-GPU), AMD MI300+
- vLLM >= 0.10.0
- CUDA >= 12.8 if building from source (must match between install and serving)
Install vLLM
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
Docker quickstart:
docker run --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai --model openai/gpt-oss-20b
AMD ROCm wheels:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Launch commands
A100 (single 80GB card, default TRITON_ATTN attention + Marlin MXFP4 MoE):
vllm serve openai/gpt-oss-120b
Or across multiple GPUs:
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4
Blackwell (B200) with FlashInfer MXFP4+MXFP8 MoE:
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# GPT-OSS_Blackwell.yaml
# kv-cache-dtype: fp8
# no-enable-prefix-caching: true
# max-cudagraph-capture-size: 2048
# max-num-batched-tokens: 8192
# stream-interval: 20
vllm serve openai/gpt-oss-120b --config GPT-OSS_Blackwell.yaml --tensor-parallel-size 1
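The commented keys above can be materialized as the YAML file the serve command expects, for example:

```shell
# Write the Blackwell serving options to the config file referenced above.
cat > GPT-OSS_Blackwell.yaml <<'EOF'
kv-cache-dtype: fp8
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
stream-interval: 20
EOF
```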
Hopper (H100/H200): same as Blackwell without kv-cache-dtype and without the env var.
AMD MI300X/MI325X:
export HSA_NO_SCRATCH_RECLAIM=1
export AMDGCN_USE_BUFFER_OPS=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 8 \
--attention-backend ROCM_AITER_UNIFIED_ATTN \
-cc.pass_config.fuse_rope_kvcache=True \
-cc.use_inductor_graph_partition=True \
--gpu-memory-utilization 0.95 \
--block-size 64
Tool Use
The /v1/responses endpoint supports built-in tools (browsing, python, MCP). Setup
requires uv pip install gpt-oss and either Docker (for the Python sandbox) or
PYTHON_EXECUTION_BACKEND=dangerously_use_uv. To enable the demo tools:
vllm serve ... --tool-server demo
For user-defined function calling:
vllm serve ... --tool-call-parser openai --enable-auto-tool-choice
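A function-calling request against a server launched with the flags above might look like the following sketch. The get_weather schema is a made-up example; any OpenAI-style tool definition works. Since sending it requires a running server, we only construct and print the payload here:

```python
import json

# Hypothetical tool schema, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # only "auto" is currently supported
}
print(json.dumps(payload, indent=2))
# Send with: client.chat.completions.create(**payload)
```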
Client Usage
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
model="openai/gpt-oss-120b",
messages=[{"role": "user", "content": "Explain attention sinks."}],
)
print(response.choices[0].message.content)
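The built-in tools mentioned earlier are served via /v1/responses rather than /v1/chat/completions. A minimal request-body sketch is shown below; since it needs a running server, we only construct and print the payload (the reasoning.effort field follows the OpenAI Responses API convention, and the prompt text is an example):

```python
import json

# Request body for POST /v1/responses (construct only; a live
# vllm serve instance is required to actually send it).
body = {
    "model": "openai/gpt-oss-120b",
    "input": "Explain attention sinks in two sentences.",
    "reasoning": {"effort": "low"},  # low / medium / high
}
print(json.dumps(body, indent=2))
# With the openai SDK: client.responses.create(**body)
```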
Accuracy Evaluation
OpenAI recommends evaluating with the gpt-oss reference library:
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 8 --max-model-len 131072 \
--max-num-batched-tokens 10240 --max-num-seqs 128 \
--gpu-memory-utilization 0.85 --no-enable-prefix-caching
mkdir -p /tmp/gpqa_openai
OPENAI_API_KEY=empty python -m gpt_oss.evals \
--model openai/gpt-oss-120b --eval gpqa --n-threads 128
Reproduced scores for gpt-oss-120b, reported as GPQA / AIME25 per reasoning effort: Low 65.3 / 51.2; Mid 72.4 / 79.6; High 79.4 / 93.0.
Troubleshooting
- Attention sinks dtype error on Blackwell: ensure the environment variables above are set.
- "tl.language not defined": make sure no extra Triton (e.g., pytorch-triton) is installed.
- H100 TP1 OOM: use --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024.
- Harmony vocab download failure: pre-download the tiktoken files and set TIKTOKEN_ENCODINGS_BASE.
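For the last item, a possible offline setup (the directory path is an example; the download step must be run once on a machine with network access):

```shell
# Host the tiktoken encoding files locally so the Harmony renderer
# does not need network access at server startup.
mkdir -p /tmp/tiktoken_encodings
# Download once where network access is available, e.g.:
# curl -L -o /tmp/tiktoken_encodings/o200k_base.tiktoken \
#   https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
export TIKTOKEN_ENCODINGS_BASE=/tmp/tiktoken_encodings
echo "$TIKTOKEN_ENCODINGS_BASE"
```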
Known Limitations
- Responses API: streaming is basic, annotations/citations unsupported, usage accounting returns zeros.
- Function calling currently supports only tool_choice="auto".