deepseek-ai/DeepSeek-R1
DeepSeek-R1 is a 671B-parameter MoE reasoning model built on the DeepSeek-V3 architecture, trained with large-scale reinforcement learning for strong chain-of-thought capabilities.
Overview
DeepSeek-R1 is a 671B-parameter Mixture-of-Experts reasoning model (37B activated per
token) that shares its architecture with DeepSeek-V3, so the same launch recipes apply
to both. DeepSeek also publishes a refreshed checkpoint, DeepSeek-R1-0528, and NVIDIA
publishes an FP4-quantized variant (nvidia/DeepSeek-R1-FP4) that runs on Blackwell
GPUs with fewer devices.
Prerequisites
- Hardware (FP8): 8x H200 GPUs (verified)
- Hardware (FP4): 4x B200 GPUs
- vLLM: Install the latest release in a fresh virtual environment:
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
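To confirm the environment is ready, you can print the installed vLLM version (a quick sanity check, not part of the recipe itself):
python -c "import vllm; print(vllm.__version__)"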
Serving
8xH200 (FP8)
Tensor Parallel + Expert Parallel (TP8+EP):
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-expert-parallel
Data Parallel + Expert Parallel (DP8+EP):
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--data-parallel-size 8 \
--enable-expert-parallel
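Once either server is up (it listens on port 8000 by default), you can exercise it through vLLM's OpenAI-compatible API. The prompt and max_tokens below are illustrative placeholders:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "deepseek-ai/DeepSeek-R1-0528",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "max_tokens": 1024
}'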
4xB200 (FP4)
Enable the FlashInfer MoE kernels before launching:
# For the FP4 checkpoint (recommended on Blackwell)
export VLLM_USE_FLASHINFER_MOE_FP4=1
# For an FP8 checkpoint served on Blackwell
export VLLM_USE_FLASHINFER_MOE_FP8=1
Tensor Parallel + Expert Parallel (TP4+EP):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel
Data Parallel + Expert Parallel (DP4+EP):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
--trust-remote-code \
--data-parallel-size 4 \
--enable-expert-parallel
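The client example above works unchanged here once the model name is swapped to nvidia/DeepSeek-R1-FP4. To confirm which checkpoint a server is actually hosting, query the standard OpenAI-compatible models endpoint:
curl http://localhost:8000/v1/models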
Benchmarking
For benchmarking, disable prefix caching by adding --no-enable-prefix-caching
to the server command.
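For example, the FP8 TP8+EP launch from above with prefix caching disabled:
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--no-enable-prefix-caching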
# FP8 benchmark
vllm bench serve \
--model deepseek-ai/DeepSeek-R1-0528 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
# FP4 benchmark
vllm bench serve \
--model nvidia/DeepSeek-R1-FP4 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Test different workloads by adjusting input/output lengths:
- Prompt-heavy: 8000 input / 1000 output
- Decode-heavy: 1000 input / 8000 output
- Balanced: 1000 input / 1000 output
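For instance, a decode-heavy run against the FP8 server swaps only the two length flags:
vllm bench serve \
--model deepseek-ai/DeepSeek-R1-0528 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos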
Troubleshooting
- Disaggregated serving with wide EP (experimental, GB200): see vLLM issue #33583, the vLLM blog post, and the reference fork for GB200 disaggregated-serving recipes.