deepseek-ai/DeepSeek-R1
DeepSeek-R1 is a 671B-parameter MoE reasoning model built on the DeepSeek-V3 architecture, trained with large-scale reinforcement learning for strong chain-of-thought capabilities.
Overview
DeepSeek-R1 is a 671B-parameter Mixture-of-Experts reasoning model (37B activated per
token) that shares its architecture with DeepSeek-V3, so the same launch recipes apply
to both. DeepSeek also publishes a refreshed checkpoint, DeepSeek-R1-0528, and NVIDIA
publishes an FP4-quantized variant (nvidia/DeepSeek-R1-FP4) that runs on Blackwell
GPUs with fewer devices.
Prerequisites
- Hardware (FP8): 8x H200 GPUs (verified)
- Hardware (FP4): 4x B200 GPUs
- vLLM: Install the latest release in a fresh virtual environment:
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
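To confirm the environment is ready, you can print the installed vLLM version (a quick sanity check, not part of the recipe itself):
python -c "import vllm; print(vllm.__version__)"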
Serving
8xH200 (FP8)
Tensor Parallel + Expert Parallel (TP8+EP):
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-expert-parallel
Data Parallel + Expert Parallel (DP8+EP):
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--data-parallel-size 8 \
--enable-expert-parallel
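Once either server is up (it listens on port 8000 by default), you can exercise it through vLLM's OpenAI-compatible API. The prompt and max_tokens below are illustrative placeholders:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "deepseek-ai/DeepSeek-R1-0528",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "max_tokens": 1024
}'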
4xB200 (FP4)
Enable the FlashInfer MoE kernels before launching:
# For the FP4 checkpoint (recommended on Blackwell)
export VLLM_USE_FLASHINFER_MOE_FP4=1
# For an FP8 checkpoint served on Blackwell
export VLLM_USE_FLASHINFER_MOE_FP8=1
Tensor Parallel + Expert Parallel (TP4+EP):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel
Data Parallel + Expert Parallel (DP4+EP):
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
--trust-remote-code \
--data-parallel-size 4 \
--enable-expert-parallel
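The client example above works unchanged here once the model name is swapped to nvidia/DeepSeek-R1-FP4. To confirm which checkpoint a server is actually hosting, query the standard OpenAI-compatible models endpoint:
curl http://localhost:8000/v1/models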
Benchmarking
For benchmarking, disable prefix caching by adding --no-enable-prefix-caching
to the server command.
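For example, the FP8 TP8+EP launch from above with prefix caching disabled:
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--no-enable-prefix-caching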
# FP8 benchmark
vllm bench serve \
--model deepseek-ai/DeepSeek-R1-0528 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
# FP4 benchmark
vllm bench serve \
--model nvidia/DeepSeek-R1-FP4 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Test different workloads by adjusting input/output lengths:
- Prompt-heavy: 8000 input / 1000 output
- Decode-heavy: 1000 input / 8000 output
- Balanced: 1000 input / 1000 output
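For instance, a decode-heavy run against the FP8 server swaps only the two length flags:
vllm bench serve \
--model deepseek-ai/DeepSeek-R1-0528 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos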
Troubleshooting
- Disaggregated serving with wide EP (experimental, GB200): see vLLM issue #33583, the vLLM blog post, and the reference fork for GB200 disaggregated-serving recipes.