
deepseek-ai/DeepSeek-R1

DeepSeek-R1 is a 671B-parameter MoE reasoning model built on the DeepSeek-V3 architecture, trained with large-scale reinforcement learning for strong chain-of-thought capabilities.

MoE · 671B total / 37B active parameters · 163,840 context length · vLLM 0.12.0+ · text

Overview

DeepSeek-R1 is a 671B-parameter Mixture-of-Experts reasoning model (37B activated per token) that shares its architecture with DeepSeek-V3, so the same launch recipes apply to both. DeepSeek publishes a refreshed checkpoint as DeepSeek-R1-0528, and NVIDIA publishes an FP4 quantized variant (nvidia/DeepSeek-R1-FP4) that runs on Blackwell GPUs with fewer devices.

Prerequisites

  • Hardware (FP8): 8x H200 GPUs (verified)
  • Hardware (FP4): 4x B200 GPUs
  • vLLM: version 0.12.0 or newer, installed into a fresh virtual environment:

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
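As a rough sanity check on these GPU counts (back-of-envelope arithmetic, not part of the official recipe): FP8 stores about 1 byte per parameter and FP4 about 0.5, so the weights alone come to roughly 671 GB and 336 GB respectively:

```shell
# Back-of-envelope weight-memory check. Assumptions: 1 byte/param at FP8,
# 0.5 byte/param at FP4, 141 GB per H200, 192 GB per B200; KV cache and
# activation overhead are ignored.
python3 - <<'EOF'
params = 671e9
print(f"FP8 weights ~{params * 1.0 / 1e9:.0f} GB vs 8x H200 = {8 * 141} GB")
print(f"FP4 weights ~{params * 0.5 / 1e9:.0f} GB vs 4x B200 = {4 * 192} GB")
EOF
```

The headroom left after the weights is what vLLM allocates to the KV cache, which is why the full eight (or four) GPUs are needed in practice.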

Serving

8xH200 (FP8)

Tensor Parallel + Expert Parallel (TP8+EP):

vllm serve deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel

Data Parallel + Expert Parallel (DP8+EP):

vllm serve deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code \
  --data-parallel-size 8 \
  --enable-expert-parallel

4xB200 (FP4)

Enable FlashInfer MoE kernels before launching:

# For FP4 (recommended on Blackwell)
export VLLM_USE_FLASHINFER_MOE_FP4=1
# For FP8 on Blackwell
export VLLM_USE_FLASHINFER_MOE_FP8=1

Tensor Parallel + Expert Parallel (TP4+EP):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-expert-parallel

Data Parallel + Expert Parallel (DP4+EP):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
  --trust-remote-code \
  --data-parallel-size 4 \
  --enable-expert-parallel
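Once any of the servers above is up (vLLM listens on port 8000 by default), it can be queried through the OpenAI-compatible chat endpoint. A minimal sketch; the model name must match the checkpoint you are serving, and the prompt is illustrative:

```shell
# Send a chat completion request to the local vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-0528",
        "messages": [{"role": "user", "content": "What is 17 * 24? Answer briefly."}],
        "max_tokens": 256
      }'
```

Note that R1 checkpoints emit their chain of thought inside <think> tags before the final answer, so responses can be long; budget max_tokens accordingly.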

Benchmarking

For benchmarking, disable prefix caching by adding --no-enable-prefix-caching to the server command.

# FP8 benchmark
vllm bench serve \
  --model deepseek-ai/DeepSeek-R1-0528 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos
# FP4 benchmark
vllm bench serve \
  --model nvidia/DeepSeek-R1-FP4 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos

Test different workloads by adjusting input/output lengths:

  • Prompt-heavy: 8000 input / 1000 output
  • Decode-heavy: 1000 input / 8000 output
  • Balanced: 1000 input / 1000 output
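The three workloads above can be swept in one loop. This sketch only echoes each command so it can be reviewed first; remove the echo (and point --model at the checkpoint you are serving) to actually run the sweep:

```shell
# Print the bench command for each workload shape (prompt-heavy, decode-heavy, balanced)
for pair in "8000 1000" "1000 8000" "1000 1000"; do
  set -- $pair   # $1 = input length, $2 = output length
  echo vllm bench serve \
    --model deepseek-ai/DeepSeek-R1-0528 \
    --dataset-name random \
    --random-input-len "$1" \
    --random-output-len "$2" \
    --request-rate 10000 \
    --num-prompts 16 \
    --ignore-eos
done
```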

Troubleshooting

References