vLLM Recipes · DeepSeek

deepseek-ai/DeepSeek-V4-Pro

DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning.

MoE · 1600B total / 49B active · 1,048,576-token context · vLLM 0.20.1+ · text

Overview

DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active Mixture-of-Experts model. It pairs a hybrid attention stack — Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) — with Manifold-Constrained Hyper-Connections (mHC) to reach 27% of V3.2's per-token inference FLOPs and 10% of V3.2's KV cache at 1M context. Pre-trained on 32T+ tokens with the Muon optimizer for faster convergence; post-training is a two-stage pipeline (domain-specific expert cultivation + unified consolidation via on-policy distillation).

The checkpoint uses mixed FP4+FP8 precision: MoE expert weights are stored in FP4, while the remaining parameters (attention, norms, router) stay in FP8.
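As a rough sanity check on the checkpoint footprint, here is a back-of-envelope sketch. The FP4/FP8 parameter split and the scaling-factor overhead below are illustrative assumptions, not figures from the model card:

```python
def checkpoint_size_gb(fp4_params: float, fp8_params: float,
                       scale_overhead: float = 0.03) -> float:
    """Estimate checkpoint size: FP4 = 0.5 bytes/param, FP8 = 1 byte/param,
    plus a small assumed overhead for block scaling factors."""
    raw_bytes = fp4_params * 0.5 + fp8_params * 1.0
    return raw_bytes * (1 + scale_overhead) / 1e9

# Illustrative split of the 1.6T total parameters (assumed, not official):
est = checkpoint_size_gb(fp4_params=1.45e12, fp8_params=0.15e12)
print(f"{est:.0f} GB")  # ≈ 901 GB under this illustrative split
```

This lands in the same ballpark as the ~960 GB mixed-precision checkpoint size mentioned in the hardware notes; the exact figure depends on the true expert/dense split and scale-factor storage.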

Reasoning modes

The chat template exposes three reasoning-effort modes:

  • Non-think — fast, intuitive responses.
  • Think High — explicit chain-of-thought for complex problem-solving and planning.
  • Think Max — maximum reasoning effort; requires --max-model-len >= 393216 (384K tokens) to avoid truncation.

Recommended sampling: temperature = 1.0, top_p = 1.0.

OpenAI Client Example

For DeepSeek-V4, keep reasoning controls in chat_template_kwargs, since the chat template exposes the custom Think Max mode via "reasoning_effort": "max".

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Pro"
messages = [{"role": "user", "content": "What is 17*19? Return only the final integer."}]

# Non-think, with the recommended sampling settings (temperature=1.0, top_p=1.0)
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=1.0,
    top_p=1.0,
)

# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)

# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)

Hardware configurations

  • B300 (8× GPU): single-node DP + EP with --data-parallel-size 8.
  • H200 (8× GPU): DP + EP with --data-parallel-size 8. Context is capped at 800K tokens (--max-model-len 800000) to leave KV headroom with dense params replicated across ranks — applies to both single-node and multi-node H200.
  • GB200 NVL4 (4× GPU per tray): the ~960 GB mixed-precision checkpoint does not fit on one tray; run multi-node DP + EP across 2 trays (8 GPUs total) with --data-parallel-size 8.
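Assuming these deployments use vLLM's standard data-parallel and expert-parallel flags, the single-node launches above might be sketched as follows (flag values come from the notes above; verify the exact invocation against your vLLM version):

```shell
MODEL=deepseek-ai/DeepSeek-V4-Pro

# B300, single node (8 GPUs): DP + EP, full context
vllm serve "$MODEL" \
  --data-parallel-size 8 \
  --enable-expert-parallel

# H200, single node (8 GPUs): cap context at 800K tokens for KV headroom
vllm serve "$MODEL" \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 800000

# GB200 NVL4 (2 trays x 4 GPUs): multi-node DP + EP. Recent vLLM versions
# split data-parallel ranks across nodes with flags such as
# --data-parallel-size-local / --data-parallel-address / --data-parallel-rpc-port;
# consult vLLM's data-parallel deployment docs for the exact two-node invocation.
```

The shared idea across all three targets is --data-parallel-size 8: eight DP ranks each replicate the dense (attention/norm/router) parameters while the experts are sharded across ranks via expert parallelism.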