deepseek-ai/DeepSeek-V4-Pro

DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning.

Frontier 1.6T/49B reasoning MoE with native FP4+FP8 weights, MTP speculative decoding, and 1M-token context


moe · 1600B / 49B · 1,048,576 ctx · vLLM 0.20.0+ · text

Guide

Overview

DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active Mixture-of-Experts model. It pairs a hybrid attention stack, Compressed Sparse Attention (CSA) plus Heavily Compressed Attention (HCA), with Manifold-Constrained Hyper-Connections (mHC). At 1M context, this brings per-token inference FLOPs down to 27% of V3.2's and the KV cache down to 10% of V3.2's. The model was pre-trained on 32T+ tokens with the Muon optimizer for faster convergence. Post-training is a two-stage pipeline: domain-specific expert cultivation, followed by unified consolidation via on-policy distillation.
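
As a quick sanity check, the ratios quoted above can be turned into reduction factors. The sketch below only rearranges the numbers stated in this overview; the variable names are illustrative and nothing here is measured.

```python
# Figures quoted in the overview; no benchmarking happens here.
total_params = 1.6e12    # 1.6T total parameters
active_params = 49e9     # 49B parameters active per token
flops_vs_v32 = 0.27      # per-token inference FLOPs relative to V3.2
kv_vs_v32 = 0.10         # KV cache size relative to V3.2

active_fraction = active_params / total_params  # share of weights used per token
flops_reduction = 1 / flops_vs_v32              # how many times fewer FLOPs
kv_reduction = 1 / kv_vs_v32                    # how many times smaller KV cache

print(f"active fraction per token:  {active_fraction:.2%}")
print(f"FLOPs reduction vs V3.2:    {flops_reduction:.1f}x")
print(f"KV cache reduction vs V3.2: {kv_reduction:.0f}x")
```

In other words, roughly 3% of the weights are active for any given token, and the stated ratios correspond to about 3.7x fewer per-token FLOPs and a 10x smaller KV cache than V3.2.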

The checkpoint is mixed FP4+FP8: MoE expert weights are stored in FP4, while the remaining parameters (attention, norm, router) stay in FP8.

An NVFP4 variant (nvidia/DeepSeek-V4-Pro-NVFP4) is also available: NVIDIA modelopt re-quantizes the MoE experts to standard NVFP4, while attention, shared experts, router head, and MTP stay FP8. Pick it from the Variant row; it runs on Blackwell GPUs with the FP4 indexer cache. Unlike the native checkpoint, the NVFP4 experts don't support the deep_gemm_mega_moe MoE kernel (FP8-only), so this variant runs on the default MoE backend.

Reasoning modes

The chat template exposes three reasoning-effort modes:

Non-think — fast, intuitive responses.
Think High — explicit chain-of-thought for complex problem-solving and planning.
Think Max — maximum reasoning effort; requires --max-model-len >= 393216 (384K tokens) to avoid truncation.

Recommended sampling: temperature = 1.0, top_p = 1.0.

OpenAI Client Example

For DeepSeek-V4, pass reasoning controls through chat_template_kwargs: the chat template exposes a custom Think Max mode via "reasoning_effort": "max".

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Pro"
messages = [{"role": "user", "content": "What is 17*19? Return only the final integer."}]

# Non-think
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)

# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)

# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)
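
The three calls above differ only in their extra_body, so they can be folded into one helper. The sketch below is a minimal version under stated assumptions: build_request_kwargs and its mode labels ("non-think", "high", "max") are hypothetical names introduced here, not part of vLLM or the DeepSeek chat template. The context-length guard restates the Think Max requirement from the Reasoning modes section, and the sampling values are the recommended ones.

```python
from typing import Any, Dict, Optional

# Minimum --max-model-len the guide requires for Think Max (384K tokens).
THINK_MAX_MIN_CONTEXT = 384 * 1024  # 393216


def build_request_kwargs(mode: str, max_model_len: Optional[int] = None) -> Dict[str, Any]:
    """Return keyword arguments for client.chat.completions.create().

    mode is "non-think", "high", or "max" (labels chosen for this sketch).
    max_model_len is the value the server was launched with, if known.
    """
    # Recommended sampling settings from the guide.
    kwargs: Dict[str, Any] = {"temperature": 1.0, "top_p": 1.0}
    if mode == "non-think":
        return kwargs  # no chat_template_kwargs, matching the first call above
    if mode not in ("high", "max"):
        raise ValueError(f"unknown reasoning mode: {mode!r}")
    if mode == "max" and max_model_len is not None and max_model_len < THINK_MAX_MIN_CONTEXT:
        raise ValueError(
            f"Think Max needs --max-model-len >= {THINK_MAX_MIN_CONTEXT}, got {max_model_len}"
        )
    kwargs["extra_body"] = {
        "chat_template_kwargs": {"thinking": True, "reasoning_effort": mode},
    }
    return kwargs
```

With the client, model, and messages defined earlier, a Think Max request would then read client.chat.completions.create(model=model, messages=messages, **build_request_kwargs("max", max_model_len=1_048_576)).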

Recommended deployments

B300 (8× GPU): single-node DP + EP with --data-parallel-size 8.
H200 (8× GPU): DP + EP with --data-parallel-size 8. Context is capped at 800K tokens (--max-model-len 800000) to leave KV cache headroom, because dense parameters are replicated across ranks. The cap applies to both single-node and multi-node H200.
MI355X (8× GPU): validated with ROCm + AITER (VLLM_ROCM_USE_AITER=1), --gpu-memory-utilization 0.9, --max-num-seqs 128, --max-num-batched-tokens 8192, and --distributed-executor-backend mp.
GB200 NVL4 (4× GPU per tray): the ~960 GB mixed-precision checkpoint does not fit on one tray; run multi-node DP + EP across 2 trays (8 GPUs total) with --data-parallel-size 8. Pick the "Multi-Node" tab and set nodes to 2.
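
The DP + EP recommendations in the list above can be captured as a small preset table. This is a sketch, not the launch commands the page generates: PRESETS and serve_command are hypothetical names, the "nodes" field is bookkeeping only, and --enable-expert-parallel is assumed to be the flag that turns on expert parallelism, since the list does not name it. Multi-node runs need additional per-node arguments that are not modeled here, and MI355X is left out because its full command appears in the next section.

```python
import shlex

MODEL = "deepseek-ai/DeepSeek-V4-Pro"

# Flag values copied from the deployment list; "nodes" is informational only.
PRESETS = {
    "B300": {"nodes": 1, "flags": ["--data-parallel-size", "8"]},
    "H200": {
        "nodes": 1,
        "flags": ["--data-parallel-size", "8", "--max-model-len", "800000"],
    },
    "GB200-NVL4": {"nodes": 2, "flags": ["--data-parallel-size", "8"]},
}


def serve_command(hardware: str) -> str:
    """Build a shell-quoted `vllm serve` command line for one preset."""
    preset = PRESETS[hardware]
    # Assumption: expert parallelism is enabled with --enable-expert-parallel.
    argv = ["vllm", "serve", MODEL, "--enable-expert-parallel", *preset["flags"]]
    return shlex.join(argv)


if __name__ == "__main__":
    for name, preset in PRESETS.items():
        print(f"{name} ({preset['nodes']} node(s)): {serve_command(name)}")
```

Keeping the presets in data rather than in separate scripts makes the H200 context cap visible as the only flag that differs between the 8-GPU NVIDIA targets.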

MI355X (8×288GB)

export VLLM_ROCM_USE_AITER=1

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --host localhost \
  --port 8001 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --compilation-config '{"mode": 3, "cudagraph_mode": "FULL_DECODE_ONLY"}'

The MI355X deployment was validated on the GSM8K dataset:

Launch command

MODEL=deepseek-ai/DeepSeek-V4-Pro
lm_eval --model local-completions \
  --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=128,max_retries=10,max_gen_toks=2048,timeout=60000 \
  --batch_size auto \
  --tasks gsm8k \
  --num_fewshot 8 \
  --output_path . 2>&1 | tee -a eval.log

Reported result

local-completions ({'model': 'deepseek-ai/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 128, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|
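
The Stderr column can be reproduced by hand, which helps when judging whether a difference between two runs is meaningful. The sketch below assumes the run scored the full GSM8K test split of 1,319 problems (the log shows limit: None) and uses the plain binomial formula, which may differ slightly from the estimator the harness uses internally. The function name is illustrative.

```python
import math

GSM8K_TEST_SIZE = 1319  # size of the GSM8K test split; not printed in the table


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores with the given accuracy."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


for name, acc in [("flexible-extract", 0.9538), ("strict-match", 0.9545)]:
    se = binomial_stderr(acc, GSM8K_TEST_SIZE)
    # Normal-approximation 95% confidence interval around the reported score.
    low, high = acc - 1.96 * se, acc + 1.96 * se
    print(f"{name}: stderr={se:.4f}, 95% CI=[{low:.4f}, {high:.4f}]")
```

The 0.0007 gap between the two filters is about one problem out of 1,319, well inside either interval, so the two extraction modes agree for practical purposes.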

KV Cache Offloading

Agentic and multi-turn workloads reuse long prefixes whose KV state can exceed on-GPU capacity. The KV Offload row attaches a host-DRAM KV tier to any serving strategy — pick one of three connectors:

Simple — SimpleCPUOffloadConnector: spills KV blocks to a per-rank region of CPU DRAM on each node. The simplest way to extend effective KV capacity for a single instance.
Mooncake — MooncakeStoreConnector: pools CPU DRAM into a distributed shared KV store — either embedded (each rank's vLLM worker runs a Mooncake client that donates a DRAM segment) or standalone (a per-node mooncake_store_service owns the node's DRAM and contributes it to the pool, decoupling cache lifetime from the engine). A cluster-wide mooncake_master coordinates the pool; across 2+ instances, GPU workers share it for cross-instance prefix reuse. Install from the extra-install block above; see the Mooncake docs.
LMCache — LMCacheMPConnector: a node-local KV pool served by a companion lmcache server process, launched before vllm serve. Single-node strategies only. Install from the extra-install block above; see the LMCache docs.
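
On the command line, a connector choice is typically passed to vllm serve as a JSON value. The sketch below assumes the three connectors are selected through vLLM's --kv-transfer-config flag with a kv_connector name and a kv_role of "kv_both"; the connector class names come from the list above, while the helper name, the short keys, and the absence of any connector-specific extra settings are assumptions of this example. Mooncake and LMCache need their companion services and further configuration, which is not shown.

```python
import json
from typing import List

# Connector class names as given in the list above; the short keys are
# labels invented for this sketch.
CONNECTORS = {
    "simple": "SimpleCPUOffloadConnector",
    "mooncake": "MooncakeStoreConnector",
    "lmcache": "LMCacheMPConnector",
}


def kv_transfer_args(choice: str) -> List[str]:
    """Return extra `vllm serve` arguments selecting one KV offload connector."""
    config = {
        "kv_connector": CONNECTORS[choice],
        "kv_role": "kv_both",  # assumption: the instance both saves and loads KV
    }
    return ["--kv-transfer-config", json.dumps(config)]


if __name__ == "__main__":
    print(" ".join(kv_transfer_args("simple")))
```

Appending the returned arguments to an existing serve command keeps the offload tier independent of the parallelism settings, in line with the idea that it attaches to any serving strategy.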