
deepseek-ai/DeepSeek-V3.1

DeepSeek-V3.1 is a hybrid MoE model that supports dynamic switching between thinking and non-thinking modes, with tool calling and function execution.

MoE · 671B total / 37B active · 163,840 context · vLLM 0.12.0+ · text

Overview

DeepSeek-V3.1 is a hybrid MoE model that supports both thinking and non-thinking modes. You can dynamically switch between the two modes from the client by passing extra_body={"chat_template_kwargs": {"thinking": True|False}}.

Prerequisites

  • Hardware: 8x H200 (or H20) GPUs (141 GB per GPU)
  • vLLM: Current stable release
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Launching DeepSeek-V3.1

Serving on 8xH200 (or H20) GPUs

vllm serve deepseek-ai/DeepSeek-V3.1 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --served-model-name ds31
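Once the server is up, you can confirm it is serving the expected model by querying the OpenAI-compatible `/v1/models` endpoint. A minimal sketch, assuming the server above is listening on `localhost:8000`; `served_model_ids` and `check_server` are small helpers introduced here for illustration, not part of vLLM.

```python
import json
import urllib.request


def served_model_ids(models_json: dict) -> list[str]:
    """Extract model ids from a /v1/models response payload."""
    return [m["id"] for m in models_json.get("data", [])]


def check_server(base_url: str = "http://localhost:8000") -> list[str]:
    """GET /v1/models and return the names the server reports."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return served_model_ids(json.loads(resp.read()))
```

With the launch command above, `check_server()` should return `["ds31"]`, since the model was registered under `--served-model-name ds31`.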

Function calling

vLLM supports user-defined tool calling for DeepSeek-V3.1. Add these flags when launching the server. The example chat template ships in the official container and can also be downloaded from the vLLM repo: tool_chat_template_deepseekv31.jinja.

vllm serve ... \
  --enable-auto-tool-choice \
  --tool-call-parser deepseek_v31 \
  --chat-template examples/tool_chat_template_deepseekv31.jinja
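With these flags, the server parses the model's tool invocations into the standard OpenAI `tool_calls` format, so an ordinary OpenAI-SDK client loop works. A minimal sketch, assuming the server above is reachable at `localhost:8000` with `--served-model-name ds31`; the `get_weather` schema and the local `dispatch_tool` executor are hypothetical examples, not part of vLLM.

```python
import json

# Hypothetical example tool; replace with your own function schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def dispatch_tool(name: str, arguments: str) -> str:
    """Execute a tool call locally (stubbed here for illustration)."""
    args = json.loads(arguments)
    if name == "get_weather":
        return f"Sunny, 22C in {args['city']}"
    raise ValueError(f"unknown tool: {name}")


def demo() -> None:
    # Imported here so the schema and dispatcher above are usable
    # without the openai package installed.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    response = client.chat.completions.create(
        model="ds31",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=TOOLS,
        tool_choice="auto",
    )
    for call in response.choices[0].message.tool_calls or []:
        result = dispatch_tool(call.function.name, call.function.arguments)
        print(call.function.name, "->", result)
```

Call `demo()` with the server running; in a full loop you would append each tool result back to `messages` as a `{"role": "tool", ...}` message and query the model again.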

Client Usage

OpenAI Python SDK

Control thinking mode via extra_body={"chat_template_kwargs": {"thinking": False}} (or True to enable thinking).

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "<think>Hmm</think>I am DeepSeek"},
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]
response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(response.choices[0].message.content)

When thinking=True, the response contains a chain-of-thought segment terminated by a </think> tag, followed by the final answer; when thinking=False, the model produces a direct answer with no thinking segment.
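If you want to separate the reasoning from the final answer on the client side, you can split on the closing tag. A minimal sketch; `split_thinking` is a helper introduced here, and it assumes the reasoning segment ends with a literal `</think>` as described above.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer) on the </think> tag.

    Returns an empty reasoning string when no thinking segment is
    present (e.g. when the request was made with thinking=False).
    """
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        return "", text.strip()  # no thinking segment found
    # Some templates also emit an opening <think> tag; drop it if present.
    return reasoning.removeprefix("<think>").strip(), answer.strip()
```

For example, `split_thinking("<think>Hmm</think>I am DeepSeek")` returns `("Hmm", "I am DeepSeek")`.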

curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ds31",
        "messages": [
            {"role": "user", "content": "9.11 and 9.8, which is greater?"}
        ],
        "chat_template_kwargs": {"thinking": true}
    }'

References