
deepseek-ai/DeepSeek-OCR-2

Next-generation DeepSeek OCR model with improved document-to-markdown grounding and optical context compression.

Model: deepseek-ai/DeepSeek-OCR-2 on Hugging Face
Tags: dense · 3B · 8,192 context · vLLM 0.12.0+ · multimodal

Overview

DeepSeek-OCR-2 is a frontier OCR model exploring optical context compression for LLMs. It iterates on DeepSeek-OCR with better grounding and markdown conversion, and supports grounding prompts such as "<image>\n<|grounding|>Convert the document to markdown. " for richer document parsing.

Prerequisites

  • Hardware: Single GPU with >=8 GB VRAM is typically sufficient for BF16 inference.
  • vLLM: 0.12.0 or newer (tested with uv pip install -U vllm --torch-backend auto).
  • Python: 3.10+

Install vLLM:

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Client Usage

Offline OCR (Python)

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR-2",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image_1}},
    {"prompt": prompt, "multi_modal_data": {"image": image_2}},
]

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
    skip_special_tokens=False,
)

for output in llm.generate(model_input, sampling_param):
    print(output.outputs[0].text)
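Since each output corresponds to one input image, it is often convenient to persist the generated markdown per page rather than just printing it. A minimal sketch (the helper name and output directory are illustrative, not part of the recipe):

```python
from pathlib import Path


def save_markdown(texts, out_dir="ocr_output"):
    """Write each generated markdown string to page_<n>.md under
    out_dir and return the list of paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts, start=1):
        path = out / f"page_{i}.md"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


# Usage with the generation loop above:
# save_markdown(o.outputs[0].text for o in llm.generate(model_input, sampling_param))
```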

Online OCR serving

vllm serve deepseek-ai/DeepSeek-OCR-2 \
  --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0
Then query the server with an OpenAI-compatible client:

import time
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"}},
            {"type": "text", "text": "Free OCR."},
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-OCR-2",
    messages=messages,
    max_tokens=2048,
    temperature=0.0,
    extra_body={
        "skip_special_tokens": False,
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            "whitelist_token_ids": [128821, 128822],
        },
    },
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
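The request above points at a remote image URL; the same chat endpoint also accepts images inlined as base64 data URLs, which is handy for local files. A small sketch (the file path and PNG MIME type are assumptions; adjust for your files):

```python
import base64


def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Read a local image file and encode it as a base64 data URL,
    suitable for the image_url field of a chat completion request."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# messages = [{
#     "role": "user",
#     "content": [
#         {"type": "image_url", "image_url": {"url": image_to_data_url("page_1.png")}},
#         {"type": "text", "text": "Free OCR."},
#     ],
# }]
```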

Troubleshooting / Configuration Tips

  • Load the NGramPerReqLogitsProcessor custom logits processor alongside the model (as shown in the examples above) for the best OCR and markdown generation quality.
  • Unlike multi-turn chat, OCR tasks do not typically benefit from prefix caching or image reuse, so disable these features to avoid unnecessary hashing and caching overhead.
  • DeepSeek-OCR-2 responds better to plain prompts (e.g. "Free OCR.") than to chat-style instruction formats. See the official repository for the recommended prompts.
  • Depending on your hardware, adjust max_num_batched_tokens for better throughput.
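
As an example of the last point, the batching budget can be raised at serve time; the value below is illustrative, so tune it to your GPU:

vllm serve deepseek-ai/DeepSeek-OCR-2 \
  --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0 \
  --max-num-batched-tokens 16384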

References