openbmb/MiniCPM5-1B

MiniCPM5-1B — dense 1B on-device LLM with hybrid Think/No-Think reasoning, native 128K context, and strong agentic tool use, built on the standard Llama architecture

1B-class open-source SOTA on tool use, code, and reasoning

dense | 1.1B | 131,072 ctx | vLLM 0.21.0+ | text

Guide

Overview

MiniCPM5-1B is the first checkpoint in OpenBMB's MiniCPM5 series — a dense 1B model built for on-device and resource-constrained deployment, reaching 1B-class open-source SOTA on agentic tool use, code generation, and difficult reasoning. It uses the standard LlamaForCausalLM architecture, so vLLM loads it natively with no custom kernels or model-code fork.

Its headline feature is hybrid reasoning: a single checkpoint serves as both a fast assistant (No-Think) and a deliberate reasoner (Think), toggled by the chat template's enable_thinking flag.

Prerequisites

vLLM ≥ 0.21.0 — MiniCPM5-1B is supported natively as of the v0.21.0 release. (For CUDA 12.x driver hosts, the cookbook suggests vllm==0.10.1.1 as a fallback.)
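To confirm an environment meets the minimum before launching, a small version check can help. The following is a minimal sketch, not part of the official guide: the helper names `parse_release`, `meets_minimum`, and `installed_vllm_ok` are hypothetical, and the parser deliberately ignores pre-release and local tags (so `0.21.0rc1` is treated like `0.21.0`).

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = "0.21.0"  # minimum release stated in this guide


def parse_release(v: str) -> tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Simplification: local tags ('+cu128') and suffixes ('.post1', 'rc1')
    are dropped rather than ordered.
    """
    parts = []
    for piece in v.split("+")[0].split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_VLLM) -> bool:
    """Compare two versions after zero-padding them to the same length."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_vllm_ok() -> bool:
    """True if a vLLM distribution is installed and is new enough."""
    try:
        return meets_minimum(version("vllm"))
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("vLLM OK for MiniCPM5-1B:", installed_vllm_ok())
```

Note that the fallback pin mentioned above (0.10.1.1) fails this check by design, since it compares only against the native-support minimum.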

Launch command

The baseline launch command is:

vllm serve openbmb/MiniCPM5-1B --port 8000

At ~1.1B parameters the model fits on a single GPU (TP=1). It supports the full native 128K context; lower --max-model-len to 8192 or 32768 to free KV cache on small GPUs, and set --enforce-eager if CUDA graph capture runs out of memory on a tight VRAM budget.
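To see why a smaller --max-model-len helps, it is useful to estimate how much KV cache one full-length sequence needs. The sketch below uses the standard per-token formula for attention with grouped KV heads (2 tensors x layers x KV heads x head dim x bytes per element). The values in `EXAMPLE` are placeholders chosen for illustration, not MiniCPM5-1B's actual configuration; read the real ones from the model's config.json (`num_hidden_layers`, `num_key_value_heads`, and the head dimension).

```python
def kv_cache_bytes(
    context_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # 2 bytes per element for fp16/bf16
) -> int:
    """Bytes of KV cache needed to hold one sequence of `context_len` tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
    return per_token * context_len


# PLACEHOLDER values for illustration only; not the real model config.
EXAMPLE = {"num_layers": 24, "num_kv_heads": 8, "head_dim": 64}

if __name__ == "__main__":
    for ctx in (8192, 32768, 131072):
        gib = kv_cache_bytes(ctx, **EXAMPLE) / 2**30
        print(f"{ctx:>7} tokens -> {gib:.3f} GiB per sequence")
```

The cost is linear in context length, so going from 131,072 to 8,192 tokens cuts the per-sequence requirement by 16x. vLLM allocates the cache as a shared pool, so treat this as a rough lower bound for a single full-length request rather than a prediction of total memory use.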

Reasoning modes

To serve with deep thinking on by default, pass --default-chat-template-kwargs '{"enable_thinking": true}' at launch. You can also switch modes per request via chat_template_kwargs:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM5-1B",
    "messages": [{"role": "user", "content": "Explain GQA in one sentence."}],
    "temperature": 0.9, "top_p": 0.95, "max_tokens": 1024,
    "chat_template_kwargs": {"enable_thinking": true}
  }'

Mode	`enable_thinking`	`temperature`	`top_p`
Think	`true`	0.9	0.95
No-Think	`false`	0.7	0.95
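The table above can be folded into a small client helper so each request carries the sampling settings that match its mode. This is a minimal sketch using only the standard library; `build_chat_request` and `send` are hypothetical helper names, and the base URL assumes the server was started on port 8000 as in the launch command.

```python
import json
import urllib.request

MODEL = "openbmb/MiniCPM5-1B"

# Sampling settings per mode, copied from the table above.
SAMPLING = {
    True: {"temperature": 0.9, "top_p": 0.95},   # Think
    False: {"temperature": 0.7, "top_p": 0.95},  # No-Think
}


def build_chat_request(prompt: str, think: bool, max_tokens: int = 1024) -> dict:
    """Build a chat-completions payload for the chosen reasoning mode."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **SAMPLING[think],
        "chat_template_kwargs": {"enable_thinking": think},
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = send(build_chat_request("Explain GQA in one sentence.", think=False))
    print(reply["choices"][0]["message"]["content"])
```

Keeping the mode flag and the sampling values in one place avoids the easy mistake of sending Think-mode temperature with enable_thinking set to false.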

Tool calling

MiniCPM5-1B emits XML-style tool calls. The vLLM-side minicpm5 parser (PR #43175) merged to main on 2026-05-27 but is not in v0.21.0 or v0.22.0 — those releases were cut before the merge. Until a release bakes it in (v0.23+), load it as a plugin from the MiniCPM repo:

vllm serve openbmb/MiniCPM5-1B --port 8000 \
  --enable-auto-tool-choice \
  --tool-parser-plugin /path/to/MiniCPM/tool_parsers/minicpm5xml_tool_parser.py \
  --tool-call-parser minicpm5
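With the parser loaded, tool calls should come back in the OpenAI-compatible `tool_calls` field rather than as raw XML in the message text. The sketch below shows one way a client might declare a tool and read the parsed calls. The `get_weather` tool, the helper names, and the `SAMPLE_RESPONSE` are all hypothetical: the sample is hand-written to illustrate the response shape and is not real model output.

```python
import json

MODEL = "openbmb/MiniCPM5-1B"

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str, tools: list[dict]) -> dict:
    """Build a chat-completions payload that lets the model pick a tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, decoded arguments) for each parsed tool call."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string in this schema.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written sample illustrating the response shape; not real output.
SAMPLE_RESPONSE = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_weather",
                             "arguments": "{\"city\": \"Beijing\"}"},
            }],
        }
    }]
}

if __name__ == "__main__":
    print(json.dumps(build_tool_request("Weather in Beijing?", [WEATHER_TOOL]), indent=2))
    print(extract_tool_calls(SAMPLE_RESPONSE))
```

If `tool_calls` comes back empty while the message content contains XML-style tags, that usually suggests the server was started without the parser flags shown above.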

SGLang ships the minicpm5 parser built-in and is the author-recommended backend for tool calling.
