openbmb/MiniCPM-V-4.6
MiniCPM-V 4.6 (1.3B) — pocket-sized multimodal LLM for ultra-efficient single-image, multi-image, and video understanding, built on SigLIP2-400M + a Qwen3.5-0.8B hybrid-attention backbone
~1.5× token throughput vs Qwen3.5-0.8B with mixed 4×/16× visual token compression
Guide
Overview
MiniCPM-V 4.6 is OpenBMB's most edge-friendly multimodal model to date.
It pairs a SigLIP2-400M vision encoder with a Qwen3.5-0.8B hybrid
linear/full-attention language backbone (~1.3B params total). It handles
single-image, multi-image, and video understanding, and introduces mixed
4×/16× visual token compression — switch the downsample_mode per request
to trade accuracy for speed.
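As a rough illustration of what the two compression ratios mean for prompt length, the sketch below divides a visual token count by 4 or 16. The function name and the 1024-token starting count are assumptions made purely for illustration. The real count depends on image resolution and on how the model slices the image, which this recipe does not specify.

```python
import math


def compressed_visual_tokens(raw_tokens: int, factor: int) -> int:
    """Estimate how many visual tokens remain after compression by `factor`."""
    if factor not in (4, 16):
        raise ValueError("this recipe only describes 4x and 16x compression")
    return math.ceil(raw_tokens / factor)


# Hypothetical image that yields 1024 visual tokens before compression.
raw = 1024
for factor in (4, 16):
    print(f"{factor}x -> {compressed_visual_tokens(raw, factor)} tokens")
```

Under this assumption, the 16× mode leaves a quarter of the visual tokens that the 4× mode does, which is where the speed-for-accuracy trade comes from.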
MiniCPM-V 4.6 ships as two independent checkpoints. This recipe covers the Instruct model (openbmb/MiniCPM-V-4.6).
For chain-of-thought output, use the separate openbmb/MiniCPM-V-4.6-Thinking checkpoint instead (unlike v4.5, the mode is no longer toggled at runtime).
Prerequisites
- vLLM ≥ 0.22.0: the MiniCPMV4_6ForConditionalGeneration architecture landed via PR #43213 (merged 2026-05-22) and first shipped in the v0.22.0 release. No fork is required.
- transformers ≥ 5.7.0: the model is merged as a standalone architecture only as of this version.
- For video inputs, install the vllm[video] extra.
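The version floors above can be checked before launching. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the comparison looks only at the numeric major.minor.patch part, so a pre-release such as 0.22.0rc1 is treated as 0.22.0.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the prerequisites list.
MINIMUMS = {"vllm": "0.22.0", "transformers": "5.7.0"}


def release_tuple(v: str) -> tuple:
    """Turn a version like '0.22.1.dev3+g1234' into (0, 22, 1)."""
    parts = []
    for piece in v.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, so that 0.9.2 sorts below 0.22.0."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment() -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report


print(check_environment())
```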
Launch command
The baseline launch command is:
vllm serve openbmb/MiniCPM-V-4.6 --trust-remote-code --port 8000
To enable tool calling, add --enable-auto-tool-choice --tool-call-parser qwen3_coder to the launch command. The model emits a Qwen3-Coder-style
<tool_call> block inside the message content.
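For a server started with the tool-calling flags, a request can carry tool definitions in the standard OpenAI chat-completions format. The sketch below only builds the request body. The get_weather tool and its schema are hypothetical and stand in for whatever functions your application exposes.

```python
import json


def build_tool_request(question: str) -> dict:
    """Build a chat-completions body that offers one tool to the model."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "openbmb/MiniCPM-V-4.6",
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


print(json.dumps(build_tool_request("What is the weather in Beijing?"), indent=2))
```

POST the resulting body to /v1/chat/completions. With the parser enabled, the server is expected to turn the model's <tool_call> block into structured tool_calls entries on the response message, though it is worth inspecting the raw content as well when debugging.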
At 1.3B params the model fits on a single GPU (TP=1). v4.6 supports a context of
up to 256K tokens, but start with a smaller --max-model-len (e.g. 8192) and raise
it as your GPU memory and workload allow.
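The launch options discussed in this section can also be assembled programmatically, which is handy when the same script has to start the server with different context lengths. This is a sketch only: build_serve_command is a hypothetical helper, and it uses just the flags named in this recipe.

```python
import shlex


def build_serve_command(max_model_len: int = 8192,
                        port: int = 8000,
                        tool_calling: bool = False) -> list:
    """Return the argv list for `vllm serve` with the options from this recipe."""
    cmd = [
        "vllm", "serve", "openbmb/MiniCPM-V-4.6",
        "--trust-remote-code",
        "--port", str(port),
        "--max-model-len", str(max_model_len),
    ]
    if tool_calling:
        # Flags from the tool-calling note in this section.
        cmd += ["--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"]
    return cmd


# Print a copy-pasteable shell line; nothing is launched here.
print(shlex.join(build_serve_command()))
print(shlex.join(build_serve_command(max_model_len=32768, tool_calling=True)))
```

The returned list can be passed directly to subprocess.run or subprocess.Popen without shell quoting concerns.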
Client usage
Image:
curl -s http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "openbmb/MiniCPM-V-4.6",
"messages": [{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
{"type": "text", "text": "What causes this phenomenon?"}
]}]
}'
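The same request can be sent from Python with only the standard library. In this sketch, build_image_request and send are hypothetical helper names. Passing several URLs produces one image_url item per image, which assumes the server is configured to accept more than one image per prompt.

```python
import json
import urllib.request


def build_image_request(image_urls, question: str) -> dict:
    """Build a chat-completions body with one or more images plus a question."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": question})
    return {
        "model": "openbmb/MiniCPM-V-4.6",
        "messages": [{"role": "user", "content": content}],
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running vLLM server and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_image_request(
    ["https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"],
    "What causes this phenomenon?",
)
print(json.dumps(payload, indent=2))

# With the server running:
# reply = send(payload)
# print(reply["choices"][0]["message"]["content"])
```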
Stop tokens: v4.6 uses the Qwen3.5 backbone with a new vocab. If you see runaway generations, pass
"stop_token_ids": [248044, 248046] in your request body.
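A small helper can attach these IDs to any request body before it is sent. This is a sketch: with_stop_tokens is a hypothetical name, and the two IDs are the ones quoted in this note.

```python
import copy

# Stop token IDs quoted in this recipe for the v4.6 vocabulary.
STOP_TOKEN_IDS = [248044, 248046]


def with_stop_tokens(payload: dict) -> dict:
    """Return a copy of a chat-completions body with stop_token_ids set."""
    patched = copy.deepcopy(payload)
    patched["stop_token_ids"] = list(STOP_TOKEN_IDS)
    return patched


base = {
    "model": "openbmb/MiniCPM-V-4.6",
    "messages": [{"role": "user", "content": "Describe a rainbow in one sentence."}],
}
print(with_stop_tokens(base)["stop_token_ids"])
```

When using the official openai Python client rather than raw JSON, a non-standard field like this would typically be passed through the extra_body argument.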