moonshotai/Kimi-K2.5
Open-source native multimodal agentic MoE model with vision-language understanding, tool calling, and thinking modes
Overview
Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It integrates vision-language understanding with advanced agentic capabilities, and supports both instant and thinking modes as well as conversational and agentic paradigms.
Prerequisites
- vLLM version: >= 0.15.0 (speculative decoding with Eagle3 requires >= 0.18.0)
- Hardware (BF16): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)
- Hardware (NVFP4): 4x Blackwell GPUs (e.g. GB200)
- AMD support: 8x MI300X / MI325X / MI355X with ROCm 7.2.1 and Python 3.12
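With the prerequisites in place, the server can be launched with a command along these lines (a sketch for the 8x H200 BF16 setup; adjust the tensor-parallel size to your GPU count, and check vllm serve --help for version-specific flags):

```shell
# Serve Kimi-K2.5 across 8 GPUs with tensor parallelism.
# --tensor-parallel-size must match the number of GPUs you shard across.
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```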
Install vLLM
Pip (NVIDIA):
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto
Pip (AMD ROCm):
uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Docker (NVIDIA):
docker pull vllm/vllm-openai:v0.17.0-cu130
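A typical way to run the pulled image is sketched below; the container forwards its arguments to the OpenAI-compatible server, and the GPU and shared-memory flags depend on your host setup:

```shell
# Run the vLLM OpenAI-compatible server in Docker on all GPUs.
# --ipc=host is commonly needed for PyTorch shared-memory use.
docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:v0.17.0-cu130 \
    --model moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8
```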
Client Usage
Once the vLLM server is running, consume it via the OpenAI-compatible API:
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": "Read all the text in the image.",
            },
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=2048,
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
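Besides HTTP URLs, the image_url field also accepts base64 data URLs, which is useful for local files the server cannot fetch. A minimal sketch (the helper name to_data_url is ours, not part of the OpenAI SDK):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL usable in an image_url content part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Build an image content part from local bytes (e.g. open("receipt.png", "rb").read()).
part = {"type": "image_url", "image_url": {"url": to_data_url(b"\x89PNG...")}}
```

The resulting part can be placed in the content list exactly where the HTTP image_url appears above.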
Troubleshooting
- OOM errors: Lower --gpu-memory-utilization or adjust TP/EP to match your GPU count.
- Vision encoder performance: Use --mm-encoder-tp-mode data to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
- Unique multimodal inputs: Pass --mm-processor-cache-gb 0 to avoid caching overhead. For repeated inputs, --mm-processor-cache-type shm uses host shared memory for better performance at high TP settings.
- MoE kernel tuning: Use the benchmark_moe script from vLLM to tune Triton kernels for your specific hardware.
- Async scheduling: Enabled by default for better throughput. Disable it if you encounter issues, and file a bug report with vLLM.
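For the MoE kernel tuning point above, vLLM ships the benchmark_moe script under benchmarks/kernels/ in its source tree. An invocation might look like the sketch below; the script path and flags vary between vLLM versions, so check the script's --help output first:

```shell
# Tune Triton MoE kernels for this model's expert shapes and save the best configs.
# Run from a vLLM source checkout; requires the target GPUs to be visible.
python benchmarks/kernels/benchmark_moe.py \
    --model moonshotai/Kimi-K2.5 \
    --tp-size 8 \
    --tune
```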