Moonshot AI

moonshotai/Kimi-K2.5

Open-source native multimodal agentic MoE model with vision-language understanding, tool calling, and thinking modes

MoE · 1T total / 32B active parameters · 262,144 context length · vLLM 0.19.1+ · multimodal · text

Overview

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It integrates vision and language understanding with advanced agentic capabilities, supporting both instant and thinking modes as well as conversational and agentic paradigms.

Prerequisites

  • vLLM version: >= 0.15.0 (speculative decoding with Eagle3 requires >= 0.18.0)
  • Hardware (BF16): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)
  • Hardware (NVFP4): 4x Blackwell GPUs (e.g. GB200)
  • AMD support: 8x MI300X / MI325X / MI355X with ROCm 7.2.1 and Python 3.12

Install vLLM

Pip (NVIDIA):

uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend auto

Pip (AMD ROCm):

uv venv --python 3.12
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm

Docker (NVIDIA):

docker pull vllm/vllm-openai:v0.17.0-cu130
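The Client Usage section below assumes a server is already running. A minimal launch sketch for the verified 8x H200 BF16 setup is shown here; the flag values are illustrative and should be adjusted to your hardware:

```shell
# Sketch: serve Kimi-K2.5 across 8 GPUs (matches the 8x H200 BF16
# configuration listed under Prerequisites). Tune flags for your setup.
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.90
```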

Client Usage

Once the vLLM server is running, consume it via the OpenAI-compatible API:

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600  # generous timeout: large multimodal requests can be slow
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages,
    max_tokens=2048
)
print(f"Response time: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
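If you send many image-plus-prompt requests, the message construction above can be factored into a small helper. The `build_image_message` function below is a hypothetical convenience for illustration, not part of the vLLM or OpenAI client API:

```python
# Hypothetical helper: builds the OpenAI-style multimodal message list
# used above, pairing one image URL with a text prompt.
def build_image_message(image_url: str, prompt: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed directly as `messages` to `client.chat.completions.create`.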

Troubleshooting

  • OOM errors: Lower --gpu-memory-utilization or adjust TP/EP to match your GPU count.
  • Vision encoder performance: Use --mm-encoder-tp-mode data to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
  • Unique multimodal inputs: Pass --mm-processor-cache-gb 0 to avoid caching overhead. For repeated inputs, --mm-processor-cache-type shm uses host shared memory for better performance at high TP settings.
  • MoE kernel tuning: Use the benchmark_moe script from vLLM to tune Triton kernels for your specific hardware.
  • Async scheduling: Enabled by default for better throughput. Disable if you encounter issues, and file a bug report to vLLM.
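Putting the multimodal tuning advice above together, a sketch of a launch command for a workload with mostly-unique multimodal inputs (flag values are illustrative, not verified defaults):

```shell
# Illustrative: run the vision encoder in data-parallel mode and disable
# the multimodal processor cache, as recommended above for unique inputs.
vllm serve moonshotai/Kimi-K2.5 \
    --tensor-parallel-size 8 \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-gb 0
```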

References