moonshotai/Kimi-K2.6
Open-source native multimodal agentic MoE model with vision-language understanding, tool calling, and thinking modes
Overview
Kimi K2.6 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It integrates vision-language understanding with advanced agentic capabilities such as tool calling, and supports both instant and thinking modes across conversational and agentic paradigms.
Prerequisites
- vLLM version: >= 0.19.1
- Hardware (INT4): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)
- AMD support: 8x MI300X / MI325X / MI355X with ROCm 7.2.1 and Python 3.12
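Before the client examples below will work, a vLLM server needs to be running. A minimal launch sketch, assuming the 8-GPU setup above (the exact flags are examples to adapt to your hardware, not a verified command for this model):

```shell
# Sketch: serve Kimi-K2.6 with vLLM's OpenAI-compatible server.
# --tensor-parallel-size should match your GPU count (8x H200 in the
# verified setup above); adjust or add flags for your deployment.
vllm serve moonshotai/Kimi-K2.6 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```

Once the server is up, it listens on port 8000 by default, which matches the `base_url` used in the client code below.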
Client Usage
Once the vLLM server is running, consume it via the OpenAI-compatible API:
```python
import time

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                },
            },
            {
                "type": "text",
                "text": "Read all the text in the image.",
            },
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=messages,
    max_tokens=2048,
)
print(f"Response took {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
```
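The example above points the server at a remote image URL. For local files, OpenAI-compatible APIs (including vLLM's) also accept base64-encoded data URLs in the `image_url` field, since the server cannot read files from the client's disk. A small helper sketch (the `image_message` function is ours for illustration, not part of any API):

```python
import base64


def image_message(path: str, prompt: str) -> dict:
    """Build a multimodal user message with a local image as a data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "role": "user",
        "content": [
            # The image is inlined as a base64 data URL.
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dict can be placed directly in the `messages` list passed to `client.chat.completions.create`.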
Troubleshooting
- OOM errors: Lower `--gpu-memory-utilization` or adjust TP/EP to match your GPU count.
- Vision encoder performance: Use `--mm-encoder-tp-mode data` to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
- Unique multimodal inputs: Pass `--mm-processor-cache-gb 0` to avoid caching overhead. For repeated inputs, `--mm-processor-cache-type shm` uses host shared memory for better performance at high TP settings.
- MoE kernel tuning: Use the `benchmark_moe` script from vLLM to tune Triton kernels for your specific hardware.
- Async scheduling: Enabled by default for better throughput. Disable it if you encounter issues, and file a bug report with vLLM.