moonshotai/Kimi-K2.7-Code
Coding-focused agentic MoE built on Kimi-K2.6, tuned for long-horizon software-engineering tasks with thinking-only reasoning, tool calling, and vision-language input
1T MoE coding agent with ~30% fewer thinking tokens than Kimi-K2.6
Guide
Overview
Kimi-K2.7-Code is a coding-focused agentic model built on top of Kimi-K2.6. It substantially improves real-world long-horizon coding performance, strengthening end-to-end task completion across complex software-engineering workflows. It is also more token-efficient: thinking-token usage drops by ~30% versus Kimi-K2.6.
It shares the same architecture as Kimi-K2.5 / Kimi-K2.6 (1T-parameter MoE, 32B activated,
DeepSeek-V3 backbone with MLA attention and a MoonViT vision encoder), so the K2.6
deployment method can be reused directly. Kimi-K2.7-Code runs in thinking mode only
with preserve_thinking forced on; keep the reasoning parser enabled.
Prerequisites
- vLLM version: >= 0.19.1 (manually verified). The model is also available in the vLLM nightly wheel, but nightly builds are experimental.
- Hardware (INT4): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
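As a rough sanity check on the hardware figure above, the weight footprint can be estimated from the parameter count and bit width. This is a back-of-the-envelope sketch only: it assumes every one of the 1T parameters is stored at 4 bits, which is a simplification (a real checkpoint may keep some layers at higher precision), and it ignores KV cache, activations, and runtime overhead. The helper name `weight_footprint_gb` is made up for this illustration.

```python
def weight_footprint_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 1_000_000_000_000  # 1T parameters, from the model description
AGGREGATE_VRAM_GB = 640           # aggregate VRAM figure quoted in the prerequisites

# Assumption: all parameters at 4 bits. Treat the result as a lower bound.
int4_weights_gb = weight_footprint_gb(TOTAL_PARAMS, 4)
headroom_gb = AGGREGATE_VRAM_GB - int4_weights_gb

print(f"INT4 weights (lower bound): {int4_weights_gb:.0f} GB")
print(f"Left for KV cache and overhead: {headroom_gb:.0f} GB")
```

Under these assumptions the weights alone take most of the quoted budget, which is why the troubleshooting options that free VRAM matter on smaller configurations.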
Launch command
Serve on a single H200 node with TP8 (vision encoder in data-parallel mode):
vllm serve moonshotai/Kimi-K2.7-Code \
    --tensor-parallel-size 8 \
    --mm-encoder-tp-mode data \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --enable-auto-tool-choice \
    --reasoning-parser kimi_k2
Key notes
- --tool-call-parser kimi_k2: required for enabling tool calling.
- --reasoning-parser kimi_k2: Kimi-K2.7-Code supports thinking mode only; pass this for correct reasoning processing.
- --mm-encoder-tp-mode data: runs the small MoonViT encoder data-parallel to avoid TP communication overhead.
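A checkpoint of this size may take a while to load, so it can help to wait until the server actually lists the model before sending requests. The following is a minimal sketch that polls the `/v1/models` endpoint of the OpenAI-compatible API using only the standard library. The function names, retry count, and delay are illustrative choices, not part of vLLM.

```python
import json
import time
import urllib.request


def list_model_ids(base_url: str, timeout: float = 5.0) -> list:
    """Return the model ids reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_served(
    model: str,
    base_url: str = "http://localhost:8000/v1",
    attempts: int = 60,
    delay: float = 10.0,
    fetch=list_model_ids,
    sleep=time.sleep,
) -> bool:
    """Poll until `model` is listed, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if model in fetch(base_url):
                return True
        except OSError:
            # Connection refused or timed out: the server is probably still loading.
            pass
        sleep(delay)
    return False


# Example (requires a running server):
# wait_until_served("moonshotai/Kimi-K2.7-Code")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live server.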
Client Usage
Once the vLLM server is running, consume it via the OpenAI-compatible API. The recommended
sampling for thinking mode is temperature=1.0 and top_p=0.95.
import time

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/moonshotai/Kimi-K2.7-Code/resolve/main/figures/kimi-logo.png"
                },
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=messages,
    max_tokens=4096,
    temperature=1.0,
    top_p=0.95,
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Reasoning: {response.choices[0].message.reasoning}")
print(f"Response: {response.choices[0].message.content}")
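Since the server is launched with `--tool-call-parser kimi_k2` and `--enable-auto-tool-choice`, the same endpoint can also be used for tool calling. The sketch below builds a request in the OpenAI chat-completions tool format and executes any tool calls found in the reply. It uses only the standard library so the payload shape is visible. The `count_lines` tool, the `REGISTRY` mapping, and the helper names are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

MODEL = "moonshotai/Kimi-K2.7-Code"


def count_lines(text: str) -> int:
    """Hypothetical tool: count the lines in a piece of text."""
    return len(text.splitlines())


TOOLS = [{
    "type": "function",
    "function": {
        "name": "count_lines",
        "description": "Count the number of lines in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]
REGISTRY = {"count_lines": count_lines}


def build_request(prompt: str) -> dict:
    """Chat-completions payload with tools and the recommended sampling."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 4096,
    }


def run_tool_calls(message: dict) -> list:
    """Execute each tool call in an assistant message; return tool-role messages."""
    results = []
    for call in message.get("tool_calls") or []:
        func = REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(func(**args)),
        })
    return results


def post_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload and return the first choice's message (needs a live server)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=3600) as resp:
        return json.load(resp)["choices"][0]["message"]
```

To continue the conversation, append the assistant message and the tool-role messages returned by `run_tool_calls` to `messages`, then send the request again.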
Troubleshooting
- OOM errors: Lower --gpu-memory-utilization or adjust TP/EP to match your GPU count.
- Vision encoder performance: Use --mm-encoder-tp-mode data to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
- Unique multimodal inputs: Pass --mm-processor-cache-gb 0 to avoid caching overhead. For repeated inputs, --mm-processor-cache-type shm uses host shared memory for better performance at high TP settings.
- Text-only workloads: Pass --language-model-only to skip loading MoonViT and free VRAM for KV cache.
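The troubleshooting options above combine with the base launch command in a few predictable ways, which a small helper can make explicit. This is a sketch, not a vLLM API: `build_serve_command` and its parameters are invented for illustration, and it only emits flags that appear in this guide. Dropping the multimodal flags in text-only mode is an assumption made here for tidiness.

```python
import shlex


def build_serve_command(
    model="moonshotai/Kimi-K2.7-Code",
    tensor_parallel_size=8,
    text_only=False,
    gpu_memory_utilization=None,
    mm_cache_gb=None,
    mm_cache_type=None,
):
    """Assemble the `vllm serve` argument list from the options in this guide."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--tool-call-parser", "kimi_k2",
        "--enable-auto-tool-choice",
        "--reasoning-parser", "kimi_k2",
    ]
    if text_only:
        # Skip loading MoonViT; multimodal flags are omitted (assumption).
        cmd.append("--language-model-only")
    else:
        cmd += ["--mm-encoder-tp-mode", "data"]
        if mm_cache_gb is not None:
            cmd += ["--mm-processor-cache-gb", str(mm_cache_gb)]
        if mm_cache_type is not None:
            cmd += ["--mm-processor-cache-type", mm_cache_type]
    if gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return cmd


# Text-only serving with a lower memory-utilization setting.
print(shlex.join(build_serve_command(text_only=True, gpu_memory_utilization=0.85)))
```

The returned list can be passed to `subprocess.run` directly, or printed with `shlex.join` and pasted into a shell.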