moonshotai/Kimi-K2.7-Code

Coding-focused agentic MoE built on Kimi-K2.6, tuned for long-horizon software-engineering tasks with thinking-only reasoning, tool calling, and vision-language input

1T MoE coding agent with ~30% fewer thinking tokens than Kimi-K2.6

MoE | 1T / 32B | 262,144 ctx | vLLM 0.19.1+ | multimodal | text

Overview

Kimi-K2.7-Code is a coding-focused agentic model built on Kimi-K2.6. It improves real-world, long-horizon coding performance, with stronger end-to-end task completion across complex software-engineering workflows. It is also more token-efficient: thinking-token usage drops by ~30% relative to Kimi-K2.6.

It shares the same architecture as Kimi-K2.5 / Kimi-K2.6 (1T-parameter MoE, 32B activated, DeepSeek-V3 backbone with MLA attention and a MoonViT vision encoder), so the K2.6 deployment method can be reused directly. Kimi-K2.7-Code runs in thinking mode only with preserve_thinking forced on; keep the reasoning parser enabled.

Prerequisites

  • vLLM version: >= 0.19.1 (manually verified). The model is also available in the vLLM nightly wheel, but nightly builds are experimental.
  • Hardware (INT4): 8x H200 GPUs (verified), or equivalent aggregate VRAM (~640 GB)

Install vLLM into a fresh virtual environment:

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto
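
After installing, it can help to confirm that the environment actually picked up a new enough vLLM. The following is a minimal sketch, not part of the original recipe: the helper names `release_tuple` and `meets_minimum` are hypothetical, and the comparison only looks at the leading numeric release segment, so pre-release and dev suffixes are ignored.

```python
import re
from importlib import metadata

MIN_VLLM = "0.19.1"  # minimum version stated in the prerequisites


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release segment, e.g. '0.19.1.dev3' -> (0, 19, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = MIN_VLLM) -> bool:
    """Compare release segments only; suffixes such as 'rc1' or '.dev3' are ignored."""
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
    else:
        print(f"vllm {installed}: meets minimum = {meets_minimum(installed)}")
```

Because suffixes are dropped, a nightly build that reports a release segment of 0.19.1 or later passes this check even though nightly builds are experimental.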

Launch command

Serve on a single H200 node with TP8 (vision encoder in data-parallel mode):

vllm serve moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --trust-remote-code \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2
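
Loading a model of this size takes a while, so a client script may want to wait until the server responds before sending requests. The sketch below polls the server's `/health` endpoint using only the standard library. It assumes the default port 8000 (the same one used in the client example in this guide); the helper names, the 30-minute deadline, and the 10-second interval are illustrative choices, not values from the recipe.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/health",  # assumed default host and port
    deadline_s: float = 1800.0,                 # illustrative, tune to your load time
    interval_s: float = 10.0,
    probe=http_ok,
    sleep=time.sleep,
) -> bool:
    """Poll `probe(url)` every `interval_s` seconds until it succeeds or the deadline passes."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server; in normal use, `wait_until_ready()` with no arguments is enough.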

Key notes

  • --tool-call-parser kimi_k2: required for tool calling. Together with --enable-auto-tool-choice, it lets the model decide when to emit a tool call.
  • --reasoning-parser kimi_k2: Kimi-K2.7-Code supports thinking mode only; pass this so the reasoning trace is parsed separately from the final answer.
  • --mm-encoder-tp-mode data: runs the small MoonViT encoder data-parallel to avoid TP communication overhead.
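
To illustrate what the tool-calling flags above enable, here is a hedged sketch of a client-side round trip through the OpenAI-compatible API. The `count_lines` tool, the `REGISTRY` mapping, and the `run_tool_calls` and `request_with_tools` helpers are all hypothetical names introduced for this example; only the server address, model name, and sampling values come from this guide.

```python
import json


def count_lines(text: str) -> int:
    """Hypothetical example tool: count the lines in a block of text."""
    return len(text.splitlines())


# OpenAI-style tool schema describing the hypothetical tool above.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "count_lines",
            "description": "Count the lines in a block of text.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }
]
REGISTRY = {"count_lines": count_lines}


def run_tool_calls(tool_calls, registry=REGISTRY):
    """Execute each parsed tool call and return OpenAI-style `tool` messages."""
    results = []
    for call in tool_calls or []:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        output = registry[call.function.name](**args)
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results


def request_with_tools():
    """Send one request with tools attached; needs the `openai` package and a running server."""
    from openai import OpenAI  # imported lazily so the helpers above stay dependency-free

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)
    response = client.chat.completions.create(
        model="moonshotai/Kimi-K2.7-Code",
        messages=[{"role": "user", "content": "How many lines are in:\nalpha\nbeta\ngamma"}],
        tools=TOOLS,
        tool_choice="auto",
        temperature=1.0,
        top_p=0.95,
    )
    message = response.choices[0].message
    return run_tool_calls(message.tool_calls)
```

Calling `request_with_tools()` against a running server returns the tool messages that would be appended to the conversation for the next turn. If the model answers directly without calling a tool, `message.tool_calls` is empty and the list is empty too.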

Client Usage

Once the vLLM server is running, consume it via the OpenAI-compatible API. The recommended sampling for thinking mode is temperature=1.0 and top_p=0.95.

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/moonshotai/Kimi-K2.7-Code/resolve/main/figures/kimi-logo.png"
                },
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

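# Time the full request, including the model's thinking phase.
start = time.time()
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.7-Code",
    messages=messages,
    max_tokens=4096,
    temperature=1.0,  # recommended sampling for thinking mode
    top_p=0.95,
)
print(f"Response time: {time.time() - start:.2f}s")
# The reasoning parser separates the thinking trace from the final answer.
print(f"Reasoning: {response.choices[0].message.reasoning}")
print(f"Response: {response.choices[0].message.content}")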
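
Since the model always thinks before answering, streaming can make long responses easier to follow. The sketch below accumulates the reasoning trace and the final answer separately from a stream of chunks. It assumes that streamed deltas expose the reasoning text under `reasoning`, mirroring the non-streaming field used above, and falls back to `reasoning_content`, the name used by older vLLM releases; verify which one your server emits. The helper names are hypothetical.

```python
def collect_stream(chunks):
    """Accumulate (reasoning, content) text from an iterable of chat-completion chunks."""
    reasoning_parts = []
    content_parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks (e.g. usage-only) carry no choices
        delta = chunk.choices[0].delta
        # Assumed field names: `reasoning` first, `reasoning_content` as a fallback.
        piece = getattr(delta, "reasoning", None) or getattr(delta, "reasoning_content", None)
        if piece:
            reasoning_parts.append(piece)
        text = getattr(delta, "content", None)
        if text:
            content_parts.append(text)
    return "".join(reasoning_parts), "".join(content_parts)


def stream_once(prompt: str):
    """Send one streaming request; needs the `openai` package and a running server."""
    from openai import OpenAI  # imported lazily so collect_stream has no dependency

    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1", timeout=3600)
    stream = client.chat.completions.create(
        model="moonshotai/Kimi-K2.7-Code",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096,
        temperature=1.0,
        top_p=0.95,
        stream=True,
    )
    return collect_stream(stream)
```

A call such as `reasoning, answer = stream_once("Write a function that reverses a string.")` returns both pieces once the stream ends. Printing each piece inside the loop instead would give incremental output.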

Troubleshooting

  • OOM errors: Lower --gpu-memory-utilization, or adjust the tensor-parallel (TP) and expert-parallel (EP) sizes to match your GPU count.
  • Vision encoder performance: Use --mm-encoder-tp-mode data to run the vision encoder in data-parallel mode. The encoder is small, so TP adds communication overhead with little gain.
  • Unique multimodal inputs: Pass --mm-processor-cache-gb 0 to avoid caching overhead. For repeated inputs, --mm-processor-cache-type shm uses host shared memory for better performance at high TP settings.
  • Text-only workloads: Pass --language-model-only to skip loading MoonViT and free VRAM for KV cache.
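
When diagnosing the OOM item above, a back-of-the-envelope per-GPU budget can show how much room is left for KV cache. The sketch below is a rough estimate under loudly stated assumptions: every one of the 1T parameters is stored at 4 bits (0.5 bytes), weights shard evenly across 8 GPUs, each H200 has 141 GB, and --gpu-memory-utilization is at vLLM's default of 0.9. It ignores quantization scales, any layers kept at higher precision, the vision encoder, and activation memory, so real headroom is smaller.

```python
def per_gpu_budget_gb(
    total_params: float = 1.0e12,         # 1T parameters (from the model description)
    bytes_per_param: float = 0.5,         # assumption: all weights stored as INT4
    num_gpus: int = 8,                    # TP8, as in the launch command
    gpu_mem_gb: float = 141.0,            # assumption: H200 with 141 GB
    gpu_memory_utilization: float = 0.9,  # vLLM default
) -> dict:
    """Rough per-GPU memory split in decimal GB; an estimate, not a measurement."""
    weights_gb = total_params * bytes_per_param / 1e9 / num_gpus
    usable_gb = gpu_mem_gb * gpu_memory_utilization
    return {
        "weights_gb": weights_gb,
        "usable_gb": usable_gb,
        "headroom_gb": usable_gb - weights_gb,  # left for KV cache and activations
    }


if __name__ == "__main__":
    for name, value in per_gpu_budget_gb().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the weights take roughly 62.5 GB per GPU. Changing `num_gpus` or `gpu_mem_gb` shows how quickly the headroom shrinks on smaller configurations.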
