# moonshotai/Kimi-K2-Thinking

Kimi-K2-Thinking is an advanced reasoning MoE model with native INT4 QAT weights, designed for long-horizon agent workflows that interleave chain-of-thought reasoning with tool calls.
## Overview
Kimi-K2-Thinking is an advanced trillion-parameter MoE model created by Moonshot AI, with these highlights:
- Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- Native INT4 Quantization: Quantization-Aware Training (QAT) delivers lossless 2x speed-up in low-latency mode.
- Stable Long-Horizon Agency: Maintains coherent goal-directed behavior across 200-300 consecutive tool invocations, surpassing prior models that degrade after 30-50 steps.
## Prerequisites
- Hardware: 8x H200 or 8x H20 GPUs
- vLLM: Current stable release
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
## Launching Kimi-K2-Thinking with vLLM

### Low-Latency Scenarios (TP8)
```bash
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```
The `--reasoning-parser` flag specifies the parser used to extract reasoning content from the model output.
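With the server running, requests go through vLLM's OpenAI-compatible chat API. As a sketch, the snippet below only constructs a request body (the `get_weather` tool and its schema are hypothetical, invented for illustration) to show the shape that `--enable-auto-tool-choice` and the `kimi_k2` parsers operate on; when `--reasoning-parser kimi_k2` is set, the response message carries the extracted reasoning in a separate `reasoning_content` field alongside any `tool_calls`.

```python
import json

# Hypothetical tool definition, following the OpenAI function-calling
# schema that --enable-auto-tool-choice expects.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# POST this body to http://localhost:8000/v1/chat/completions once the
# server above is up; the returned message exposes `reasoning_content`
# (extracted by the kimi_k2 reasoning parser) separately from the
# final answer and any tool calls.
print(json.dumps(payload, indent=2))
```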
### High-Throughput Scenarios (TP8+DCP8)

vLLM supports Decode Context Parallel (DCP), which provides significant benefits in high-throughput scenarios. Enable it by adding `--decode-context-parallel-size 8`:
```bash
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```
## Metrics (GSM8K)
| Config | exact_match (flexible) | exact_match (strict) |
|---|---|---|
| TP8 | 0.9416 | 0.9409 |
| TP8+DCP8 | 0.9386 | 0.9371 |
## Benchmarking
We used the following script to benchmark moonshotai/Kimi-K2-Thinking on 8xH200:
```bash
vllm bench serve \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 4000 \
  --request-rate 100 \
  --num-prompts 1000 \
  --trust-remote-code
```
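For a sense of scale, the flags above define a fairly heavy workload; a quick back-of-envelope calculation from those same numbers:

```python
# Workload implied by the benchmark flags above.
input_len, output_len = 8000, 4000  # random input/output lengths (tokens)
num_prompts = 1000                  # total requests

tokens_per_request = input_len + output_len
total_tokens = tokens_per_request * num_prompts
print(tokens_per_request, total_tokens)  # 12000 tokens/request, 12,000,000 total
```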
## DCP Gain Analysis
| Metric | TP8 | TP8+DCP8 | Change | Improvement |
|---|---|---|---|---|
| Request throughput (req/s) | 1.25 | 1.57 | +0.32 | +25.6% |
| Output token throughput (tok/s) | 485.78 | 695.13 | +209.35 | +43.1% |
| Mean TTFT (s) | 271.2 | 227.8 | -43.4 | +16.0% |
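The percentage columns follow directly from the raw measurements; recomputing them as a sanity check (note that for TTFT, "improvement" means a relative *reduction* in latency):

```python
# Raw numbers from the table above.
tp8 = {"req_s": 1.25, "tok_s": 485.78, "ttft_s": 271.2}
dcp8 = {"req_s": 1.57, "tok_s": 695.13, "ttft_s": 227.8}

req_gain = (dcp8["req_s"] - tp8["req_s"]) / tp8["req_s"] * 100      # higher is better
tok_gain = (dcp8["tok_s"] - tp8["tok_s"]) / tp8["tok_s"] * 100      # higher is better
ttft_gain = (tp8["ttft_s"] - dcp8["ttft_s"]) / tp8["ttft_s"] * 100  # lower TTFT is better

print(round(req_gain, 1), round(tok_gain, 1), round(ttft_gain, 1))  # 25.6 43.1 16.0
```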
DCP multiplies the GPU KV cache capacity by `dcp_world_size`:
- TP8 KV cache: 715,072 tokens
- TP8+DCP8 KV cache: 5,721,088 tokens (8x)
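The capacity scaling can be checked directly from the two reported figures; DCP shards the KV cache across the decode-context-parallel group, so capacity grows roughly `dcp_world_size`-fold (the ratio is not exactly 8 due to per-rank block rounding):

```python
# KV cache capacities reported by the two launch configurations above.
tp8_tokens = 715_072
dcp8_tokens = 5_721_088

scale = dcp8_tokens / tp8_tokens
print(round(scale, 2))  # ~8x, matching dcp_world_size=8
```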
Enabling DCP delivers strong advantages (43% faster token generation, 26% higher request throughput) with only a marginal GSM8K accuracy drop. Read the DCP doc and try it in your LLM workloads.