# moonshotai/Kimi-K2-Thinking

Kimi-K2-Thinking is an advanced reasoning MoE model with native INT4 QAT weights, designed for long-horizon agent workflows that interleave chain-of-thought reasoning with tool calls.
## Overview
Kimi-K2-Thinking is an advanced trillion-parameter MoE model created by Moonshot AI, with these highlights:
- Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- Native INT4 Quantization: Quantization-Aware Training (QAT) delivers lossless 2x speed-up in low-latency mode.
- Stable Long-Horizon Agency: Maintains coherent goal-directed behavior across 200-300 consecutive tool invocations, surpassing prior models that degrade after 30-50 steps.
## Prerequisites
- Hardware: 8x H200 or 8x H20 GPUs
- vLLM: Current stable release
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
```
## Launching Kimi-K2-Thinking with vLLM

### Low-Latency Scenarios (TP8)
```bash
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```
The `--reasoning-parser` flag specifies the parser used to extract reasoning content from the model output.
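With the server running, requests go through vLLM's OpenAI-compatible chat API. As a sketch, the snippet below only constructs a request body (the `get_weather` tool and its schema are hypothetical, invented for illustration) to show the shape that `--enable-auto-tool-choice` and the `kimi_k2` parsers operate on; when `--reasoning-parser kimi_k2` is set, the response message carries the extracted reasoning in a separate `reasoning_content` field alongside any `tool_calls`.

```python
import json

# Hypothetical tool definition, following the OpenAI function-calling
# schema that --enable-auto-tool-choice expects.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# POST this body to http://localhost:8000/v1/chat/completions once the
# server above is up; the returned message exposes `reasoning_content`
# (extracted by the kimi_k2 reasoning parser) separately from the
# final answer and any tool calls.
print(json.dumps(payload, indent=2))
```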
### High-Throughput Scenarios (TP8+DCP8)

vLLM supports Decode Context Parallel (DCP), which provides significant benefits in high-throughput scenarios. Enable it by adding `--decode-context-parallel-size 8`:
```bash
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```
## Metrics (GSM8K)
| Config | exact_match (flexible) | exact_match (strict) |
|---|---|---|
| TP8 | 0.9416 | 0.9409 |
| TP8+DCP8 | 0.9386 | 0.9371 |
## Benchmarking
We used the following script to benchmark moonshotai/Kimi-K2-Thinking on 8xH200:
```bash
vllm bench serve \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 4000 \
  --request-rate 100 \
  --num-prompts 1000 \
  --trust-remote-code
```
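For a sense of scale, the flags above define a fairly heavy workload; a quick back-of-envelope calculation from those same numbers:

```python
# Workload implied by the benchmark flags above.
input_len, output_len = 8000, 4000  # random input/output lengths (tokens)
num_prompts = 1000                  # total requests

tokens_per_request = input_len + output_len
total_tokens = tokens_per_request * num_prompts
print(tokens_per_request, total_tokens)  # 12000 tokens/request, 12,000,000 total
```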
## DCP Gain Analysis
| Metric | TP8 | TP8+DCP8 | Change | Improvement |
|---|---|---|---|---|
| Request throughput (req/s) | 1.25 | 1.57 | +0.32 | +25.6% |
| Output token throughput (tok/s) | 485.78 | 695.13 | +209.35 | +43.1% |
| Mean TTFT (s) | 271.2 | 227.8 | -43.4 | +16.0% |
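The percentage columns follow directly from the raw measurements; recomputing them as a sanity check (note that for TTFT, "improvement" means a relative *reduction* in latency):

```python
# Raw numbers from the table above.
tp8 = {"req_s": 1.25, "tok_s": 485.78, "ttft_s": 271.2}
dcp8 = {"req_s": 1.57, "tok_s": 695.13, "ttft_s": 227.8}

req_gain = (dcp8["req_s"] - tp8["req_s"]) / tp8["req_s"] * 100      # higher is better
tok_gain = (dcp8["tok_s"] - tp8["tok_s"]) / tp8["tok_s"] * 100      # higher is better
ttft_gain = (tp8["ttft_s"] - dcp8["ttft_s"]) / tp8["ttft_s"] * 100  # lower TTFT is better

print(round(req_gain, 1), round(tok_gain, 1), round(ttft_gain, 1))  # 25.6 43.1 16.0
```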
DCP multiplies the GPU KV cache capacity by `dcp_world_size`:
- TP8 KV cache: 715,072 tokens
- TP8+DCP8 KV cache: 5,721,088 tokens (8x)
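The capacity scaling can be checked directly from the two reported figures; DCP shards the KV cache across the decode-context-parallel group, so capacity grows roughly `dcp_world_size`-fold (the ratio is not exactly 8 due to per-rank block rounding):

```python
# KV cache capacities reported by the two launch configurations above.
tp8_tokens = 715_072
dcp8_tokens = 5_721_088

scale = dcp8_tokens / tp8_tokens
print(round(scale, 2))  # ~8x, matching dcp_world_size=8
```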
Enabling DCP delivers strong advantages (43% faster token generation, 26% higher request throughput) with only a marginal GSM8K accuracy drop. Read the DCP doc and try it in your LLM workloads.