vLLM/Recipes
Moonshot AI

moonshotai/Kimi-Linear-48B-A3B-Instruct

Kimi-Linear is a 48B-parameter instruction-tuned MoE model (~3B activated) with a linear-attention variant supporting very long context (1M tokens).

MoE · 48B total / ~3B active · 1,048,576-token context · vLLM 0.11.2+ · text
Guide

Overview

Kimi-Linear is Moonshot AI's 48B-parameter instruction-tuned MoE model (A3B indicates ~3B active parameters per token) featuring a linear-attention variant that enables very long context windows (up to 1,048,576 tokens).

Prerequisites

  • Hardware: 4 or 8 GPUs on a single node
  • vLLM: 0.11.2 recommended. Avoid vLLM 0.12.0, which fails to load Kimi-Linear with a known bug: MLAModules.__init__() missing 1 required positional argument: 'indexer_rotary_emb'.
uv venv
source .venv/bin/activate
# Install a stable version (avoid 0.12.0)
uv pip install vllm==0.11.2 --torch-backend auto
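If you want scripts to guard against the broken release, version strings can be compared with sort -V. The snippet below is a minimal sketch: the hard-coded INSTALLED value is a placeholder standing in for the actual installed version.

```shell
# Refuse to proceed when the installed vLLM sorts at or above the broken 0.12.0.
# INSTALLED is a placeholder; in a real script, read it from the environment:
#   INSTALLED=$(python -c "import vllm; print(vllm.__version__)")
INSTALLED="0.11.2"
BAD="0.12.0"
if [ "$(printf '%s\n' "$INSTALLED" "$BAD" | sort -V | head -n1)" = "$BAD" ]; then
  echo "vLLM $INSTALLED hits the Kimi-Linear MLAModules bug; pin vllm==0.11.2" >&2
  exit 1
fi
echo "vLLM $INSTALLED OK"
```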

Running Kimi-Linear

The following snippets assume 4 or 8 GPUs on a single node.

4-GPU Tensor Parallel

vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code

8-GPU Tensor Parallel

vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 1048576 \
  --trust-remote-code

If you see out-of-memory (OOM) errors, reduce --max-model-len (e.g. to 65536) or increase --gpu-memory-utilization (up to 0.95).
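For example, a lower-memory variant of the 4-GPU command above; the 65536-token window and 0.95 utilization are illustrative values from the note, not tuned recommendations:

```shell
# Lower-memory serving sketch: smaller context window, higher KV-cache budget.
vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code
```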

Client Usage

Once the server is up, test it with:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/Kimi-Linear-48B-A3B-Instruct","messages":[{"role":"user","content":"Hello!"}]}'

Troubleshooting

  • MLAModules.__init__() missing 1 required positional argument: 'indexer_rotary_emb': Known bug in vLLM 0.12.0 affecting Kimi-Linear. Pin to vllm==0.11.2 instead.
  • OOM: Reduce --max-model-len or increase --gpu-memory-utilization up to 0.95.

References