JetBrains/Mellum2-12B-A2.5B-Thinking
JetBrains' reasoning-augmented code MoE (12B total / 2.5B active) that emits explicit <think> chains for debugging, planning, and agentic coding
69.9 LiveCodeBench v6, 58.4 AIME — fits on a single GPU
Overview
Mellum2-12B-A2.5B-Thinking is JetBrains' reasoning-augmented code assistant. It uses a
Mixture-of-Experts architecture with 64 experts (8 activated per token), for 12B total
parameters and 2.5B active, and combines sliding-window and full-attention layers for a
131,072-token context. The model emits its chain-of-thought inside <think>...</think>
blocks before the final answer, making it suited to complex debugging, multi-step
planning, and agentic workflows. For direct, low-latency answers without reasoning
traces, use the Instruct variant instead.
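To make the output format described above concrete, here is a minimal sketch of separating the reasoning from the final answer in raw model text. It assumes the text contains a single <think>...</think> block; split_think is a hypothetical helper written for illustration, not part of vLLM or any JetBrains tooling.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no <think> block is present, reasoning is an empty string.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the final answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>1024 = 2**10, so yes.</think>Yes, 1024 is a power of 2."
reasoning, answer = split_think(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

This kind of manual split is only relevant when the server returns the raw text; a server-side reasoning parser may already separate the two.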
Prerequisites
- Hardware: a single H200, H100, or A100 is sufficient (~29 GB at bf16)
- vLLM nightly: MellumForCausalLM support landed after v0.22.0 and is not yet in a stable release. Install the nightly wheels until the next tagged release ships.
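As a rough sanity check on the hardware note above, the weight footprint can be estimated from the parameter count. This is only a sketch: it uses the nominal 12B and 2.5B figures and assumes 2 bytes per bf16 parameter, and weight_memory_gb is a hypothetical helper. In a Mixture-of-Experts model every expert has to be resident in memory, so the total parameter count, not the active count, drives the footprint.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes); bf16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


total_gb = weight_memory_gb(12e9)    # all experts must be loaded
active_gb = weight_memory_gb(2.5e9)  # parameters touched per token; affects compute, not residency

print(f"total weights:  {total_gb:.1f} GB")
print(f"active weights: {active_gb:.1f} GB")
```

The nominal estimate comes out below the ~29 GB quoted above; the guide does not say what accounts for the difference.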
Install vLLM (nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
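After installing, it can be useful to confirm which vLLM build the active environment actually resolves to. The sketch below uses only the Python standard library; installed_version is a hypothetical helper name, and how a nightly version string looks is not specified in this guide.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "vllm") -> Optional[str]:
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print("vllm version:", found if found is not None else "not installed")
```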
Launch command
# With reasoning (recommended for the Thinking checkpoint)
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
  --max-model-len 131072 \
  --reasoning-parser qwen3

# Add tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
  --max-model-len 131072 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
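Model loading can take a while, so a script that starts the server may want to wait until it is ready before sending requests. The following is a minimal sketch that polls the server's /health endpoint using only the standard library. The base URL assumes the default port 8000 used elsewhere in this guide, and wait_for_server, timeout_s, and poll_s are hypothetical names chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,
    poll_s: float = 2.0,
) -> bool:
    """Poll GET {base_url}/health until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused, timeout, or HTTP error).
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller would typically run wait_for_server() once after launching vllm serve and only proceed if it returns True.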
Client usage
JetBrains recommends sampling at temperature=0.6, top_p=0.95, top_k=20 for the
Thinking checkpoint.
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking",
    messages=[{"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."}],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k is not an OpenAI parameter, so it is passed via extra_body
)
print(resp.choices[0].message.content)
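When the server runs with a reasoning parser, the reasoning trace is typically returned in a separate field on the message rather than inside content. The sketch below reads that field defensively. The attribute names reasoning and reasoning_content are assumptions, since the name has varied across vLLM versions, and get_reasoning is a hypothetical helper. A SimpleNamespace stands in for resp.choices[0].message so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_reasoning(message: Any) -> Optional[str]:
    """Return the reasoning trace from a chat message, or None if absent.

    Tries both field names, because the exact name depends on the server version (assumption).
    """
    for name in ("reasoning", "reasoning_content"):
        value = getattr(message, name, None)
        if value:
            return value
    return None


# Stand-in for resp.choices[0].message from the client example above.
fake_message = SimpleNamespace(
    content="Yes, 1024 is a power of 2.",
    reasoning_content="1024 = 2**10, so it is a power of 2.",
)
print("reasoning:", get_reasoning(fake_message))
print("answer:", fake_message.content)
```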