JetBrains/Mellum2-12B-A2.5B-Thinking

JetBrains' reasoning-augmented code MoE (12B total / 2.5B active) that emits explicit <think> chains for debugging, planning, and agentic coding

69.9 LiveCodeBench v6, 58.4 AIME — fits on a single GPU


moe | 12B / 2.5B | 131,072 ctx | vLLM 0.23.0+ | text

Guide

Overview

Mellum2-12B-A2.5B-Thinking is JetBrains' reasoning-augmented code assistant. It uses a Mixture-of-Experts architecture with 64 experts (8 activated per token) — 12B total parameters, 2.5B active — combining sliding-window and full-attention layers for a 131,072-token context. The model emits its chain-of-thought inside <think>...</think> blocks before the final answer, making it suited to complex debugging, multi-step planning, and agentic workflows. For direct, low-latency answers without reasoning traces, use the Instruct variant instead.
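When the raw completion text is consumed directly (for example, without a server-side reasoning parser), the <think>...</think> block has to be separated from the final answer on the client. The following is a minimal sketch, assuming a single think block precedes the answer; the helper name split_think and the sample string are hypothetical and not taken from the model card.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    Assumes at most one think block, placed before the final answer.
    If no block is found, the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


# Invented sample output, for illustration only.
raw = "<think>1024 = 2**10, so yes.</think>Yes, 1024 is a power of 2."
reasoning, answer = split_think(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the trace separate makes it easy to log the reasoning for debugging while showing only the answer to end users.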

Prerequisites

Hardware: a single H200, H100, or A100 is sufficient (~29 GB at bf16)
vLLM nightly — MellumForCausalLM support landed after v0.22.0 and is not yet in a stable release. Install the nightly wheels until the next tagged release ships.
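As a rough sanity check on the memory figure above, the bf16 weight footprint can be estimated from the parameter count. This is a back-of-envelope sketch only: it assumes exactly 12e9 total and 2.5e9 active parameters at 2 bytes each, and it ignores KV cache, activations, and runtime overhead, none of which the source itemizes. The helper name weight_gb is hypothetical.

```python
BYTES_PER_BF16 = 2  # bf16 stores each parameter in 16 bits


def weight_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate weight size in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# All experts must be resident in GPU memory, so capacity is sized by the
# total parameter count, not by the parameters active for a given token.
total_gb = weight_gb(12e9)
active_gb = weight_gb(2.5e9)

print(f"weights resident in memory: ~{total_gb:.0f} GB")
print(f"weights touched per token:  ~{active_gb:.0f} GB")
```

The parameter counts are rounded, so the real footprint will differ from this estimate; treat the ~29 GB quoted above as the working number and leave additional headroom for the KV cache at long context lengths.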

Install vLLM (nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

Launch command

# With reasoning (recommended for the Thinking checkpoint)
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
  --max-model-len 131072 \
  --reasoning-parser qwen3

# Add tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Thinking \
  --max-model-len 131072 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
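With the tool-calling flags above enabled, a client can pass function schemas through the standard tools parameter of the OpenAI-compatible chat completions endpoint. The sketch below shows one possible round trip. The tool is_power_of_two and the helpers run_tool_call and ask_with_tools are hypothetical names introduced for illustration, and the server address assumes the default http://localhost:8000.

```python
import json

MODEL = "JetBrains/Mellum2-12B-A2.5B-Thinking"


def is_power_of_two(n: int) -> bool:
    """Hypothetical local tool exposed to the model."""
    return n > 0 and (n & (n - 1)) == 0


# JSON schema describing the tool in the OpenAI tools format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "is_power_of_two",
        "description": "Check whether an integer is a power of two.",
        "parameters": {
            "type": "object",
            "properties": {"n": {"type": "integer"}},
            "required": ["n"],
        },
    },
}]


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return its result as a JSON string."""
    args = json.loads(arguments_json)
    if name == "is_power_of_two":
        return json.dumps({"result": is_power_of_two(int(args["n"]))})
    raise ValueError(f"unknown tool: {name}")


def ask_with_tools(prompt: str) -> str:
    """Send a prompt, run any requested tools, and return the final answer."""
    from openai import OpenAI  # imported here so the helpers above run without it

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(
        model=MODEL, messages=messages, tools=TOOLS, tool_choice="auto"
    )
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content

    # Echo the assistant's tool request, then append one result per call.
    messages.append({
        "role": "assistant",
        "content": msg.content or "",
        "tool_calls": [
            {"id": tc.id, "type": "function",
             "function": {"name": tc.function.name,
                          "arguments": tc.function.arguments}}
            for tc in msg.tool_calls
        ],
    })
    for tc in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": run_tool_call(tc.function.name, tc.function.arguments),
        })
    final = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    return final.choices[0].message.content
```

Calling ask_with_tools("Is 1024 a power of 2?") requires the server from the second launch command to be running; run_tool_call can be exercised on its own without a server.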

Client usage

JetBrains recommends sampling at temperature=0.6, top_p=0.95, top_k=20 for the Thinking checkpoint.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Thinking",
    messages=[{"role": "user", "content": "Is 1024 a power of 2? Explain your reasoning."}],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
)
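# With --reasoning-parser enabled, content holds only the final answer;
# the <think> trace is returned in a separate reasoning field on the message.
print(resp.choices[0].message.content)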
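When the server is launched with --reasoning-parser, the reasoning trace is returned separately from the final answer. The sketch below reads both. The attribute name is an assumption: vLLM releases have exposed the trace as reasoning_content and, more recently, as reasoning, so the hypothetical helper reasoning_and_answer checks both. A stand-in object replaces the real response so the example runs without a server.

```python
from types import SimpleNamespace


def reasoning_and_answer(message):
    """Return (reasoning, answer) from a chat completion message.

    The reasoning field name varies by vLLM version (assumption), so
    both candidate attributes are checked; None means no trace was found.
    """
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning is None:
        reasoning = getattr(message, "reasoning", None)
    return reasoning, message.content


# Stand-in for resp.choices[0].message; the values are invented.
fake_message = SimpleNamespace(
    reasoning_content="1024 = 2**10, so it is a power of 2.",
    content="Yes, 1024 is a power of 2.",
)

reasoning, answer = reasoning_and_answer(fake_message)
print("reasoning:", reasoning)
print("answer:", answer)
```

Against a live server, pass resp.choices[0].message from the request above in place of fake_message.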
