JetBrains/Mellum2-12B-A2.5B-Instruct

JetBrains' instruction-tuned code MoE (12B total / 2.5B active) that answers directly without an externalized chain of thought — low-latency coding and tool use

78.4 EvalPlus, 67.1 MultiPL-E — direct answers, fits on a single GPU

moe | 12B / 2.5B | 131,072 ctx | vLLM 0.23.0+ | text

Guide

Overview

Mellum2-12B-A2.5B-Instruct is JetBrains' instruction-tuned code assistant. It shares the same Mixture-of-Experts backbone as the rest of the Mellum2 family — 64 experts (8 activated per token), 12B total / 2.5B active parameters, sliding-window + full-attention layers, 131,072-token context — but is post-trained (SFT + RLVR on math, coding, tool use, instruction following, reasoning, and knowledge) to answer directly, without an externalized chain of thought. For complex debugging, multi-step planning, or math/reasoning-heavy tasks where you want explicit reasoning traces, use the Thinking variant instead.

Prerequisites

Hardware: a single H200, H100, or A100 is enough (~29 GB at bf16).
vLLM: a nightly build. MellumForCausalLM support landed after v0.22.0 and is not yet in a stable release, so install the nightly wheels until the next tagged release ships.

Install vLLM (nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

Launch command

Unlike the Thinking checkpoint, Instruct does not emit <think> blocks, so no --reasoning-parser is needed.

# Plain serving
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072

# Add tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
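
Loading the weights can take a while, so it may help to wait for the server before sending requests. The following is a minimal sketch, not part of the JetBrains or vLLM instructions: it polls the OpenAI-compatible /v1/models endpoint and returns the served model ids. The base URL assumes vLLM's default host and port, and the helper names (served_model_ids, wait_until_ready) and timeout values are hypothetical choices for illustration.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM's default host and port


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url: str = BASE_URL,
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0) -> list:
    """Poll /v1/models until the server answers, then return the served model ids."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
                return served_model_ids(json.load(resp))
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Server is not accepting requests yet (e.g. still loading weights).
            time.sleep(poll_s)
    raise TimeoutError(f"server at {base_url} not ready after {timeout_s}s")


# Example (requires a running server):
#   print(wait_until_ready())
```

Once the call returns, the list should contain the model name passed to vllm serve, which is also the value to use for the model field in client requests.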

Client usage

JetBrains recommends sampling at temperature=0.6, top_p=0.95, top_k=20.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string."}],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(resp.choices[0].message.content)
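
When the server is started with --enable-auto-tool-choice and --tool-call-parser hermes, a client can pass tool definitions and execute whatever calls the model returns. The sketch below shows one round trip under stated assumptions: the get_line_count tool, the TOOLS schema, and the helper names are hypothetical and exist only for illustration, and the request reuses the sampling settings recommended above. The openai package is imported inside the function so the local helpers can be exercised without it.

```python
import json

MODEL = "JetBrains/Mellum2-12B-A2.5B-Instruct"
SAMPLING = {"temperature": 0.6, "top_p": 0.95, "extra_body": {"top_k": 20}}


def get_line_count(text: str) -> int:
    """Hypothetical tool: count the lines in a piece of text."""
    return len(text.splitlines())


# OpenAI-style function schema describing the hypothetical tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_line_count",
        "description": "Count the number of lines in the given text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string", "description": "Text to measure."}},
            "required": ["text"],
        },
    },
}]

LOCAL_TOOLS = {"get_line_count": get_line_count}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call locally; arguments arrive as a JSON-encoded string."""
    args = json.loads(arguments_json)
    return json.dumps(LOCAL_TOOLS[name](**args))


def ask_with_tools(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send a prompt, execute any requested tool calls, and return the final answer."""
    from openai import OpenAI  # lazy import: only needed when talking to the server

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(
        model=MODEL, messages=messages, tools=TOOLS, **SAMPLING)
    reply = first.choices[0].message
    if not reply.tool_calls:
        return reply.content  # the model answered directly

    # Echo the assistant turn, then attach one tool result per call.
    messages.append({
        "role": "assistant",
        "content": reply.content,
        "tool_calls": [{"id": c.id, "type": "function",
                        "function": {"name": c.function.name,
                                     "arguments": c.function.arguments}}
                       for c in reply.tool_calls],
    })
    for call in reply.tool_calls:
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": run_tool_call(call.function.name,
                                                  call.function.arguments)})
    final = client.chat.completions.create(
        model=MODEL, messages=messages, tools=TOOLS, **SAMPLING)
    return final.choices[0].message.content
```

With a running server, ask_with_tools("How many lines are in 'a\nb\nc'?") would exercise the full loop. Whether the model chooses to call the tool depends on the prompt, so the direct-answer branch is handled as well.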
