poolside/Laguna-M.1

Poolside's 225B total / 23B activated MoE coding model with global attention, native interleaved reasoning, and 256K context — designed for agentic coding and long-horizon work.

225B/23B-A MoE for agentic coding — 74.6% SWE-bench Verified

MoE | 225B / 23B | 262,144 ctx | vLLM 0.21.0+ | text

Overview

Laguna M.1 is Poolside's 225B-total / 23B-activated Mixture-of-Experts model purpose-built for agentic coding and long-horizon work. It is competitive with state-of-the-art open-weight and frontier models on SWE-bench Verified (74.6%), SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0.

Key features

Large sparse MoE: 70-layer transformer — the first 3 layers are dense SwiGLU, the remaining 67 are sparse MoE with 256 experts + 1 shared expert and top-k=16 routing with auxiliary-loss-free load balancing.
Global attention: global attention across all layers with 64 Q-heads, 8 KV-heads, head dim 128, and softplus attention output gating.
256K context: RoPE with YaRN extends the context window to 262,144 tokens.
Native interleaved reasoning: thinking blocks are emitted before and between tool calls. Preserved thinking (returning prior reasoning_content in the message history) is recommended for best agentic performance. Thinking is toggled per request via enable_thinking.
Tool calling: Poolside-specific tool-call protocol, parsed via poolside_v1.
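The attention figures above allow a rough KV-cache estimate. The sketch below assumes a BF16 (2-byte) KV cache and that all 70 layers store keys and values for the full context, which follows from global attention on every layer. Actual usage depends on vLLM's paged allocation and on any KV-cache quantization, so treat the numbers as a back-of-the-envelope figure.

```python
# Rough KV-cache sizing from the architecture figures listed above.
NUM_LAYERS = 70
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: BF16 KV cache


def kv_bytes_per_token() -> int:
    # Factor of 2 covers keys and values.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gib(num_tokens: int) -> float:
    return kv_bytes_per_token() * num_tokens / 2**30


print(kv_bytes_per_token())   # 286720 bytes per token
print(kv_gib(262_144))        # 70.0 GiB for one full-length sequence
```

Under these assumptions a single 262,144-token sequence needs about 70 GiB of KV cache in total, spread across the GPUs in the tensor-parallel group.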

Prerequisites

Laguna M.1 support landed in vLLM via PR #41129 (shared implementation with Laguna XS.2) and ships in vLLM 0.21.0 and later.

uv pip install -U 'vllm>=0.21.0'
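To confirm that the environment meets the minimum version before launching, a check along these lines may help. It is a minimal sketch: the helper names are hypothetical, and the parser only reads the leading numeric components, so a pre-release such as 0.21.0rc1 is treated the same as 0.21.0.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 21, 0)  # minimum version stated above


def release_tuple(ver: str) -> tuple:
    """Parse the leading numeric components, e.g. '0.21.0rc1' -> (0, 21, 0)."""
    parts = []
    for piece in ver.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def vllm_is_new_enough(installed: str) -> bool:
    return release_tuple(installed) >= MIN_VLLM


try:
    print("vLLM new enough:", vllm_is_new_enough(version("vllm")))
except PackageNotFoundError:
    print("vLLM is not installed")
```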

FP8 and NVFP4 quantized checkpoints are available at Laguna-M.1-FP8 and Laguna-M.1-NVFP4. Quantization is detected automatically from the checkpoint's quantization_config, so the launch command is identical apart from the model ID.

Launch command

Single node (8×H200, BF16)

vllm serve poolside/Laguna-M.1 \
  --tensor-parallel-size 8 \
  --served-model-name laguna \
  --enable-auto-tool-choice \
  --tool-call-parser poolside_v1 \
  --reasoning-parser poolside_v1 \
  --default-chat-template-kwargs '{"enable_thinking": true}'

The BF16 checkpoint (~450 GB of weights) fits on a single 8×H200 node. Use the FP8 variant to run on 4 GPUs, or the NVFP4 variant (Blackwell only) to fit within ~135 GB.
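The sizing above can be reproduced with simple arithmetic. The sketch below multiplies the 225B total parameter count by an assumed storage width per weight; it ignores quantization scales and any layers kept in higher precision, which presumably accounts for the NVFP4 checkpoint's stated ~135 GB being larger than the raw 4-bit estimate.

```python
TOTAL_PARAMS = 225e9  # total parameter count from the overview

# Assumed bytes per weight; scales and unquantized layers are not counted.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(fmt: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[fmt] / 1e9


def per_gpu_gb(fmt: str, tensor_parallel_size: int) -> float:
    """Weights held by each GPU when sharded evenly across the TP group."""
    return weight_gb(fmt) / tensor_parallel_size


for fmt in BYTES_PER_PARAM:
    print(fmt, weight_gb(fmt))   # BF16 450.0, FP8 225.0, NVFP4 112.5

print(per_gpu_gb("BF16", 8))     # 56.25
print(per_gpu_gb("FP8", 4))      # 56.25
```

Both the BF16 layout on 8 GPUs and the FP8 layout on 4 GPUs come to roughly 56 GB of weights per GPU, which leaves the rest of each H200's 141 GB for KV cache and activations.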

Controlling reasoning

Reasoning is enabled at the server when you pass --default-chat-template-kwargs '{"enable_thinking": true}' (included above). Poolside recommends preserved thinking for agentic coding — return the previous assistant turn's reasoning_content in the message history.
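A minimal sketch of preserved thinking on the client side follows. The helper name to_history_message is hypothetical, and the sketch assumes the server accepts a reasoning_content field on assistant messages in the history, as the recommendation above implies.

```python
from typing import Any, Dict, List, Optional


def to_history_message(content: str, reasoning_content: Optional[str]) -> Dict[str, Any]:
    """Build the assistant entry to append to the message history."""
    message: Dict[str, Any] = {"role": "assistant", "content": content}
    if reasoning_content:
        # Preserved thinking: send the prior reasoning back with the turn.
        message["reasoning_content"] = reasoning_content
    return message


history: List[Dict[str, Any]] = [
    {"role": "user", "content": "Add retries to fetch()."},
]
# After the first response, keep both the answer and its reasoning.
history.append(
    to_history_message("Done, see the patch.", "fetch() can fail on timeouts, so wrap it.")
)
history.append({"role": "user", "content": "Now add jitter."})
```

The resulting history list is what would be passed as messages on the next request. For a non-streaming response, the reasoning can be read with getattr(resp.choices[0].message, "reasoning_content", None), mirroring the streaming example.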

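from openai import OpenAI

# Point the OpenAI client at the local vLLM server. A placeholder key is fine
# when the server was started without --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="laguna",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a Python retry wrapper with exponential backoff."}],
    temperature=1.0,
    extra_body={"top_k": 20},  # vLLM-specific sampling parameters go in the request body
    stream=True,
)
for chunk in resp:
    if not chunk.choices:  # skip chunks that carry no choices
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens arrive in reasoning_content, the final answer in content.
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="")
    if delta.content:
        print(delta.content, end="")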

Recommended sampling (matching Poolside's benchmark runs): temperature=1.0, top_k=20, thinking enabled.

To disable thinking, omit --default-chat-template-kwargs at launch, or pass extra_body={"chat_template_kwargs": {"enable_thinking": False}} per request.
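The sketch below shows one way to combine the recommended sampling settings, the per-request thinking toggle, and a tool definition in a single request. The run_tests tool and the build_request and send helpers are hypothetical names used for illustration; the tool schema follows the standard OpenAI function format, and send assumes a server launched with --enable-auto-tool-choice and --tool-call-parser poolside_v1.

```python
from typing import Any, Dict

# Hypothetical tool, described in the standard OpenAI "function" schema.
RUN_TESTS_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(prompt: str, thinking: bool = True) -> Dict[str, Any]:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "laguna",  # matches --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
        "temperature": 1.0,
        # vLLM-specific fields share one extra_body dictionary.
        "extra_body": {
            "top_k": 20,
            "chat_template_kwargs": {"enable_thinking": thinking},
        },
    }


def send(prompt: str, thinking: bool = True):
    """Send the request to a local server and return any parsed tool calls."""
    from openai import OpenAI  # imported lazily so build_request works offline

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(**build_request(prompt, thinking))
    return resp.choices[0].message.tool_calls
```

Calling send("Run the tests under ./tests and summarize failures.") against a running server would return the parsed tool calls, if the model chose to make any. Passing thinking=False disables reasoning for that request only.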