poolside/Laguna-M.1
Poolside's 225B total / 23B activated MoE coding model with global attention, native interleaved reasoning, and 256K context — designed for agentic coding and long-horizon work.
225B/23B-A MoE for agentic coding — 74.6% SWE-bench Verified
Guide
Overview
Laguna M.1 is Poolside's 225B-total / 23B-activated Mixture-of-Experts model purpose-built for agentic coding and long-horizon work. It is competitive with state-of-the-art open-weight and frontier models on SWE-bench Verified (74.6%), SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0.
Key features
- Large sparse MoE: 70-layer transformer — the first 3 layers are dense SwiGLU, the remaining 67 are sparse MoE with 256 experts + 1 shared expert and top-k=16 routing with auxiliary-loss-free load balancing.
- Global attention: global attention across all layers with 64 Q-heads, 8 KV-heads, head dim 128, and softplus attention output gating.
- 256K context: RoPE with YaRN extends the context window to 262,144 tokens.
- Native interleaved reasoning: thinking blocks are emitted before and between tool calls, with preserved thinking (returning prior reasoning_content in history) recommended for best agentic performance. Toggled per-request via enable_thinking.
- Tool calling: Poolside-specific tool-call protocol, parsed via poolside_v1.
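The attention figures above are enough for a rough KV-cache estimate. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every one of the 70 layers caches keys and values for all 8 KV heads at 2 bytes per element (BF16), and it ignores any paging or block overhead in the serving engine. The function name kv_cache_bytes is made up for this example.

```python
# Architecture figures taken from the feature list above.
NUM_LAYERS = 70
NUM_KV_HEADS = 8
HEAD_DIM = 128
MAX_CONTEXT = 262_144


def kv_cache_bytes(num_tokens: int, bytes_per_element: int = 2) -> int:
    """Estimate KV-cache size for one sequence (assumes a BF16 cache by default)."""
    # Factor 2 covers the separate key and value tensors in each layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element
    return per_token * num_tokens


if __name__ == "__main__":
    print(f"per token:    {kv_cache_bytes(1) / 1024:.0f} KiB")
    print(f"full context: {kv_cache_bytes(MAX_CONTEXT) / 2**30:.0f} GiB")
```

Under these assumptions a single full-length 262,144-token sequence needs about 70 GiB of cache in total, which is worth budgeting for alongside the weights when choosing a GPU count.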
Prerequisites
Laguna M.1 support landed in vLLM via PR #41129 (shared implementation with Laguna XS.2) and ships in vLLM 0.21.0 and later.
uv pip install -U 'vllm>=0.21.0'
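To confirm that an existing environment already meets the 0.21.0 minimum, a small check along these lines may help. It is a sketch: the helper names are invented for this example, and it compares only the leading numeric part of the version string, so a pre-release such as 0.21.0rc1 is treated the same as 0.21.0.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 21, 0)  # minimum release named above


def release_tuple(version_string: str) -> tuple:
    """Return the leading numeric components, e.g. '0.21.1.dev3' -> (0, 21, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_laguna(version_string: str) -> bool:
    """True if the release is at least MIN_VLLM (pre-release tags are ignored)."""
    parts = release_tuple(version_string)
    parts = parts + (0,) * (3 - len(parts))  # pad '0.21' to (0, 21, 0)
    return parts[:3] >= MIN_VLLM


if __name__ == "__main__":
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        print("vllm is not installed")
    else:
        status = "ok" if supports_laguna(installed) else "too old for Laguna M.1"
        print(f"vllm {installed}: {status}")
```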
FP8 and NVFP4 quantized checkpoints are available at Laguna-M.1-FP8 and Laguna-M.1-NVFP4. Quantization is detected automatically from the checkpoint's quantization_config, so the launch command is identical apart from the model ID.
Launch command
Single node (8×H200, BF16)
vllm serve poolside/Laguna-M.1 \
--tensor-parallel-size 8 \
--served-model-name laguna \
--enable-auto-tool-choice \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--default-chat-template-kwargs '{"enable_thinking": true}'
The BF16 checkpoint (~450 GB of weights) fits on a single 8×H200 node. Use the FP8 variant to run on 4 GPUs, or the NVFP4 variant (Blackwell only) to fit within ~135 GB.
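The sizing above follows from simple bytes-per-parameter arithmetic, sketched below. The per-parameter sizes are nominal assumptions (2 bytes for BF16, 1 for FP8, 0.5 for NVFP4) and exclude quantization scales and any layers left unquantized, which is presumably why the nominal NVFP4 figure comes out below the ~135 GB quoted. The 141 GB capacity is the H200's HBM size; the helper names are illustrative only.

```python
TOTAL_PARAMS = 225_000_000_000  # 225B total parameters

# Nominal storage per parameter; scales and unquantized layers are not counted.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

H200_GB = 141  # HBM capacity of one H200


def weight_gb(precision: str) -> float:
    """Nominal weight footprint in GB (10^9 bytes) for a given precision."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(precision: str, num_gpus: int, gpu_gb: float = H200_GB) -> float:
    """Memory left across the node for KV cache and activations after loading weights."""
    return num_gpus * gpu_gb - weight_gb(precision)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_gb(precision):.1f} GB of weights")
    print(f"BF16 on 8 GPUs leaves {headroom_gb('bf16', 8):.0f} GB")
    print(f"FP8 on 4 GPUs leaves {headroom_gb('fp8', 4):.0f} GB")
```

The headroom figure is what remains for KV cache, activations, and runtime overhead, so it is an upper bound rather than usable cache capacity.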
Controlling reasoning
Reasoning is enabled at the server when you pass --default-chat-template-kwargs '{"enable_thinking": true}' (included above). Poolside recommends preserved thinking for agentic coding — return the previous assistant turn's reasoning_content in the message history.
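from openai import OpenAI

# Local vLLM server started with the launch command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="laguna",
    messages=[{"role": "user", "content": "Write a Python retry wrapper with exponential backoff."}],
    temperature=1.0,
    # top_k is a vLLM extension, so it goes in the request body, not the query string.
    extra_body={"top_k": 20},
    stream=True,
)

for chunk in resp:
    if not chunk.choices:  # skip chunks that carry no choices
        continue
    delta = chunk.choices[0].delta
    # Reasoning tokens stream separately from the final answer.
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="")
    if delta.content:
        print(delta.content, end="")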
Recommended sampling (matching Poolside's benchmark runs): temperature=1.0, top_k=20, thinking enabled.
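Preserved thinking, as recommended above, comes down to how the assistant turn is written back into the message history. The sketch below shows one way to do that, assuming the response message exposes reasoning_content and OpenAI-style tool_calls. The helper name assistant_history_entry, the read_file tool, and the SimpleNamespace stand-in are hypothetical and exist only so the example runs without a server.

```python
from types import SimpleNamespace


def assistant_history_entry(message) -> dict:
    """Convert a response message into a history entry that keeps its reasoning."""
    entry = {"role": "assistant", "content": message.content or ""}
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning:
        # Returning the prior reasoning is what "preserved thinking" refers to.
        entry["reasoning_content"] = reasoning
    tool_calls = getattr(message, "tool_calls", None)
    if tool_calls:
        entry["tool_calls"] = [
            {
                "id": call.id,
                "type": "function",
                "function": {
                    "name": call.function.name,
                    "arguments": call.function.arguments,
                },
            }
            for call in tool_calls
        ]
    return entry


# Stand-in for resp.choices[0].message from a non-streaming request.
fake_message = SimpleNamespace(
    content=None,
    reasoning_content="I should read the file before editing it.",
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(name="read_file", arguments='{"path": "app.py"}'),
        )
    ],
)

history = [{"role": "user", "content": "Fix the bug in app.py."}]
history.append(assistant_history_entry(fake_message))
# The tool result follows, linked to the call by its id.
history.append({"role": "tool", "tool_call_id": "call_1", "content": "print('hi')"})
```

The resulting history list can then be sent as messages on the next request, so the model sees its earlier reasoning next to the tool result.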
To disable thinking, omit --default-chat-template-kwargs at launch, or pass extra_body={"chat_template_kwargs": {"enable_thinking": False}} per request.
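The sampling settings and the per-request thinking toggle can be gathered into a single set of keyword arguments. This is a minimal sketch: laguna_request_kwargs is a hypothetical helper, and it relies on the OpenAI Python client forwarding extra_body fields in the request body, which is where vLLM reads top_k and chat_template_kwargs.

```python
def laguna_request_kwargs(enable_thinking: bool = True) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "temperature": 1.0,
        "extra_body": {
            # Non-OpenAI parameters travel in the request body.
            "top_k": 20,
            "chat_template_kwargs": {"enable_thinking": enable_thinking},
        },
    }


if __name__ == "__main__":
    # Example: the same sampling settings with thinking switched off for one request.
    print(laguna_request_kwargs(enable_thinking=False))
```

A call would then look like client.chat.completions.create(model="laguna", messages=messages, **laguna_request_kwargs(enable_thinking=False)), where client and messages are set up as in the streaming example.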