poolside/Laguna-XS-2.1

Poolside's 33B total / 3B activated MoE coding model with mixed sliding-window + global attention, native interleaved reasoning, and 256K context — designed for agentic coding.

33B/3B-A MoE for agentic coding with interleaved thinking and tool use

MoE · 33B total / 3B active · 262,144-token context · vLLM 0.21.0+ · text

Overview

Laguna XS-2.1 is Poolside's 33B-total / 3B-activated Mixture-of-Experts model purpose-built for agentic coding and long-horizon work. It combines mixed sliding-window + global attention (3:1 across 40 layers) with sigmoid per-head gating and FP8 KV cache, so it stays compact enough to run locally while supporting a 256K-token context. The 2.1 point release improves coding quality over XS.2 (SWE-bench Multilingual 57.7% → 63.1%, SWE-bench Verified 69.9% → 70.9%).
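The 3:1 attention mix can be pictured as a repeating four-layer pattern. The sketch below is illustrative only: the 40-layer total and the 3:1 ratio come from the description above, but the exact ordering (here, every fourth layer is global) is an assumption and may not match the real model.

```python
from collections import Counter

NUM_LAYERS = 40       # from the model description
SWA_PER_GLOBAL = 3    # 3:1 sliding-window to global ratio


def layer_types(num_layers: int = NUM_LAYERS,
                swa_per_global: int = SWA_PER_GLOBAL) -> list:
    """Return a hypothetical per-layer attention layout.

    Assumption: each group of (swa_per_global + 1) layers ends with one
    global-attention layer. The real ordering may differ.
    """
    period = swa_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "sliding_window"
        for i in range(num_layers)
    ]


layout = layer_types()
counts = Counter(layout)
# 30 sliding-window layers and 10 global layers under this assumption.
print(counts["sliding_window"], counts["global"])
```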

Key features

Mixed SWA + global attention: 30 sliding-window layers (window=512) interleaved with 10 global-attention layers, each with per-layer rotary scaling.
Native FP8 KV cache: KV cache is quantized to FP8 to reduce memory per token.
Interleaved reasoning: thinking blocks emitted between tool calls; toggled per-request via enable_thinking.
Tool calling: Poolside-specific XML-style tool-call protocol, parsed via poolside_v1.
256 experts + 1 shared expert with top-8 routing.
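A back-of-the-envelope estimate shows why the attention mix and the FP8 cache matter at long context. The sketch assumes that a sliding-window layer only needs to keep the last 512 positions cached while a global layer keeps every position. The KV head count and head dimension are hypothetical placeholders, not values from the model card, so only the ratios are meaningful.

```python
CONTEXT_LEN = 262_144     # maximum context from this guide
WINDOW = 512              # sliding-window size from the feature list
NUM_SWA_LAYERS = 30
NUM_GLOBAL_LAYERS = 10

# Hypothetical placeholders: replace with the values from the model's config.
NUM_KV_HEADS = 8
HEAD_DIM = 128


def cached_token_slots(context_len: int) -> int:
    """Token positions held in the KV cache, summed over all layers."""
    swa = NUM_SWA_LAYERS * min(context_len, WINDOW)
    glob = NUM_GLOBAL_LAYERS * context_len
    return swa + glob


def kv_cache_bytes(context_len: int, bytes_per_element: int) -> int:
    """Keys plus values for one sequence, ignoring allocator overhead."""
    per_slot = 2 * NUM_KV_HEADS * HEAD_DIM * bytes_per_element  # K and V
    return cached_token_slots(context_len) * per_slot


mixed = cached_token_slots(CONTEXT_LEN)
all_global = (NUM_SWA_LAYERS + NUM_GLOBAL_LAYERS) * CONTEXT_LEN
fp8_ratio = kv_cache_bytes(CONTEXT_LEN, 1) / kv_cache_bytes(CONTEXT_LEN, 2)

print(f"cached slots, mixed attention: {mixed:,}")
print(f"cached slots, all-global:      {all_global:,}")
print(f"FP8 cache size relative to a 16-bit cache: {fp8_ratio}")
```

Under these assumptions the mixed layout caches roughly a quarter of the positions an all-global stack would, and FP8 halves the bytes per cached element.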

Prerequisites

Laguna XS-2.1 support is available in vLLM 0.21.0 and later (via PR #41129).

pip

uv pip install -U 'vllm>=0.21.0'
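To confirm that the installed build meets the 0.21.0 minimum before launching, a small check along these lines can help. It is a sketch: `parse_release` and `vllm_is_new_enough` are hypothetical helper names, and the parser deliberately ignores pre-release and local suffixes, so `0.21.0rc1` is treated as `0.21.0`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

MIN_VLLM = (0, 21, 0)  # minimum version stated in this guide


def parse_release(v: str) -> tuple:
    """Parse the leading numeric release, e.g. '0.21.0rc1' -> (0, 21, 0)."""
    parts = []
    for piece in v.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '0.21' compares equal to '0.21.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def vllm_is_new_enough(installed: Optional[str] = None) -> bool:
    """True if the given (or installed) vLLM version is at least MIN_VLLM."""
    if installed is None:
        try:
            installed = version("vllm")
        except PackageNotFoundError:
            return False
    return parse_release(installed) >= MIN_VLLM


print("vLLM new enough:", vllm_is_new_enough())
```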

Docker

docker pull vllm/vllm-openai:latest

Launch command

Single GPU (H100/H200/B200, BF16)

vllm serve poolside/Laguna-XS-2.1 \
  --trust-remote-code \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser poolside_v1 \
  --reasoning-parser poolside_v1
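Loading the weights can take a while, so client scripts usually wait for the server before sending requests. The sketch below polls the server's `/health` endpoint; the default URL, timeout, and polling interval are assumptions chosen for illustration, and `wait_until_ready` is a hypothetical helper name.

```python
import http.client
import time
import urllib.request
from typing import Callable


def probe_health(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 1800.0,
    poll_s: float = 5.0,
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the probe succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Example (not executed here):
# if not wait_until_ready():
#     raise SystemExit("vLLM server did not become healthy in time")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.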

Docker

docker run -itd --name laguna-xs21 \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
    --model poolside/Laguna-XS-2.1 \
    --trust-remote-code \
    --max-model-len 262144 \
    --enable-auto-tool-choice \
    --tool-call-parser poolside_v1 \
    --reasoning-parser poolside_v1 \
    --host 0.0.0.0 --port 8000

Controlling reasoning

Reasoning is off by default in the chat template. Enable it per-request:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="poolside/Laguna-XS-2.1",
    messages=[{"role": "user", "content": "Write a Python retry wrapper with exponential backoff."}],
    # vLLM-specific options (chat template kwargs, top_k) go in the request body.
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "top_k": 20,
    },
    temperature=0.7,
    top_p=1.0,
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)

To turn reasoning on by default for every request, launch the server with --default-chat-template-kwargs '{"enable_thinking": true}'.
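With --enable-auto-tool-choice and --tool-call-parser poolside_v1 set at launch, tool calls come back in the standard OpenAI `tool_calls` field, so a plain request loop is enough to drive them. The following is a minimal sketch, not Poolside's reference client: `word_count` is a made-up tool, and `run_tool_call` and `chat_with_tools` are hypothetical helper names. It expects an `openai.OpenAI` client like the one created above.

```python
import json

MODEL = "poolside/Laguna-XS-2.1"


def word_count(text: str) -> int:
    """Toy tool used only for illustration."""
    return len(text.split())


LOCAL_TOOLS = {"word_count": word_count}

TOOLS = [{
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count whitespace-separated words in a string.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute a local tool and return its result as a JSON string."""
    args = json.loads(arguments_json)
    return json.dumps({"result": LOCAL_TOOLS[name](**args)})


def chat_with_tools(client, user_prompt: str, max_rounds: int = 4):
    """Loop until the model answers without requesting another tool call."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOLS,
            tool_choice="auto",
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        # Echo the assistant turn, then append one tool result per call.
        messages.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [
                {"id": tc.id, "type": "function",
                 "function": {"name": tc.function.name,
                              "arguments": tc.function.arguments}}
                for tc in msg.tool_calls
            ],
        })
        for tc in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": run_tool_call(tc.function.name, tc.function.arguments),
            })
    return None  # gave up after max_rounds


# Example (not executed here):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# print(chat_with_tools(client, "How many words are in 'to be or not to be'?"))
```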

Speculative decoding (DFlash)

Speculative decoding attaches Poolside's DFlash draft model, a 5-layer Llama-style speculator that proposes up to 15 tokens per step. The reported end-to-end speedup is 1.67x–2.64x across evaluation datasets, with a mean accepted length of 3.55–4.57 tokens per step.
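To make "accepted length" concrete, here is a sketch of generic greedy draft-and-verify, the textbook form of speculative decoding. It is not a description of DFlash's actual acceptance rule, which this guide does not specify, and whether the reported mean counts the extra token contributed by the main model is likewise an assumption.

```python
def verify_step(draft: list, target: list) -> list:
    """Return the tokens emitted by one greedy draft-and-verify step.

    draft:  tokens proposed by the draft model.
    target: the main model's greedy choice at each draft position, plus one
            more choice for the position after the last draft token, so
            len(target) == len(draft) + 1.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    emitted = []
    for proposed, wanted in zip(draft, target):
        if proposed != wanted:
            # First disagreement: keep the main model's token and stop.
            emitted.append(wanted)
            return emitted
        emitted.append(proposed)
    # Every draft token matched, so the extra main-model token is free.
    emitted.append(target[-1])
    return emitted


# Toy token ids: three steps with different amounts of agreement.
steps = [
    verify_step([1, 2, 3], [1, 2, 9, 4]),   # two matches, then a correction
    verify_step([1, 2, 3], [1, 2, 3, 4]),   # full match plus bonus token
    verify_step([5], [6, 7]),               # immediate mismatch
]
mean_emitted = sum(len(s) for s in steps) / len(steps)
print(steps, mean_emitted)
```

Each step costs one forward pass of the main model regardless of how many tokens it emits, which is where the speedup comes from when the draft agrees often.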

Requires:

vLLM built from PR #46853 (adds DFlash drafter support for Laguna-XS-2.1) — not yet in a stable release.
VLLM_USE_DEEP_GEMM=0 in the launch environment — DeepGEMM is currently incompatible with the DFlash draft path.

Example:

VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS-2.1 \
  --trust-remote-code \
  --max-model-len 16384 \
  --enable-auto-tool-choice \
  --tool-call-parser poolside_v1 \
  --reasoning-parser poolside_v1 \
  --speculative-config '{"model":"poolside/Laguna-XS-2.1-DFlash","num_speculative_tokens":15,"method":"dflash"}'
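The inline JSON in --speculative-config is easy to break with shell quoting. One option is to build the command in Python and let `json.dumps` and `shlex.join` handle the escaping. This sketch reuses the flags and values from the example above unchanged; launching through `subprocess` is an optional convenience, not something the guide prescribes.

```python
import json
import os
import shlex

speculative_config = {
    "model": "poolside/Laguna-XS-2.1-DFlash",
    "num_speculative_tokens": 15,
    "method": "dflash",
}

argv = [
    "vllm", "serve", "poolside/Laguna-XS-2.1",
    "--trust-remote-code",
    "--max-model-len", "16384",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "poolside_v1",
    "--reasoning-parser", "poolside_v1",
    # Compact JSON, serialized once so the quoting is always valid.
    "--speculative-config", json.dumps(speculative_config, separators=(",", ":")),
]

# DeepGEMM must be disabled for the DFlash draft path (see the requirements).
env = {**os.environ, "VLLM_USE_DEEP_GEMM": "0"}

# Print a copy-pasteable shell command.
print("VLLM_USE_DEEP_GEMM=0 " + shlex.join(argv))

# To launch from Python instead of a shell (not executed here):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```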
