
poolside/Laguna-XS.2

Poolside's 33B total / 3B activated MoE coding model with mixed sliding-window + global attention, native interleaved reasoning, and 128K context — designed for agentic coding.


Overview

Laguna XS.2 is Poolside's 33B-total / 3B-activated Mixture-of-Experts model purpose-built for agentic coding and long-horizon work. It combines mixed sliding-window + global attention (3:1 across 40 layers) with sigmoid per-head gating and an FP8 KV cache, so it stays compact enough to run locally while supporting a 131,072-token (128K) context.

Key features

  • Mixed SWA + global attention: 30 sliding-window layers (window=512) interleaved with 10 global-attention layers, each with per-layer rotary scaling (see the sketch after this list).
  • Native FP8 KV cache: KV cache is quantized to FP8 to reduce memory per token.
  • Interleaved reasoning: thinking blocks emitted between tool calls; toggled per-request via enable_thinking.
  • Tool calling: Poolside-specific XML-style tool-call protocol, parsed via poolside_v1.
  • MoE routing: 256 experts + 1 shared expert with top-8 routing.
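
The attention layout is easy to picture as a repeating 3:1 pattern. A minimal sketch, assuming global layers fall on every fourth layer (the actual placement is defined by the model's config.json, which should be treated as authoritative):

# Assumed layout: three sliding-window layers followed by one global layer,
# repeated across all 40 layers (30 SWA + 10 global in total).
NUM_LAYERS = 40
WINDOW = 512  # sliding-window size from the model card

layer_types = [
    "global" if (i + 1) % 4 == 0 else f"swa({WINDOW})"
    for i in range(NUM_LAYERS)
]

assert layer_types.count("global") == 10
print(layer_types[:8])  # ['swa(512)', 'swa(512)', 'swa(512)', 'global', ...]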

Prerequisites

Laguna XS.2 support lives in an open vLLM PR (vllm-project/vllm#41129); until it lands in a stable release, install from a nightly wheel or use the pinned Docker image below.

Docker

docker pull vllm/vllm-openai:laguna

pip (nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match
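
Before serving, a quick check that the nightly wheel is the one on your path:

python -c "import vllm; print(vllm.__version__)"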

Launch command

Single GPU (H100/H200/B200, BF16)

vllm serve poolside/Laguna-XS.2 \
  --trust-remote-code \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser poolside_v1 \
  --reasoning-parser poolside_v1
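
Once the server logs show it is ready, a quick smoke test against the OpenAI-compatible endpoint:

curl http://localhost:8000/v1/models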

Docker

docker run -itd --name laguna-xs2 \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:laguna \
    --model poolside/Laguna-XS.2 \
    --trust-remote-code \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser poolside_v1 \
    --reasoning-parser poolside_v1 \
    --host 0.0.0.0 --port 8000
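
The container runs detached (-itd); follow its logs until the server reports it is ready:

docker logs -f laguna-xs2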

Controlling reasoning

Reasoning is off by default in the chat template. Enable it per-request:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="poolside/Laguna-XS.2",
    messages=[{"role": "user", "content": "Write a Python retry wrapper with exponential backoff."}],
    temperature=0.7,
    top_p=1.0,
    # vLLM-specific sampling params (top_k) and chat-template kwargs
    # both go through extra_body, not query parameters
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "top_k": 20},
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)

To make reasoning the default instead, launch the server with --default-chat-template-kwargs '{"enable_thinking": true}'.
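
Tool calling goes through the standard OpenAI tools parameter; with --enable-auto-tool-choice and --tool-call-parser poolside_v1 set at launch, the model's XML-style output is parsed back into structured tool_calls. A minimal sketch with a hypothetical run_tests tool (the tool name and schema are illustrative, not part of the model):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory to run."},
            },
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="poolside/Laguna-XS.2",
    messages=[{"role": "user", "content": "Run the unit tests under tests/ and summarize any failures."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)

If the model decides to call the tool, tool_calls carries the parsed function name and JSON arguments; otherwise it is None and content holds a plain reply.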
