poolside/Laguna-XS.2
Poolside's 33B total / 3B activated MoE coding model with mixed sliding-window + global attention, native interleaved reasoning, and 128K context — designed for agentic coding.
Overview
Laguna XS.2 is Poolside's 33B-total / 3B-activated Mixture-of-Experts model purpose-built for agentic coding and long-horizon work. It combines mixed sliding-window + global attention (3:1 across 40 layers) with sigmoid per-head gating and an FP8 KV cache, so it stays compact enough to run locally while supporting a 131,072-token (128K) context.
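The 3:1 layout above can be sketched in a few lines. The exact interleave order is not stated here, so the pattern below (one global layer after every three sliding-window layers) is an assumption for illustration; only the 30/10 split is from the model description.

```python
# Sketch of the mixed attention layout: 40 layers at a 3:1 sliding/global ratio.
# ASSUMPTION: the simplest interleave, a global layer after every 3 sliding ones;
# the real per-layer order (and per-layer rotary scaling) may differ.
NUM_LAYERS = 40
SLIDING_WINDOW = 512  # tokens each sliding-window layer can attend to

def layer_kind(i: int) -> str:
    # Hypothetical pattern: layers 3, 7, 11, ... (every 4th) are global.
    return "global" if i % 4 == 3 else "sliding"

layout = [layer_kind(i) for i in range(NUM_LAYERS)]
print(layout.count("sliding"), layout.count("global"))  # → 30 10
```

Under any 3:1 interleave, only 10 of the 40 layers keep full-context KV state, which is what keeps long-context memory use manageable alongside the FP8 KV cache.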
Key features
- Mixed SWA + global attention: 30 sliding-window layers (window=512) interleaved with 10 global-attention layers, each with per-layer rotary scaling.
- Native FP8 KV cache: KV cache is quantized to FP8 to reduce memory per token.
- Interleaved reasoning: thinking blocks emitted between tool calls; toggled per-request via enable_thinking.
- Tool calling: Poolside-specific XML-style tool-call protocol, parsed via poolside_v1.
- MoE routing: 256 experts + 1 shared expert with top-8 routing.
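Sigmoid gating with top-8 routing can be sketched as follows. This is a minimal illustration of the routing math, not the model's implementation: whether the selected gates are renormalized, and how the shared expert is weighted, are assumptions.

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8

def route(logits):
    # Sigmoid gating: each expert gets an independent gate in (0, 1);
    # unlike softmax, experts do not compete for probability mass.
    gates = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Keep the top-8 experts by gate value.
    top = sorted(range(NUM_EXPERTS), key=lambda i: gates[i], reverse=True)[:TOP_K]
    # ASSUMPTION: routed weights are renormalized to sum to 1.
    total = sum(gates[i] for i in top)
    return [(i, gates[i] / total) for i in top]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(logits)  # 8 (expert_index, weight) pairs, weights descending
# The shared expert runs on every token in addition to the 8 routed experts.
```

Per token, only the 8 routed experts plus the shared expert execute, which is how a 33B-parameter model activates only ~3B parameters per forward pass.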
Prerequisites
Laguna XS.2 support is still in an open vLLM PR (vllm-project/vllm#41129); until it lands in a stable release, install from a nightly wheel or use the pinned Docker image below.
Docker (recommended)
docker pull vllm/vllm-openai:laguna
pip (nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly/cu130 \
--extra-index-url https://download.pytorch.org/whl/cu130 \
--index-strategy unsafe-best-match
Launch command
Single GPU (H100/H200/B200, BF16)
vllm serve poolside/Laguna-XS.2 \
--trust-remote-code \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1
Docker
docker run -itd --name laguna-xs2 \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:laguna \
--model poolside/Laguna-XS.2 \
--trust-remote-code \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--host 0.0.0.0 --port 8000
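Once the server (local or Docker) is up, a quick stdlib-only smoke test confirms it is serving the model. This assumes the default host/port from the commands above and falls back gracefully if the server is not reachable.

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # matches --host 0.0.0.0 --port 8000 above

def list_models(base: str = BASE):
    # Query the OpenAI-compatible /v1/models endpoint.
    try:
        with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as r:
            return [m["id"] for m in json.load(r)["data"]]
    except OSError as e:
        return f"server not reachable: {e}"

print(list_models())  # expect ["poolside/Laguna-XS.2"] once the server is ready
```

Model download and weight loading can take several minutes on first start, so retry if the endpoint is not up immediately.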
Controlling reasoning
Reasoning is off by default in the chat template. Enable it per-request:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
model="poolside/Laguna-XS.2",
messages=[{"role": "user", "content": "Write a Python retry wrapper with exponential backoff."}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "top_k": 20,  # vLLM sampling extensions go in the request body, not the query string
    },
    temperature=0.7,
    top_p=1.0,
)
print(resp.choices[0].message.reasoning_content)
print(resp.choices[0].message.content)
To enable reasoning by default for every request, launch the server with --default-chat-template-kwargs '{"enable_thinking": true}'.
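Tool calling works through the standard OpenAI-compatible tools field; with --enable-auto-tool-choice and the poolside_v1 parser, the server returns parsed tool_calls on the message. A stdlib-only sketch, where the read_file tool schema is purely illustrative (not a built-in tool):

```python
import json
import urllib.request

def build_request(user_msg: str) -> dict:
    # Standard OpenAI-style chat payload with one hypothetical tool.
    return {
        "model": "poolside/Laguna-XS.2",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical tool, for illustration only
                "description": "Read a file from the workspace.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
        "tool_choice": "auto",
    }

payload = build_request("Show me pyproject.toml")
try:
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as r:
        msg = json.load(r)["choices"][0]["message"]
        # The poolside_v1 parser turns the model's XML-style tool-call
        # protocol into structured tool_calls entries here.
        print(msg.get("tool_calls"))
except OSError as e:
    print("server not reachable:", e)
```

With enable_thinking on, reasoning blocks emitted between tool calls arrive in reasoning_content, separate from the tool calls themselves.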