poolside/Laguna-XS-2.1
Poolside's 33B total / 3B activated MoE coding model with mixed sliding-window + global attention, native interleaved reasoning, and 256K context — designed for agentic coding.
33B/3B-A MoE for agentic coding with interleaved thinking and tool use
Overview
Laguna XS-2.1 is Poolside's 33B-total / 3B-activated Mixture-of-Experts model purpose-built for agentic coding and long-horizon work. It combines mixed sliding-window + global attention (3:1 across 40 layers) with sigmoid per-head gating and FP8 KV cache, so it stays compact enough to run locally while supporting a 256K-token context. The 2.1 point release improves coding quality over XS.2 (SWE-bench Multilingual 57.7% → 63.1%, SWE-bench Verified 69.9% → 70.9%).
Key features
- Mixed SWA + global attention: 30 sliding-window layers (window=512) interleaved with 10 global-attention layers, each with per-layer rotary scaling.
- Native FP8 KV cache: KV cache is quantized to FP8 to reduce memory per token.
- Interleaved reasoning: thinking blocks emitted between tool calls; toggled per-request via enable_thinking.
- Tool calling: Poolside-specific XML-style tool-call protocol, parsed via poolside_v1.
- 256 experts + 1 shared expert with top-8 routing.
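To make the attention layout concrete, the sketch below lays out 40 layers at a 3:1 sliding-window to global ratio and counts how many token positions each layer type has to keep in its KV cache at full context. The placement of the global layer at the end of each group of four is an assumption made for illustration; the real checkpoint may interleave the layers differently. The helper names are hypothetical, and per-token byte sizes (head count, head dimension, FP8 width) are deliberately left out because the guide does not state them.

```python
from collections import Counter

NUM_LAYERS = 40        # from the overview
SWA_PER_GLOBAL = 3     # 3:1 sliding-window : global ratio
SLIDING_WINDOW = 512   # sliding-window size from the feature list
MAX_CONTEXT = 262144   # 256K-token context


def layer_types(num_layers: int = NUM_LAYERS,
                swa_per_global: int = SWA_PER_GLOBAL) -> list[str]:
    """Return one label per layer.

    ASSUMPTION: the global layer closes each group of (swa_per_global + 1)
    layers. Only the ratio comes from the guide, not the ordering.
    """
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding"
            for i in range(num_layers)]


def kv_positions_cached(layer_type: str, context_len: int,
                        window: int = SLIDING_WINDOW) -> int:
    """Token positions a layer must retain: the full context for a global
    layer, at most the last `window` positions for a sliding-window layer."""
    return context_len if layer_type == "global" else min(context_len, window)


layout = layer_types()
counts = Counter(layout)
mixed_total = sum(kv_positions_cached(t, MAX_CONTEXT) for t in layout)
dense_total = NUM_LAYERS * MAX_CONTEXT  # if every layer were global

print(counts)
print(f"cached positions vs. all-global: {mixed_total / dense_total:.3f}")
```

Under these assumptions the mixed layout retains roughly a quarter of the positions an all-global stack would, which is the main reason a 256K context stays affordable on a single GPU.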
Prerequisites
Laguna XS-2.1 support is available in vLLM 0.21.0 and later (via PR #41129).
pip
uv pip install -U 'vllm>=0.21.0'
Docker
docker pull vllm/vllm-openai:latest
Launch command
Single GPU (H100/H200/B200, BF16)
vllm serve poolside/Laguna-XS-2.1 \
--trust-remote-code \
--max-model-len 262144 \
--enable-auto-tool-choice \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1
Docker
docker run -itd --name laguna-xs21 \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model poolside/Laguna-XS-2.1 \
--trust-remote-code \
--max-model-len 262144 \
--enable-auto-tool-choice \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--host 0.0.0.0 --port 8000
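Loading the weights can take a while, so clients usually need to wait before sending the first request. The following is a minimal readiness check, assuming the server is reachable on localhost port 8000 as in the commands above and exposes vLLM's /health endpoint. The function names, timeout, and polling interval are illustrative choices, not part of the recipe.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool] = http_probe,
                     timeout_s: float = 1800.0,
                     interval_s: float = 5.0) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    print("server ready" if wait_until_ready() else "server did not come up in time")
```

Passing the probe as a parameter keeps the polling loop easy to exercise without a running server.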
Controlling reasoning
Reasoning is off by default in the chat template. Enable it per-request:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="poolside/Laguna-XS-2.1",
    messages=[{"role": "user", "content": "Write a Python retry wrapper with exponential backoff."}],
    # vLLM-specific fields (chat template kwargs, top_k) are read from the JSON
    # request body, so they go in extra_body rather than extra_query.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "top_k": 20},
    temperature=0.7,
    top_p=1.0,
)
print(resp.choices[0].message.reasoning_content)  # the model's thinking
print(resp.choices[0].message.content)  # the final answer
Or default-on with --default-chat-template-kwargs '{"enable_thinking": true}'.
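Because the launch commands enable automatic tool choice, a request can also carry tool definitions in the standard OpenAI function-calling format. The sketch below builds such a request as a plain JSON body and parses tool calls out of the response, using only the standard library. The read_file tool, the helper names, and the 600-second timeout are hypothetical and exist only for illustration; the server address matches the client example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "poolside/Laguna-XS-2.1"

# Hypothetical tool definition, for illustration only.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_payload(prompt: str, thinking: bool = True) -> dict:
    """Build a chat-completions body with one tool and per-request thinking."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
        # vLLM-specific fields sit at the top level of the JSON body.
        "chat_template_kwargs": {"enable_thinking": thinking},
        "temperature": 0.7,
        "top_k": 20,
    }


def send(payload: dict) -> dict:
    """POST the payload to the chat-completions endpoint and decode the reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) for each tool call in the reply."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]


if __name__ == "__main__":
    reply = send(build_payload("Open setup.py and summarize its dependencies."))
    print(extract_tool_calls(reply))
```

In an agent loop, each extracted call would be executed and its result appended as a tool message before the next request.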
Speculative decoding (DFlash)
Poolside's DFlash draft model can be attached for speculative decoding. It is a 5-layer Llama-style speculator that proposes up to 15 tokens per step. Reported end-to-end speedup is 1.67x–2.64x across evaluation datasets, with a mean accepted length of 3.55–4.57 tokens per step.
Requires:
- vLLM built from PR #46853 (adds DFlash drafter support for Laguna-XS-2.1); not yet in a stable release.
- VLLM_USE_DEEP_GEMM=0 in the launch environment; DeepGEMM is currently incompatible with the DFlash draft path.
Example:
VLLM_USE_DEEP_GEMM=0 vllm serve poolside/Laguna-XS-2.1 \
--trust-remote-code \
--max-model-len 16384 \
--enable-auto-tool-choice \
--tool-call-parser poolside_v1 \
--reasoning-parser poolside_v1 \
--speculative-config '{"model":"poolside/Laguna-XS-2.1-DFlash","num_speculative_tokens":15,"method":"dflash"}'
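The --speculative-config value is a JSON string nested inside shell quotes, which is easy to get wrong when editing by hand. As a small sketch, the launch command above can be generated from Python so that the JSON is always valid and correctly quoted. The function name and its parameters are hypothetical; the flags and values are the ones shown in the example.

```python
import json
import shlex


def dflash_serve_command(max_model_len: int = 16384,
                         num_speculative_tokens: int = 15) -> str:
    """Build the DFlash launch command with a correctly quoted JSON config."""
    spec_config = {
        "model": "poolside/Laguna-XS-2.1-DFlash",
        "num_speculative_tokens": num_speculative_tokens,
        "method": "dflash",
    }
    args = [
        "vllm", "serve", "poolside/Laguna-XS-2.1",
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--enable-auto-tool-choice",
        "--tool-call-parser", "poolside_v1",
        "--reasoning-parser", "poolside_v1",
        # Compact JSON; shlex.join adds the shell quoting around it.
        "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
    ]
    # DeepGEMM is disabled via the environment, as the requirements state.
    return "VLLM_USE_DEEP_GEMM=0 " + shlex.join(args)


command = dflash_serve_command()
print(command)
```

Changing num_speculative_tokens or max_model_len then only requires editing a function argument rather than the quoted JSON.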