LiquidAI/LFM2.5-8B-A1B

Liquid AI's 8B mixture-of-experts model (~1B active) on the LFM2 hybrid conv+attention backbone, with <think> reasoning and tool calling at ~1B decode cost.



MoE | 8B / 1B | 128,000 ctx | vLLM 0.23.0+ | text


Overview

LFM2.5-8B-A1B is Liquid AI's mixture-of-experts model: ~8B total parameters with only ~1B activated per token. It pairs the LFM2 hybrid backbone (short-range gated convolution blocks interleaved with grouped-query attention) with sparse MoE feed-forward layers, so it reaches 8B-class quality at roughly 1B decode cost. It supports explicit <think> reasoning and Pythonic tool calling through vLLM's OpenAI-compatible API.

Key Features

Hybrid + MoE: Gated short convolutions + grouped-query attention with sparse expert FFNs — ~1B active parameters per token out of ~8B total.
<think> reasoning: Emits an explicit <think>…</think> chain-of-thought on non-trivial problems; vLLM's qwen3 parser splits it into reasoning_content.
Tool calling: Pythonic tool calls surfaced as OpenAI tool_calls by vLLM's native lfm2 parser.
128K context: Long-context support (max_position_embeddings = 128000).
Native vLLM support: Served via the Lfm2MoeForCausalLM architecture — no --trust-remote-code required.

Supported Variants

Dense:

LiquidAI/LFM2.5-350M (350M)
LiquidAI/LFM2.5-1.2B-Instruct (1.2B)
LiquidAI/LFM2.5-1.2B-Thinking (1.2B, reasoning)
LiquidAI/LFM2.5-1.2B-JP / LiquidAI/LFM2.5-1.2B-JP-202606 (Japanese)
LiquidAI/LFM2.5-1.2B-Base (pretrained base)

MoE:

LiquidAI/LFM2.5-8B-A1B (8B total / ~1B active)

Vision-Language:

LiquidAI/LFM2.5-VL-450M, LiquidAI/LFM2.5-VL-1.6B

See the LFM2.5 usage guide for the full family.

Sizing: MoE keeps every expert resident in VRAM, so size the GPU for the full ~8B of weights (≈17 GB BF16 + KV cache) even though only ~1B is active per token.
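
The arithmetic behind that sizing note can be sketched as follows. The helper names are illustrative, the 24 GB and 0.90 figures are the ones quoted in the Prerequisites and Configuration Tips sections of this guide, and the estimate ignores activations, CUDA graphs, and allocator overhead, so read the result as a rough upper bound on KV-cache headroom rather than a measurement.

```python
def weight_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes); BF16 is 2 bytes per parameter."""
    return total_params * bytes_per_param / 1e9


def kv_headroom_gb(gpu_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left inside the --gpu-memory-utilization budget once weights are loaded."""
    return gpu_gb * utilization - weights_gb


# A nominal 8e9 parameters in BF16. The ~17 GB quoted above is a little higher,
# presumably because the real parameter count is not exactly 8e9 (an assumption,
# the exact count is not stated in this guide).
print(f"weights at 8e9 params: {weight_gb(8e9):.1f} GB")

# The quoted ~17 GB of weights on a 24 GB card at --gpu-memory-utilization 0.90.
print(f"KV-cache headroom: {kv_headroom_gb(24, 0.90, 17):.1f} GB")
```

With these inputs the weights come to about 16 GB and the headroom to about 4.6 GB, which is why a 24 GB card is workable but long contexts benefit from a larger GPU.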

Prerequisites

Hardware: 1× GPU with ≥24 GB VRAM. Verified on H100.
vLLM: ≥ 0.23.0 — the Lfm2MoeForCausalLM architecture ships in the 0.23.0 stable release.

pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Deployment Configurations

Quick Start (Single GPU, BF16)

vllm serve LiquidAI/LFM2.5-8B-A1B

Full-Featured Server Launch

Reasoning and tool calling:

vllm serve LiquidAI/LFM2.5-8B-A1B \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser lfm2 \
  --host 0.0.0.0 --port 8000

Multi-GPU (Expert Parallelism)

Split the experts across GPUs on a multi-GPU node:

vllm serve LiquidAI/LFM2.5-8B-A1B \
  --tensor-parallel-size 2 \
  --enable-expert-parallel

Docker (NVIDIA)

docker run -itd --name lfm2.5-8b-a1b \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
    --model LiquidAI/LFM2.5-8B-A1B \
    --reasoning-parser qwen3 \
    --host 0.0.0.0 --port 8000

Client Usage

Reasoning Mode

The model card recommends temperature 0.2, top_k 80, repetition_penalty 1.05. Give the <think> block room — don't cap max_tokens too low.

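from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-8B-A1B",
    messages=[{"role": "user", "content": "A snail climbs 3 ft up a 20 ft well each day and slides 2 ft back each night. How many days to reach the top?"}],
    temperature=0.2,
    max_tokens=2048,  # leave room for the <think> block
    # top_k and repetition_penalty are vLLM extensions, so they go in extra_body.
    extra_body={"top_k": 80, "repetition_penalty": 1.05},
)
msg = response.choices[0].message
# reasoning_content is filled in by the reasoning parser; it may be missing or empty.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)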

Note: the model opens the <think> channel for non-trivial problems; a simple prompt may be answered directly, in which case reasoning_content is empty. That is expected behavior — the qwen3 parser extracts the block whenever it is present.

Tool Calling

Add --enable-auto-tool-choice --tool-call-parser lfm2 at launch, then pass tools=[…]; the lfm2 parser converts the model's Pythonic call into a standard tool_calls array.
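
A minimal client-side sketch of that flow is shown here, assuming the server was launched with the two flags above and is listening on localhost:8000. The get_weather tool, its schema, and the helper names are hypothetical and exist only for illustration; the request itself uses the standard OpenAI tools and tool_choice parameters.

```python
import json

# Hypothetical tool definition; any JSON-schema function description works here.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> str:
    """Stub implementation; a real tool would query a weather service."""
    return f"Sunny in {city}"


LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Dispatch one tool call; the arguments arrive as a JSON-encoded string."""
    return LOCAL_TOOLS[name](**json.loads(arguments_json))


def ask_with_tools(prompt: str, base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Send one request and execute whatever tool calls come back."""
    # Imported here so the helpers above work without the client installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model="LiquidAI/LFM2.5-8B-A1B",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    calls = response.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

Calling ask_with_tools("What's the weather in Tokyo?") against a running server should return one result per tool call the model chose to make, or an empty list if it answered directly. To continue the conversation, append each result as a role "tool" message with the matching tool_call_id and send a second request.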

Structured Outputs

vLLM's guided decoding constrains output to a JSON schema via response_format. The model does not see the schema itself — put semantic instructions in the system prompt.
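
The sketch below shows one way to pair a schema with a system prompt, assuming a server on localhost:8000. The person schema, the prompt wording, and the helper names are hypothetical; the response_format shape follows the OpenAI json_schema convention that vLLM's server accepts.

```python
import json

# Hypothetical schema used only for illustration.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}


def build_request(text: str) -> dict:
    """Build chat.completions kwargs that constrain the reply to PERSON_SCHEMA."""
    return {
        "model": "LiquidAI/LFM2.5-8B-A1B",
        "messages": [
            # The schema only constrains decoding, so the prompt states the task.
            {"role": "system", "content": "Extract the person's name and age as JSON."},
            {"role": "user", "content": text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "person", "schema": PERSON_SCHEMA},
        },
    }


def extract_person(text: str, base_url: str = "http://localhost:8000/v1") -> dict:
    """Send the request and parse the schema-constrained reply."""
    # Imported here so build_request works without the client installed.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.chat.completions.create(**build_request(text))
    return json.loads(response.choices[0].message.content)
```

Keeping the request builder separate from the network call makes it easy to inspect or log the exact payload before sending it.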

Configuration Tips

Size the GPU for the full ~8B of weights — all experts are resident even though ~1B is active.
For multi-GPU nodes, --enable-expert-parallel distributes experts across ranks.
Set --max-model-len to match your workload (up to 128K).
--gpu-memory-utilization 0.90–0.95 leaves the most memory for KV cache once the weights are loaded; 0.90 is vLLM's default.
Sampling presets are per-request client defaults — don't bake them into vllm serve.
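
One way to keep the sampling preset on the client side is a small merge helper like the sketch below. The values are the model-card recommendations quoted in the Reasoning Mode section; the helper name and merge policy are illustrative assumptions, not part of vLLM or the OpenAI client.

```python
# Model-card values as quoted in the Reasoning Mode section of this guide.
MODEL_CARD_SAMPLING = {
    "temperature": 0.2,
    "extra_body": {"top_k": 80, "repetition_penalty": 1.05},
}


def with_sampling(request: dict, **overrides) -> dict:
    """Return a copy of request with the preset applied; request and overrides win."""
    merged = {**MODEL_CARD_SAMPLING, **request, **overrides}
    # Merge the nested dict separately so the shared preset is never mutated
    # and a partial extra_body keeps the remaining preset keys.
    merged["extra_body"] = {
        **MODEL_CARD_SAMPLING["extra_body"],
        **merged.get("extra_body", {}),
    }
    return merged
```

A call then looks like client.chat.completions.create(**with_sampling({"model": "LiquidAI/LFM2.5-8B-A1B", "messages": messages}, max_tokens=2048)), and any single request can still override temperature or top_k without touching the server launch command.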