LiquidAI/LFM2.5-1.2B-Thinking
Liquid AI's 1.2B reasoning model on the LFM2 hybrid conv+attention backbone — <think> chain-of-thought, tool calling, and 128K context on a single small GPU.
Overview
LFM2.5-1.2B-Thinking is the reasoning
variant of Liquid AI's 1.2B LFM2.5 model, built on the LFM2 hybrid
backbone (short-range gated convolution blocks interleaved with grouped-query attention). On
reasoning-heavy tasks it produces an explicit <think>…</think> chain-of-thought before its
final answer, so it reasons well above its weight class while still serving on a single small
GPU through vLLM's OpenAI-compatible API.
Key Features
- Hybrid backbone: Gated short convolutions interleaved with grouped-query attention — a smaller KV cache and lower decode latency than a same-size full-attention transformer.
- <think> reasoning: Emits an explicit <think>…</think> chain-of-thought on non-trivial problems; vLLM's qwen3 parser splits it into a separate reasoning_content field.
- 128K context: Long-context support (max_position_embeddings = 128000).
- Tool calling: Pythonic tool calls surfaced as OpenAI tool_calls by vLLM's native lfm2 parser.
- Native vLLM support: Served via the Lfm2ForCausalLM architecture; no --trust-remote-code required.
Supported Variants
Dense:
- LiquidAI/LFM2.5-350M (350M)
- LiquidAI/LFM2.5-1.2B-Instruct (1.2B)
- LiquidAI/LFM2.5-1.2B-Thinking (1.2B, reasoning)
- LiquidAI/LFM2.5-1.2B-JP / LiquidAI/LFM2.5-1.2B-JP-202606 (Japanese)
- LiquidAI/LFM2.5-1.2B-Base (pretrained base)
MoE:
- LiquidAI/LFM2.5-8B-A1B (8B total / ~1B active, also a reasoning model)
Vision-Language:
- LiquidAI/LFM2.5-VL-450M
- LiquidAI/LFM2.5-VL-1.6B
See the LFM2.5 usage guide for the full family.
Prerequisites
- Hardware: 1× GPU with ≥8 GB VRAM. Verified on H100.
- vLLM: ≥ 0.23.0 — the LFM2 architecture ships in the 0.23.0 stable release.
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Deployment Configurations
Quick Start (Single GPU, BF16)
Enable the reasoning parser so the chain-of-thought is returned in reasoning_content:
vllm serve LiquidAI/LFM2.5-1.2B-Thinking \
--reasoning-parser qwen3
Full-Featured Server Launch
Reasoning and tool calling:
vllm serve LiquidAI/LFM2.5-1.2B-Thinking \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser lfm2 \
--host 0.0.0.0 --port 8000
Docker (NVIDIA)
docker run -itd --name lfm2.5-thinking \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model LiquidAI/LFM2.5-1.2B-Thinking \
--reasoning-parser qwen3 \
--host 0.0.0.0 --port 8000
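After any of the launch commands above, the server needs some time to download and load the weights before it accepts requests. The following is a minimal sketch of a readiness check, assuming the server listens on localhost:8000 and exposes the OpenAI-compatible /v1/models endpoint. The helper names list_models and wait_until_served are hypothetical and not part of vLLM.

```python
import json
import time
import urllib.request


def list_models(base_url="http://localhost:8000/v1"):
    """Return the model IDs reported by the OpenAI-compatible /models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return [m["id"] for m in json.load(resp)["data"]]


def wait_until_served(model, fetch=list_models, timeout_s=300,
                      interval_s=5, sleep=time.sleep):
    """Poll until `model` is listed or `timeout_s` elapses. Returns True or False.

    `fetch` and `sleep` are injectable so the loop can be exercised offline.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if model in fetch():
                return True
        except OSError:
            pass  # connection refused or timed out: server is not up yet
        sleep(interval_s)
    return False
```

A typical call would be wait_until_served("LiquidAI/LFM2.5-1.2B-Thinking") before sending the first chat request; the 300-second default is an arbitrary choice, not a measured load time.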
Client Usage
Reasoning Mode
The model card recommends low-temperature sampling for the reasoning model: temperature 0.05,
top_k 50, repetition_penalty 1.05. Do not cap max_tokens too low, or generation can stop in the
middle of the chain-of-thought, before the final answer is produced.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Thinking",
    messages=[{"role": "user", "content": "A snail climbs 3 ft up a 20 ft well each day and slides 2 ft back each night. How many days to reach the top?"}],
    temperature=0.05,
    max_tokens=2048,
    # top_k and repetition_penalty are vLLM extensions, so they travel in extra_body.
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)

msg = response.choices[0].message
print("reasoning:", msg.reasoning_content)
print("answer:", msg.content)
Note: the model opens the <think> channel for non-trivial problems; a simple prompt may be answered directly, in which case reasoning_content is empty. That is expected behavior; the qwen3 parser extracts the block whenever it is present.
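Since the reasoning field can be empty and a low max_tokens can cut the reply short, client code may want to normalize both cases before using the answer. The sketch below is one way to do that. summarize_choice is a hypothetical helper, the fallback to a reasoning attribute is an assumption for builds that expose the field under that name, and the two sample objects are offline stand-ins rather than real server output.

```python
from types import SimpleNamespace


def summarize_choice(choice):
    """Normalize one chat-completion choice into a plain dict.

    `choice` is any object shaped like an OpenAI SDK choice
    (`.message` and `.finish_reason`).
    """
    msg = choice.message
    # Assumption: try `reasoning_content` first, then `reasoning`.
    reasoning = (
        getattr(msg, "reasoning_content", None)
        or getattr(msg, "reasoning", None)
        or ""
    )
    return {
        "reasoning": reasoning,
        "answer": msg.content or "",
        # finish_reason == "length" means generation stopped at max_tokens.
        "truncated": choice.finish_reason == "length",
    }


# Stand-in for a simple prompt answered directly (no <think> block).
direct = SimpleNamespace(
    message=SimpleNamespace(content="4", reasoning_content=None),
    finish_reason="stop",
)
# Stand-in for a reply cut off inside the chain-of-thought.
cut_off = SimpleNamespace(
    message=SimpleNamespace(content=None, reasoning_content="Each day the snail nets"),
    finish_reason="length",
)

print(summarize_choice(direct))
print(summarize_choice(cut_off))
```

When truncated is true and answer is empty, retrying with a larger max_tokens is usually the simplest remedy.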
Tool Calling
Add --enable-auto-tool-choice --tool-call-parser lfm2 at launch, then pass tools=[…]; the
model can reason about which tool to call before emitting it.
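As a rough illustration of the client side, the sketch below defines one tool in the OpenAI tools format and executes whatever tool calls come back. The get_weather tool, its stub result, and the run_tool_calls helper are all hypothetical, and the sample message is an offline stand-in for response.choices[0].message.

```python
import json
from types import SimpleNamespace

# Hypothetical tool definition in the OpenAI `tools` format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current temperature for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city):
    # Stub result; a real tool would query a weather service.
    return {"city": city, "temp_c": 21}


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=REGISTRY):
    """Run each tool call on an assistant message and build the
    role="tool" messages to send back on the next turn."""
    results = []
    for call in message.tool_calls or []:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        output = registry[call.function.name](**args)
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(output)}
        )
    return results


# Offline stand-in for an assistant message that requests one tool call.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="call_0",
            function=SimpleNamespace(name="get_weather", arguments='{"city": "Oslo"}'),
        )
    ]
)
print(run_tool_calls(fake_message))
```

Against a live server, TOOLS would be passed as tools=TOOLS to client.chat.completions.create, and the returned tool messages appended to the conversation, after the assistant message, for a follow-up request.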
Structured Outputs
vLLM's guided decoding constrains output to a JSON schema via response_format. The model does
not see the schema itself — put semantic instructions in the system prompt.
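A minimal sketch of such a request follows, assuming the OpenAI-style json_schema form of response_format. The schema, the system prompt wording, and the helpers build_structured_request and parse_answer are hypothetical, and the JSON string at the end is a hand-written stand-in for a model reply.

```python
import json

# Hypothetical schema for illustration.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "days": {"type": "integer"},
        "explanation": {"type": "string"},
    },
    "required": ["days", "explanation"],
}


def build_structured_request(question, schema=ANSWER_SCHEMA):
    """Return keyword arguments for client.chat.completions.create."""
    return {
        "model": "LiquidAI/LFM2.5-1.2B-Thinking",
        "messages": [
            # The schema only constrains decoding, so the meaning of each
            # field is spelled out in the system prompt.
            {
                "role": "system",
                "content": "Reply as JSON with `days` (integer number of days) "
                "and `explanation` (one sentence).",
            },
            {"role": "user", "content": question},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": schema},
        },
    }


def parse_answer(content, schema=ANSWER_SCHEMA):
    """Parse the reply and confirm the required keys are present."""
    data = json.loads(content)
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


kwargs = build_structured_request("How many days does the snail need?")
print(kwargs["response_format"]["type"])
print(parse_answer('{"days": 18, "explanation": "Nets 1 ft per day, then climbs out."}'))
```

The request would be sent as client.chat.completions.create(**kwargs), with parse_answer applied to the returned message content.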
Configuration Tips
- Give reasoning room: keep max_tokens high enough for the <think> block plus the answer.
- Set --max-model-len to match your workload (up to 128K).
- --gpu-memory-utilization 0.90–0.95 maximizes KV cache capacity.
- Sampling presets are per-request client defaults; don't bake them into vllm serve.
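One way to keep sampling presets on the client is a small dictionary merged into each request, so individual calls can still override a value. The sketch below reuses the sampling values from the Reasoning Mode section; the with_preset helper and the example prompt are hypothetical.

```python
# Sampling values from the Reasoning Mode section, kept on the client side.
THINKING_PRESET = {
    "temperature": 0.05,
    "max_tokens": 2048,
    "extra_body": {"top_k": 50, "repetition_penalty": 1.05},
}


def with_preset(messages, preset=THINKING_PRESET, **overrides):
    """Merge a preset with per-request overrides.

    Keys inside `extra_body` are merged rather than replaced, so a caller
    can change `top_k` without dropping `repetition_penalty`.
    """
    kwargs = {"model": "LiquidAI/LFM2.5-1.2B-Thinking", "messages": messages}
    kwargs.update({k: v for k, v in preset.items() if k != "extra_body"})
    extra = dict(preset.get("extra_body", {}))  # copy so the preset is not mutated
    extra.update(overrides.pop("extra_body", {}))
    kwargs.update(overrides)
    kwargs["extra_body"] = extra
    return kwargs


request = with_preset(
    [{"role": "user", "content": "Prove that 17 is prime."}],
    max_tokens=4096,           # leave more room for the <think> block
    extra_body={"top_k": 20},  # override a single vLLM-specific sampler
)
print(request["max_tokens"], request["extra_body"])
```

The resulting dictionary can be passed as client.chat.completions.create(**request), which leaves the vllm serve command free of workload-specific sampling choices.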