LiquidAI/LFM2.5-350M
Liquid AI's smallest LFM2.5 chat model (350M) on the LFM2 hybrid conv+attention backbone — tool calling and 128K context, light enough for edge GPUs.
350M hybrid instruct model with tool calling — fits any GPU, ideal for edge / on-device serving
Overview
LFM2.5-350M is the smallest chat model in Liquid AI's LFM2.5 family. Like its larger siblings, it is built on the LFM2 hybrid backbone: short-range gated convolution blocks interleaved with grouped-query attention. That design keeps memory use and latency low enough for edge and on-device deployment while still supporting tool calling and a 128K context window, all served through vLLM's OpenAI-compatible API.
Key Features
- Hybrid backbone: Gated short convolutions interleaved with grouped-query attention, giving a smaller KV cache and lower decode latency than a same-size full-attention transformer.
- 128K context: Long-context support (max_position_embeddings = 128000).
- Tool calling: Pythonic tool calls (<|tool_call_start|>…<|tool_call_end|>) surfaced as OpenAI tool_calls by vLLM's native lfm2 parser.
- Native vLLM support: Served via the Lfm2ForCausalLM architecture; no --trust-remote-code required.
- Edge-ready: ~0.7 GB of BF16 weights; runs on commodity and on-device GPUs.
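The ~0.7 GB figure follows from 350M parameters at 2 bytes each in BF16, and the KV-cache saving comes from only the attention layers storing keys and values. A rough sketch of both calculations follows. The layer counts, KV-head count, and head dimension are hypothetical placeholders, not values read from the model's config, and the small fixed-size state of the convolution layers is ignored.

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (BF16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9


def kv_bytes_per_token(attn_layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes per token: one key and one value vector per attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical shapes for illustration only; read the real ones from config.json.
TOTAL_LAYERS, ATTN_LAYERS, KV_HEADS, HEAD_DIM = 16, 6, 8, 64

# Hybrid: only the attention layers contribute to the KV cache.
hybrid = kv_bytes_per_token(ATTN_LAYERS, KV_HEADS, HEAD_DIM)
# Full attention: every layer would contribute.
full = kv_bytes_per_token(TOTAL_LAYERS, KV_HEADS, HEAD_DIM)

print(f"weights: ~{weight_gb(350e6):.1f} GB")
print(f"KV cache per token: {hybrid} B hybrid vs {full} B full attention")
```

Multiplying the per-token figure by the context length and the number of concurrent sequences gives a first-order estimate of how much VRAM the cache needs.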
Supported Variants
Dense:
- LiquidAI/LFM2.5-350M (350M)
- LiquidAI/LFM2.5-1.2B-Instruct (1.2B)
- LiquidAI/LFM2.5-1.2B-Thinking (1.2B, reasoning)
- LiquidAI/LFM2.5-1.2B-JP / LiquidAI/LFM2.5-1.2B-JP-202606 (Japanese)
- LiquidAI/LFM2.5-1.2B-Base (pretrained base)
MoE:
- LiquidAI/LFM2.5-8B-A1B (8B total / ~1B active)
Vision-Language:
- LiquidAI/LFM2.5-VL-450M
- LiquidAI/LFM2.5-VL-1.6B
See the LFM2.5 usage guide for the full family.
Prerequisites
- Hardware: 1× GPU (~1 GB VRAM for weights; any modern GPU works). Verified on H100.
- vLLM: ≥ 0.23.0 — the LFM2 architecture ships in the 0.23.0 stable release.
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
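After installing, the ≥ 0.23.0 requirement can be checked programmatically. The sketch below compares only the leading numeric part of the version string, so a pre-release such as 0.23.0rc1 would count as satisfying the minimum; the helper name meets_minimum is made up for this example.

```python
import re


def meets_minimum(installed: str, minimum: str = "0.23.0") -> bool:
    """Compare the leading numeric release segments of two version strings."""
    def release(v: str) -> tuple:
        # Keep only the leading "N.N.N" digits; suffixes such as "rc1" or "+cu128" are ignored.
        match = re.match(r"\d+(?:\.\d+)*", v)
        if match is None:
            raise ValueError(f"unrecognised version string: {v!r}")
        return tuple(int(part) for part in match.group().split("."))

    a, b = release(installed), release(minimum)
    # Pad to equal length so "0.23" compares equal to "0.23.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version
    try:
        print("vLLM OK:", meets_minimum(version("vllm")))
    except PackageNotFoundError:
        print("vLLM is not installed in this environment")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.9.2" would sort after "0.23.0".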
Deployment Configurations
Quick Start (Single GPU, BF16)
vllm serve LiquidAI/LFM2.5-350M
Full-Featured Server Launch
Enables tool calling:
vllm serve LiquidAI/LFM2.5-350M \
--enable-auto-tool-choice \
--tool-call-parser lfm2 \
--host 0.0.0.0 --port 8000
Docker (NVIDIA)
docker run -itd --name lfm2.5-350m \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model LiquidAI/LFM2.5-350M \
--host 0.0.0.0 --port 8000
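Whichever launch method is used, the server takes a moment to load weights before it accepts requests. A minimal readiness probe, assuming the server listens on localhost:8000 as in the commands above and exposes vLLM's /health endpoint; the function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def is_server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat all as "not ready yet".
        return False


def wait_for_server(base_url: str = "http://localhost:8000",
                    attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll the health endpoint until it responds or the attempts run out."""
    for _ in range(attempts):
        if is_server_ready(base_url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_server() before the first client request avoids connection errors during startup.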
Client Usage
Text Generation
The model card recommends temperature 0.1, top_k 50, and repetition_penalty 1.05. Because top_k and
repetition_penalty are not named parameters of the OpenAI client, pass them through extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-350M",
    messages=[{"role": "user", "content": "Give me three uses for baking soda."}],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
Tool Calling
Launch with --enable-auto-tool-choice --tool-call-parser lfm2, then pass tools=[…]; the
lfm2 parser converts the model's Pythonic call into a standard tool_calls array.
# Reuses the `client` created in the Text Generation example.
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-350M",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
# Parsed tool calls; None if the model answered in plain text instead.
print(response.choices[0].message.tool_calls)
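The response only names the tool and its arguments; running it is the client's job. The sketch below shows one way to dispatch a returned call and build the follow-up tool message. The get_weather implementation, the registry, and the stand-in fake_call object are all hypothetical, used so the example runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> dict:
    """Hypothetical local tool; a real one would call a weather service."""
    return {"location": location, "forecast": "sunny", "temp_c": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call, registry=TOOL_REGISTRY) -> dict:
    """Execute one OpenAI-style tool call and build the matching `tool` message."""
    name = tool_call.function.name
    if name not in registry:
        result = {"error": f"unknown tool: {name}"}
    else:
        # `arguments` arrives as a JSON-encoded string, not a dict.
        kwargs = json.loads(tool_call.function.arguments or "{}")
        result = registry[name](**kwargs)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}


# Stand-in for response.choices[0].message.tool_calls[0], so the sketch runs offline.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

In a real exchange, append the assistant message and then tool_message to messages and call chat.completions.create again so the model can phrase the final answer.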
Structured Outputs
vLLM's guided decoding constrains output to a JSON schema via response_format. The model does
not see the schema itself — put semantic instructions in the system prompt.
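As an illustration, the request below pairs a response_format schema with a system prompt that restates what each field means. The schema, field names, and the structured_request helper are assumptions made for this sketch; the json_schema form of response_format follows the OpenAI API convention, which vLLM's server is assumed to accept.

```python
import json

# Hypothetical schema for illustration; replace with your own.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temp_c": {"type": "number"},
    },
    "required": ["city", "temp_c"],
}


def structured_request(user_text: str, schema: dict,
                       schema_name: str = "weather_report") -> dict:
    """Build chat.completions kwargs that constrain output to `schema`."""
    return {
        "model": "LiquidAI/LFM2.5-350M",
        "messages": [
            # The schema is enforced at decode time but the model never reads it,
            # so the field meanings are spelled out in the system prompt.
            {"role": "system",
             "content": "Reply with a JSON object: city (name of the city) "
                        "and temp_c (temperature in Celsius)."},
            {"role": "user", "content": user_text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema_name, "schema": schema},
        },
        "temperature": 0.1,
    }


kwargs = structured_request("It is 18 degrees in Lyon today.", WEATHER_SCHEMA)
print(json.dumps(kwargs["response_format"], indent=2))
# With a running server: client.chat.completions.create(**kwargs)
```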
Configuration Tips
- At 350M the model fits any GPU; raise --max-num-seqs to push batch throughput.
- Set --max-model-len to match your workload (up to 128K).
- --gpu-memory-utilization 0.90–0.95 maximizes KV cache capacity.
- Sampling presets are per-request client defaults; don't bake them into vllm serve.
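One way to keep the sampling preset on the client side is a small helper that merges per-request overrides and routes non-OpenAI parameters into extra_body. This is a sketch: the helper name and the short OPENAI_NATIVE set are assumptions, and the preset values are the ones quoted from the model card above.

```python
# Recommended values quoted from the model card, kept client-side.
LFM_SAMPLING = {"temperature": 0.1, "top_k": 50, "repetition_penalty": 1.05}

# Parameters passed to the OpenAI Python client as named arguments; anything
# else goes through extra_body. This set is deliberately small and is an assumption.
OPENAI_NATIVE = {"temperature", "top_p", "max_tokens"}


def sampling_kwargs(**overrides) -> dict:
    """Merge per-request overrides into the preset and split native vs extra_body."""
    merged = {**LFM_SAMPLING, **overrides}
    native = {k: v for k, v in merged.items() if k in OPENAI_NATIVE}
    extra = {k: v for k, v in merged.items() if k not in OPENAI_NATIVE}
    if extra:
        native["extra_body"] = extra
    return native


print(sampling_kwargs())
print(sampling_kwargs(temperature=0.7, max_tokens=256))
```

A request then becomes client.chat.completions.create(model=..., messages=..., **sampling_kwargs()), and individual calls can still override any value.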