LiquidAI/LFM2.5-1.2B-Instruct

Liquid AI's 1.2B instruction-tuned model on the LFM2 hybrid conv+attention backbone, with tool calling and a 128K context window on a single small GPU.

dense · 1.2B · 128,000 ctx · vLLM 0.23.0+ · text

Overview

LFM2.5-1.2B-Instruct is a 1.2B instruction-tuned model from Liquid AI, built on the LFM2 hybrid backbone: short-range gated convolution blocks interleaved with grouped-query attention. The hybrid design keeps the KV cache small and decode fast, so the model serves comfortably on a single small GPU while supporting a 128K context window and Pythonic tool calling — all through vLLM's OpenAI-compatible API.

Key Features

  • Hybrid backbone: Gated short convolutions interleaved with grouped-query attention — a smaller KV cache and lower decode latency than a same-size full-attention transformer.
  • 128K context: Long-context support (max_position_embeddings = 128000).
  • Tool calling: Pythonic tool calls (<|tool_call_start|>…<|tool_call_end|>) surfaced as OpenAI tool_calls by vLLM's native lfm2 parser.
  • Native vLLM support: Served via the Lfm2ForCausalLM architecture — no --trust-remote-code required.
  • Edge-ready family: Day-one support across llama.cpp, MLX, ONNX, vLLM, and SGLang.

Supported Variants

Dense:

  • LiquidAI/LFM2.5-350M (350M)
  • LiquidAI/LFM2.5-1.2B-Instruct (1.2B)
  • LiquidAI/LFM2.5-1.2B-Thinking (1.2B, reasoning)
  • LiquidAI/LFM2.5-1.2B-JP / LiquidAI/LFM2.5-1.2B-JP-202606 (Japanese)
  • LiquidAI/LFM2.5-1.2B-Base (pretrained base)

MoE:

  • LiquidAI/LFM2.5-8B-A1B (8B total / ~1B active)

Vision-Language:

  • LiquidAI/LFM2.5-VL-450M, LiquidAI/LFM2.5-VL-1.6B

See the LFM2.5 usage guide for the full family.

Prerequisites

  • Hardware: 1× GPU with ≥8 GB VRAM (e.g. L4, A10, RTX 4090, H100). Verified on H100.
  • vLLM: ≥ 0.23.0 — the LFM2 architecture ships in the 0.23.0 stable release.

pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Deployment Configurations

Quick Start (Single GPU, BF16)

vllm serve LiquidAI/LFM2.5-1.2B-Instruct

Cap the context to fit a smaller GPU (the model supports up to 128K):

vllm serve LiquidAI/LFM2.5-1.2B-Instruct --max-model-len 32768

Enable tool calling:

vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser lfm2 \
  --host 0.0.0.0 --port 8000

Docker (NVIDIA)

docker run -itd --name lfm2.5-1.2b \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
    --model LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 --port 8000
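
Model loading takes a while after either launch method, so a client may want to wait before sending requests. The following is a minimal sketch, assuming the server listens on localhost:8000 as in the commands above and answers on vLLM's /health route; the helper names (http_probe, wait_until_ready) and the timeout values are illustrative, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(base_url="http://localhost:8000", timeout_s=300.0,
                     interval_s=2.0, probe=http_probe, sleep=time.sleep):
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() before the first request returns True once the server responds and False if the (assumed) five-minute budget runs out.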

Client Usage

Text Generation

The model card recommends temperature 0.1, top_k 50, repetition_penalty 1.05. top_k and repetition_penalty are vLLM extra sampling params — pass them via extra_body.

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is C. elegans? Answer in one sentence."}],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
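
Because top_k and repetition_penalty travel in extra_body while temperature is a regular argument, it can help to keep the recommended values in one client-side helper. This is a sketch only; RECOMMENDED, VLLM_EXTRA_KEYS, and sampling_kwargs are hypothetical names introduced for illustration.

```python
# Model-card recommendations quoted above, kept on the client side.
RECOMMENDED = {"temperature": 0.1, "top_k": 50, "repetition_penalty": 1.05}

# Sampling fields outside the OpenAI schema; these are sent via extra_body.
VLLM_EXTRA_KEYS = {"top_k", "repetition_penalty"}


def sampling_kwargs(**overrides):
    """Merge per-request overrides into the presets, then split the result
    into standard OpenAI arguments and vLLM-specific extra_body fields."""
    merged = {**RECOMMENDED, **overrides}
    standard = {k: v for k, v in merged.items() if k not in VLLM_EXTRA_KEYS}
    extra = {k: v for k, v in merged.items() if k in VLLM_EXTRA_KEYS}
    return {**standard, "extra_body": extra}
```

A request would then look like client.chat.completions.create(model=..., messages=..., **sampling_kwargs(temperature=0.3)), overriding only what differs from the presets.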

Tool Calling

Launch with --enable-auto-tool-choice --tool-call-parser lfm2, then pass tools=[…]. The lfm2 parser converts the model's Pythonic call into a standard tool_calls array.

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]},
        },
    }],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.tool_calls)
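
The response above only reports which function the model wants to call; running it and returning the result is left to the client. A minimal sketch of that step follows, assuming the standard OpenAI tool_calls shape (id, function.name, function.arguments as a JSON string). The get_weather body is a placeholder, and TOOLS, run_tool_call, and tool_messages are hypothetical helper names.

```python
import json


def get_weather(location: str) -> dict:
    """Stand-in tool: a real implementation would query a weather service."""
    return {"location": location, "forecast": "unknown (placeholder)"}


# Registry mapping tool names declared in `tools=[...]` to Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one parsed tool call and return its result as a JSON string."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(TOOLS[name](**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the function signature.
        return json.dumps({"error": str(exc)})


def tool_messages(tool_calls) -> list:
    """Build the role="tool" messages to append before the follow-up request."""
    return [
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        }
        for call in tool_calls or []
    ]
```

In the usual OpenAI-style loop, the assistant message and the output of tool_messages(response.choices[0].message.tool_calls) are appended to messages, and a second create call lets the model phrase the final answer.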

Structured Outputs

vLLM's guided decoding constrains output to a JSON schema via response_format. The model does not see the schema itself — put semantic instructions (units, formatting) in the system prompt.
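
As an illustration, the request arguments for a schema-constrained reply might be assembled as below. This is a sketch under assumptions: WEATHER_SCHEMA, structured_request, and parse_reply are hypothetical names, the schema is invented for the example, and the response_format payload follows the OpenAI json_schema shape.

```python
import json

# Hypothetical schema for illustration; any JSON Schema object can be used.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
    "additionalProperties": False,
}


def structured_request(user_text: str) -> dict:
    """Build chat-completion keyword arguments constrained to WEATHER_SCHEMA."""
    return {
        "model": "LiquidAI/LFM2.5-1.2B-Instruct",
        "messages": [
            # The schema constrains decoding but is not shown to the model,
            # so units and formatting are spelled out in the system prompt.
            {"role": "system",
             "content": "Reply as JSON with the city name and the temperature in degrees Celsius."},
            {"role": "user", "content": user_text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "weather_report", "schema": WEATHER_SCHEMA},
        },
        "temperature": 0.1,
        "extra_body": {"top_k": 50, "repetition_penalty": 1.05},
    }


def parse_reply(content: str) -> dict:
    """Decode the constrained reply and check that the required keys exist."""
    data = json.loads(content)
    missing = [k for k in WEATHER_SCHEMA["required"] if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

With a running server, response = client.chat.completions.create(**structured_request("It is 18.5 degrees in Paris.")) followed by parse_reply(response.choices[0].message.content) would yield a plain dict.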

Configuration Tips

  • Set --max-model-len to match your workload (up to 128K); a lower value shrinks the KV cache each sequence can claim, so more requests fit in the same VRAM.
  • --gpu-memory-utilization 0.90–0.95 maximizes KV cache capacity.
  • FP8 KV cache (--kv-cache-dtype fp8) roughly halves KV memory relative to the BF16 cache.
  • Sampling presets are per-request client defaults — don't bake them into vllm serve.
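
The effect of context length and cache dtype on memory can be estimated with simple arithmetic. The sketch below uses placeholder dimensions that are NOT taken from this model's configuration (read the real values from its config.json); it only illustrates why halving the bytes per element halves the cache, and why only the attention layers contribute.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of key/value cache for `tokens` tokens across the attention layers."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


# Placeholder dimensions for illustration only, not the model's real config.
# In a hybrid backbone, only the attention layers are counted here.
ATTN_LAYERS, KV_HEADS, HEAD_DIM = 6, 8, 64

bf16 = kv_cache_bytes(32768, ATTN_LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8 = kv_cache_bytes(32768, ATTN_LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
print(f"BF16: {bf16 / 2**20:.0f} MiB, FP8: {fp8 / 2**20:.0f} MiB per 32K-token sequence")
```

The estimate scales linearly with tokens, so the same function shows how a lower --max-model-len reduces the worst-case footprint of a single sequence.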
