
LiquidAI/LFM2.5-1.2B-Base

Liquid AI's 1.2B pretrained base model on the LFM2 hybrid conv+attention backbone — a text-completion and fine-tuning foundation (no chat template).


Dense | 1.2B parameters | 128,000-token context | vLLM 0.23.0+ | Text

Overview

LFM2.5-1.2B-Base is the pretrained base checkpoint behind Liquid AI's LFM2.5-1.2B instruct models. It is built on the LFM2 hybrid backbone, which interleaves short-range gated convolution blocks with grouped-query attention (GQA) blocks. The model is not instruction-tuned and has no chat template. You can either send raw text to the /v1/completions endpoint for continuation, or fine-tune the model for your own task.

Key Features

  • Pretrained base: No chat template and no instruction tuning, which makes it a clean foundation for fine-tuning or raw completion.
  • Hybrid backbone: Gated short convolutions are interleaved with grouped-query attention. Only the attention layers keep a per-token KV cache, so the cache is smaller and decode latency is lower than in a full-attention transformer of the same size.
  • 128K context: Long-context support (max_position_embeddings = 128000).
  • Native vLLM support: Served through the Lfm2ForCausalLM architecture, so --trust-remote-code is not required.
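You can estimate the KV-cache saving of the hybrid backbone by hand. Only attention layers store keys and values for every token. The convolution blocks carry a small fixed-size state that does not grow with sequence length, and this sketch ignores it. The layer split, KV-head count and head dimension below are hypothetical placeholders, not values read from the LFM2.5-1.2B-Base config. Substitute the real numbers from the model's config.json.

```python
# Back-of-the-envelope per-sequence KV-cache size: full attention vs. hybrid conv+attention.
# All model dimensions below are HYPOTHETICAL placeholders, not LFM2.5-1.2B's real config.

def kv_cache_bytes(num_attn_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the K and V tensors across all attention layers for one sequence."""
    # Factor 2 = one K tensor plus one V tensor per attention layer; BF16 = 2 bytes/element.
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

TOTAL_LAYERS = 16        # hypothetical layer count
ATTN_LAYERS_HYBRID = 6   # hypothetical: the other 10 layers are short convolutions
SEQ_LEN = 32_768

full = kv_cache_bytes(TOTAL_LAYERS, num_kv_heads=8, head_dim=64, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(ATTN_LAYERS_HYBRID, num_kv_heads=8, head_dim=64, seq_len=SEQ_LEN)

print(f"full attention: {full / 2**20:.0f} MiB")   # full attention: 1024 MiB
print(f"hybrid:         {hybrid / 2**20:.0f} MiB")  # hybrid:         384 MiB
print(f"ratio:          {hybrid / full:.3f}")       # ratio:          0.375
```

With 6 of 16 layers using attention, the cache is 6/16 of the full-attention figure. That ratio depends only on the layer split, not on sequence length.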

Supported Variants

Dense:

  • LiquidAI/LFM2.5-350M (350M)
  • LiquidAI/LFM2.5-1.2B-Instruct (1.2B)
  • LiquidAI/LFM2.5-1.2B-Thinking (1.2B, reasoning)
  • LiquidAI/LFM2.5-1.2B-JP / LiquidAI/LFM2.5-1.2B-JP-202606 (Japanese)
  • LiquidAI/LFM2.5-1.2B-Base (pretrained base)

MoE:

  • LiquidAI/LFM2.5-8B-A1B (8B total / ~1B active)

Vision-Language:

  • LiquidAI/LFM2.5-VL-450M, LiquidAI/LFM2.5-VL-1.6B

See the LFM2.5 usage guide for the full family.

Prerequisites

  • Hardware: 1× GPU with ≥8 GB VRAM. Verified on H100.
  • vLLM: ≥ 0.23.0. The LFM2 architecture ships in the 0.23.0 stable release.

Install with uv pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
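After installing, you can quickly check that the environment meets the vLLM ≥ 0.23.0 requirement. This is a sketch that uses only the standard library (importlib.metadata). The helper names are hypothetical. Pre-release suffixes are ignored, so 0.23.0rc1 is treated as 0.23.0.

```python
# Sanity check: is the installed vLLM at least the documented minimum (0.23.0)?
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 23, 0)

def parse_release(v: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '0.23.1rc1' -> (0, 23, 1)."""
    m = re.match(r"(\d+(?:\.\d+)*)", v)
    if m is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    parts = tuple(int(p) for p in m.group(1).split("."))
    # Pad to three components so '0.23' compares like '0.23.0'.
    return parts + (0,) * (3 - len(parts))

def vllm_is_recent_enough() -> bool:
    """True if vLLM is installed and its release segment is >= MIN_VLLM."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        return False
    return parse_release(installed) >= MIN_VLLM
```

Run print(vllm_is_recent_enough()) inside the activated .venv. False means vLLM is either missing or older than 0.23.0.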

Deployment Configurations

Quick Start (Single GPU, BF16)

vllm serve LiquidAI/LFM2.5-1.2B-Base

Docker (NVIDIA)

docker run -itd --name lfm2.5-base \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
    --model LiquidAI/LFM2.5-1.2B-Base \
    --host 0.0.0.0 --port 8000
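The container runs detached (-itd), and the first start downloads weights, so the server may not accept requests right away. This minimal readiness check polls the server's /health endpoint on port 8000 (as configured above) until it answers HTTP 200. The same helper works for the bare vllm serve quick start. The function name and timeout values are illustrative choices.

```python
# Poll a vLLM OpenAI-compatible server until its /health endpoint returns HTTP 200.
import http.client
import time
import urllib.request

def wait_for_vllm(base_url: str = "http://localhost:8000", timeout_s: float = 300.0,
                  interval_s: float = 2.0) -> bool:
    """Return True once /health answers 200, or False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout or non-2xx: the server is still starting up.
            pass
        time.sleep(interval_s)
    return False
```

A typical use is `if not wait_for_vllm(): raise SystemExit("vLLM did not become ready")` before sending the first request.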

Client Usage

Text Completion

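This is a base model, so use the completions endpoint rather than chat completions. Recommended sampling: temperature 0.3, min_p 0.15, repetition_penalty 1.05. min_p and repetition_penalty are vLLM sampling parameters outside the standard OpenAI schema, so the Python client passes them through extra_body.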

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Base",
    prompt="The three laws of thermodynamics are:",
    max_tokens=128,
    temperature=0.3,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].text)
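A base model only continues text, so you steer it with the prompt's format rather than with instructions. This is a few-shot sketch, assuming a hypothetical sentiment-labeling task. A short run of labeled examples ends with an open "Sentiment:" slot, and a newline stop sequence ends generation after the label. EXAMPLES, build_few_shot_prompt and classify are illustrative names. The completions call reuses the endpoint, model name and sampling settings from the example above.

```python
# Few-shot prompting sketch for a base model (hypothetical sentiment task).
EXAMPLES = [
    ("The battery lasts all day and charges fast.", "positive"),
    ("The screen cracked within a week.", "negative"),
]

def build_few_shot_prompt(examples, query: str) -> str:
    """Join labeled examples, then leave the final 'Sentiment:' slot open for the model."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

def classify(client, query: str) -> str:
    """Fill the open slot. `client` is an openai.OpenAI instance pointed at the vLLM server."""
    response = client.completions.create(
        model="LiquidAI/LFM2.5-1.2B-Base",
        prompt=build_few_shot_prompt(EXAMPLES, query),
        max_tokens=5,
        temperature=0.3,
        stop=["\n"],  # end at the label line instead of inventing another example
        extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
    )
    return response.choices[0].text.strip()
```

With the client from the previous example, classify(client, "Setup was painless and it just works.") returns whatever the model emits after "Sentiment:". Nothing forces that to be one of the example labels, so validate it before relying on it.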

Configuration Tips

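  • Use /v1/completions (raw continuation). A base model has no chat template, so the chat endpoint has no template to apply.
  • For fine-tuning, this checkpoint is the recommended starting point for custom LFM2.5-1.2B tasks.
  • Set --max-model-len to match your workload (up to 128,000 tokens). Lowering it reduces the KV-cache memory needed for a single maximum-length request.
  • Sampling presets are per-request client settings. Send them with each request rather than baking them into the vllm serve command.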
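One way to keep the recommended preset on the client side is to merge it into each request body, letting per-request overrides win. When the OpenAI client is given extra_body, it merges those keys into the JSON it sends. On the wire, min_p and repetition_penalty therefore sit next to temperature as top-level fields, which is what this standard-library sketch builds directly. BASE_PRESET, completion_body and post_completion are hypothetical names. The sketch assumes the server from the deployment section on localhost:8000, started without --api-key.

```python
# Client-side sampling preset merged into raw /v1/completions request bodies.
import json
import urllib.request

# Recommended sampling for this base model, kept in the client rather than in `vllm serve`.
BASE_PRESET = {"temperature": 0.3, "min_p": 0.15, "repetition_penalty": 1.05}

def completion_body(prompt: str, max_tokens: int = 128, **overrides) -> dict:
    """Build a JSON body: model/prompt, then the preset, then per-request overrides."""
    body = {"model": "LiquidAI/LFM2.5-1.2B-Base", "prompt": prompt, "max_tokens": max_tokens}
    body.update(BASE_PRESET)
    body.update(overrides)  # e.g. temperature=0.0 for a deterministic run
    return body

def post_completion(body: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the body to a running vLLM server and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]
```

For example, post_completion(completion_body("The three laws of thermodynamics are:", temperature=0.0)) sends the preset with a greedy temperature override.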

References