LiquidAI/LFM2.5-VL-450M

Liquid AI's smallest vision-language model (450M) — LFM2 hybrid LM backbone plus a SigLIP2 vision tower for image+text chat, light enough for edge GPUs.

Tags: dense, 450M, 128,000 ctx, vLLM 0.23.0+, multimodal
Overview

LFM2.5-VL-450M is the smallest vision-language model in Liquid AI's LFM2.5-VL family. It pairs a SigLIP2 vision encoder with the LFM2 hybrid language backbone (short-range gated convolution blocks interleaved with grouped-query attention). It is light enough for edge and on-device image understanding and is served through vLLM's OpenAI-compatible API.

Key Features

  • Vision-language: SigLIP2 vision encoder on top of the LFM2 hybrid language backbone — single- and multi-image prompts.
  • Edge-ready: ~1 GB of BF16 weights — runs on commodity and on-device GPUs.
  • Hybrid LM backbone: Gated short convolutions interleaved with grouped-query attention. Only the attention layers keep a KV cache, so the cache is smaller and decode latency lower than in a same-size full-attention transformer.
  • 128K context: Long-context support (text_config.max_position_embeddings = 128000).
  • Native vLLM support: Served via the Lfm2VlForConditionalGeneration architecture — no --trust-remote-code required.
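
The "~1 GB of BF16 weights" figure above follows from simple arithmetic, since BF16 stores each parameter in 2 bytes. A rough sketch, assuming the nominal 450M parameter count (the exact count is not stated here, and runtime overhead such as activations and the KV cache is not included):

```python
# Back-of-the-envelope weight footprint.
# Assumption: 450M is the nominal model size, not an exact parameter count.
NOMINAL_PARAMS = 450_000_000
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def weight_footprint_gb(num_params: int, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(f"{weight_footprint_gb(NOMINAL_PARAMS):.1f} GB")  # prints "0.9 GB"
```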

Supported Variants

Vision-Language:

  • LiquidAI/LFM2.5-VL-450M (450M)
  • LiquidAI/LFM2.5-VL-1.6B (1.6B)

Text (same LFM2.5 family):

  • Dense: LiquidAI/LFM2.5-350M, LiquidAI/LFM2.5-1.2B-Instruct, LiquidAI/LFM2.5-1.2B-Thinking, LiquidAI/LFM2.5-1.2B-JP, LiquidAI/LFM2.5-1.2B-JP-202606, LiquidAI/LFM2.5-1.2B-Base
  • MoE: LiquidAI/LFM2.5-8B-A1B

See the LFM2.5 usage guide for the full family.

Prerequisites

  • Hardware: 1× GPU (~1–2 GB VRAM for weights; any modern GPU works). Verified on H100.
  • vLLM: ≥ 0.23.0 — the Lfm2VlForConditionalGeneration architecture ships in the 0.23.0 stable release.

pip (NVIDIA CUDA)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Deployment Configurations

Quick Start (Single GPU, BF16)

vllm serve LiquidAI/LFM2.5-VL-450M
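
Before sending requests, it can help to confirm that the server has finished loading. The sketch below queries the OpenAI-compatible model listing endpoint using only the standard library. It assumes the server listens on the default http://localhost:8000; the helper names `models_url` and `served_model_ids` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL from a server base such as http://localhost:8000."""
    return base_url.rstrip("/") + "/v1/models"


def served_model_ids(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list[str]:
    """Return the model IDs the server reports; raises if the server is unreachable."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    # The OpenAI-style response wraps the models in a "data" list.
    return [entry["id"] for entry in payload["data"]]
```

Once the server is up, `served_model_ids()` should return a list that includes LiquidAI/LFM2.5-VL-450M.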

Multiple Images per Request

vllm serve LiquidAI/LFM2.5-VL-450M \
  --limit-mm-per-prompt '{"image": 4}' \
  --host 0.0.0.0 --port 8000

Docker (NVIDIA)

docker run -itd --name lfm2.5-vl-450m \
  --ipc=host --network host --shm-size 16G --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
    --model LiquidAI/LFM2.5-VL-450M \
    --host 0.0.0.0 --port 8000

Client Usage

Image Understanding

Send an image + text turn via the OpenAI chat API. The model card recommends temperature 0.1, min_p 0.15, and repetition_penalty 1.05. Because min_p and repetition_penalty are not standard OpenAI parameters, pass them through extra_body.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-450M",
    messages=[{
        "role": "user",
        "content": [
            # One image block followed by the text question about it.
            {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
    temperature=0.1,
    # vLLM-specific sampling parameters go in extra_body.
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
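
For edge and on-device use, the image is often a local file rather than a public URL. One common approach is to embed the bytes as a base64 data URL in the same image_url field. This is a minimal sketch under that assumption; `to_data_url` and `image_part_from_file` are hypothetical helper names, and the MIME type falls back to image/jpeg when it cannot be guessed from the file name.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part_from_file(path: str) -> dict:
    """Read a local image and wrap it as an OpenAI-style image_url content part."""
    mime_type = mimetypes.guess_type(path)[0] or "image/jpeg"
    url = to_data_url(Path(path).read_bytes(), mime_type)
    return {"type": "image_url", "image_url": {"url": url}}
```

The returned dictionary can replace the remote image_url block in the request above.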

Multiple Images

Launch with --limit-mm-per-prompt '{"image": N}', then include several image_url blocks in one message to compare or reason across images.
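
A multi-image request might look like the sketch below. The helper names `build_multi_image_message` and `compare_images` are illustrative, and the image URLs passed in are up to the caller. It assumes the server was launched with an image limit at least as large as the number of URLs.

```python
def build_multi_image_message(image_urls, prompt):
    """Build one user message holding several image blocks followed by a text prompt."""
    content = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


def compare_images(image_urls, prompt, base_url="http://localhost:8000/v1"):
    """Send the images and prompt to the server and return the model's reply text."""
    # Imported here so the payload helper above works without the openai package.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model="LiquidAI/LFM2.5-VL-450M",
        messages=[build_multi_image_message(image_urls, prompt)],
        temperature=0.1,
        extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
    )
    return response.choices[0].message.content
```

For example, `compare_images([url_a, url_b], "What differs between these two images?")` sends both images in a single turn.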

Configuration Tips

  • At 450M parameters (~1 GB of BF16 weights), the model fits on virtually any modern GPU, which makes it well suited to edge and on-device image understanding.
  • --limit-mm-per-prompt '{"image": N}' caps images per request (default 1).
  • Set --max-model-len to match your workload (up to 128,000 tokens).
  • Sampling presets are per-request client defaults — don't bake them into vllm serve.
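
One way to keep the sampling presets on the client side is a small helper that merges the recommended values with per-request overrides and routes the vLLM-specific ones into extra_body. This is a sketch, not part of vLLM or the OpenAI client; `sampling_kwargs` and the two constants are hypothetical names.

```python
# Recommended values from the model card, kept client-side.
RECOMMENDED_SAMPLING = {"temperature": 0.1, "min_p": 0.15, "repetition_penalty": 1.05}
# Parameters outside the OpenAI schema, which must travel in extra_body.
VLLM_ONLY = {"min_p", "repetition_penalty"}


def sampling_kwargs(**overrides):
    """Merge overrides into the presets and split them for chat.completions.create."""
    merged = {**RECOMMENDED_SAMPLING, **overrides}
    kwargs = {k: v for k, v in merged.items() if k not in VLLM_ONLY}
    kwargs["extra_body"] = {k: v for k, v in merged.items() if k in VLLM_ONLY}
    return kwargs
```

A request can then be written as `client.chat.completions.create(model=..., messages=..., **sampling_kwargs(temperature=0.0))`, which changes one value for that call and leaves the other presets in place.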
