LiquidAI/LFM2.5-VL-1.6B
Liquid AI's 1.6B vision-language model — LFM2 hybrid LM backbone plus a SigLIP2 vision tower for image+text chat on a single small GPU.
Overview
LFM2.5-VL-1.6B is the larger vision-language model in Liquid AI's LFM2.5-VL family. It pairs the LFM2 hybrid language backbone (short-range gated convolution blocks interleaved with grouped-query attention) with a SigLIP2 vision encoder for image understanding, while remaining small enough to serve on a single commodity GPU through vLLM's OpenAI-compatible API.
Key Features
- Vision-language: a SigLIP2 vision encoder feeds the LFM2 hybrid language backbone; prompts may contain one image or several.
- Hybrid LM backbone: gated short convolutions interleaved with grouped-query attention. Only the attention layers keep a per-token KV cache, so the cache is smaller and decode latency lower than in a full-attention transformer of the same size.
- 128K context: long-context support (text_config.max_position_embeddings = 128000).
- Native vLLM support: served via the Lfm2VlForConditionalGeneration architecture, so no --trust-remote-code is required.
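The KV-cache saving follows from the standard per-token formula: every attention layer stores one key and one value vector per KV head for each token, while a convolution layer is assumed here to keep only a small fixed-size state. A minimal sketch; the layer count, attention-layer count, KV-head count, and head dimension below are hypothetical placeholders, not values read from the LFM2.5-VL-1.6B config, so substitute the real text_config numbers for an actual estimate.

```python
def kv_cache_bytes_per_token(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Per-token KV cache: one K and one V vector per KV head in each attention layer.

    Convolution layers are not counted: they are assumed to keep a fixed-size
    state rather than a cache that grows with sequence length.
    bytes_per_elem=2 corresponds to BF16.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical, illustrative shapes only (NOT taken from the model config):
# a 16-layer stack in which 6 layers are attention, vs. the same stack all-attention.
N_LAYERS, N_ATTN, N_KV_HEADS, HEAD_DIM = 16, 6, 8, 64

hybrid = kv_cache_bytes_per_token(N_ATTN, N_KV_HEADS, HEAD_DIM)
full = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
print(f"hybrid: {hybrid} B/token, full attention: {full} B/token, ratio {hybrid / full:.3f}")
# prints: hybrid: 12288 B/token, full attention: 32768 B/token, ratio 0.375
```

With these placeholder shapes the cache shrinks in proportion to the fraction of layers that are attention (6 of 16 here), which is also what leaves more room for long contexts and large batches.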
Supported Variants
Vision-Language:
- LiquidAI/LFM2.5-VL-450M (450M)
- LiquidAI/LFM2.5-VL-1.6B (1.6B)
Text (same LFM2 family):
- Dense: LiquidAI/LFM2.5-350M, LiquidAI/LFM2.5-1.2B-Instruct, LiquidAI/LFM2.5-1.2B-Thinking, LiquidAI/LFM2.5-1.2B-JP, LiquidAI/LFM2.5-1.2B-JP-202606, LiquidAI/LFM2.5-1.2B-Base
- MoE: LiquidAI/LFM2.5-8B-A1B
See the LFM2.5 usage guide for the full family.
Prerequisites
- Hardware: 1× GPU with ≥8 GB VRAM. Verified on H100.
- vLLM: ≥ 0.23.0. The Lfm2VlForConditionalGeneration architecture ships in the 0.23.0 stable release.
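A rough check on the 8 GB figure: BF16 stores each parameter in 2 bytes, so the weights alone need about 3 GiB, and the remaining memory goes to the KV cache, activations, and runtime overhead. This sketch counts weights only and assumes the nominal 1.6B size from the model name; the exact parameter count may differ slightly.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone (BF16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30


# Assumption: 1.6e9 is the nominal size from the model name, not an exact count.
print(f"{weight_memory_gib(1.6e9):.2f} GiB")  # prints: 2.98 GiB
```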
pip (NVIDIA CUDA)
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
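To confirm the environment meets the ≥ 0.23.0 requirement after installing, a small stdlib-only sketch can read the installed version via importlib.metadata. The helper names (release_tuple, at_least) are hypothetical; the comparison only looks at the leading numeric release segment, so local tags like +cu128 and pre-release suffixes such as rc1 are ignored (0.23.0rc1 counts as 0.23.0).

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = "0.23.0"  # minimum stated in the prerequisites


def release_tuple(v: str) -> tuple[int, ...]:
    """Leading numeric release segment, e.g. '0.23.1+cu128' -> (0, 23, 1)."""
    m = re.match(r"(\d+(?:\.\d+)*)", v)
    if not m:
        raise ValueError(f"unparseable version: {v!r}")
    return tuple(int(part) for part in m.group(1).split("."))


def at_least(installed: str, minimum: str) -> bool:
    """True if installed >= minimum, comparing release numbers only."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so '0.23' compares equal to '0.23.0'.
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


if __name__ == "__main__":
    try:
        installed = version("vllm")
        print(f"vllm {installed}:", "OK" if at_least(installed, MIN_VLLM) else f"older than {MIN_VLLM}")
    except PackageNotFoundError:
        print("vllm is not installed in this environment")
```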
Deployment Configurations
Quick Start (Single GPU, BF16)
vllm serve LiquidAI/LFM2.5-VL-1.6B
Multiple Images per Request
vllm serve LiquidAI/LFM2.5-VL-1.6B \
--limit-mm-per-prompt '{"image": 4}' \
--host 0.0.0.0 --port 8000
Docker (NVIDIA)
docker run -itd --name lfm2.5-vl-1.6b \
--ipc=host --network host --shm-size 16G --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model LiquidAI/LFM2.5-VL-1.6B \
--host 0.0.0.0 --port 8000
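Whether started with vllm serve or the Docker image, the server needs time to download and load weights before it accepts requests. vLLM's OpenAI-compatible server exposes a GET /health endpoint; a minimal stdlib-only sketch that polls it before sending traffic, assuming the default http://localhost:8000 used above (wait_for_server is a hypothetical helper name):

```python
import http.client
import time
import urllib.request


def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll {base_url}/health until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # OSError covers URLError/HTTPError, refused connections, and timeouts:
            # the server is not ready yet (container starting or weights loading).
            pass
        time.sleep(interval_s)
    return False
```

Usage: call wait_for_server() once after launching, and only create the OpenAI client once it returns True.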
Client Usage
Image Understanding
Send an image + text turn through the OpenAI-compatible chat API. The model card recommends temperature 0.1, min_p 0.15, and repetition_penalty 1.05. min_p and repetition_penalty are not standard OpenAI parameters, so pass them in extra_body.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="LiquidAI/LFM2.5-VL-1.6B",
messages=[{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
{"type": "text", "text": "What is in this image?"},
],
}],
temperature=0.1,
extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
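For images on the client machine rather than at a public URL, vLLM's chat API also accepts base64 data: URLs in the image_url field. A minimal sketch; image_data_url is a hypothetical helper, and it assumes the file extension is enough to infer the image MIME type.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path: str) -> str:
    """Encode a local image file as a data: URL for an image_url content block."""
    mime, _ = mimetypes.guess_type(path)  # inferred from the file extension
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


# Usage with the client above ("photo.jpg" is a placeholder path):
# {"type": "image_url", "image_url": {"url": image_data_url("photo.jpg")}}
```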
Multiple Images
Launch with --limit-mm-per-prompt '{"image": N}', then include several image_url blocks in
one message to compare or reason across images.
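A minimal sketch of such a multi-image turn. multi_image_messages is a hypothetical helper; its limit argument should match the N given to --limit-mm-per-prompt, and it fails on the client side because the server is expected to reject requests carrying more images than that.

```python
def multi_image_messages(image_urls: list[str], prompt: str, limit: int) -> list[dict]:
    """One user turn: several image_url blocks followed by a single text prompt."""
    if len(image_urls) > limit:
        # Fail early instead of sending a request the server should refuse.
        raise ValueError(f"{len(image_urls)} images exceeds the server limit of {limit}")
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Usage against a server launched with --limit-mm-per-prompt '{"image": 4}':
# messages = multi_image_messages([url_a, url_b], "What differs between these images?", limit=4)
# client.chat.completions.create(model="LiquidAI/LFM2.5-VL-1.6B", messages=messages, temperature=0.1)
```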
Configuration Tips
- --limit-mm-per-prompt '{"image": N}' caps the number of images per request (default 1).
- Set --max-model-len to match your workload (up to 128K).
- --gpu-memory-utilization 0.90–0.95 lets vLLM claim a larger share of GPU memory, leaving more room for the KV cache.
- Sampling presets are per-request client defaults; don't bake them into vllm serve.
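One way to keep the recommended sampling on the client side is a small helper that returns per-request kwargs and merges overrides. A sketch only: LFM_VL_SAMPLING and sampling_kwargs are hypothetical names, and the preset values are the model-card recommendations quoted under Image Understanding.

```python
# Model-card recommendations; min_p and repetition_penalty travel in extra_body
# because they are not standard OpenAI parameters.
LFM_VL_SAMPLING = {
    "temperature": 0.1,
    "extra_body": {"min_p": 0.15, "repetition_penalty": 1.05},
}


def sampling_kwargs(**overrides) -> dict:
    """Per-request kwargs for client.chat.completions.create.

    Top-level overrides replace preset keys; an extra_body override is merged
    key by key, so changing min_p keeps repetition_penalty.
    """
    kwargs = {
        "temperature": LFM_VL_SAMPLING["temperature"],
        "extra_body": dict(LFM_VL_SAMPLING["extra_body"]),  # copy: never mutate the preset
    }
    extra = overrides.pop("extra_body", None)
    if extra:
        kwargs["extra_body"].update(extra)
    kwargs.update(overrides)
    return kwargs


# Usage:
# client.chat.completions.create(model="LiquidAI/LFM2.5-VL-1.6B", messages=messages,
#                                **sampling_kwargs(max_tokens=256))
```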