vLLM/Recipes
Jina AI

jinaai/jina-embeddings-v5-text-small

Jina AI's fifth-gen multilingual text embedding model (677M, Qwen3-0.6B-Base) with task-specific LoRA adapters for retrieval, text-matching, classification, and clustering.

71.7 MTEB English v2 / 67.7 MMTEB at <1B params, 119+ languages, 32K context

dense · 0.7B · 32,768 ctx · vLLM 0.20.0+ · embedding

Overview

jina-embeddings-v5-text-small is the fifth generation of Jina AI's multilingual text embedding family, released February 18, 2026. It scores 71.7 on MTEB English v2 and 67.7 on MMTEB with only 677M parameters — the highest among multilingual embedding models under 1B — and supports 119+ languages with up to 32K-token context. Built on Qwen3-0.6B-Base and trained by distilling Qwen3-Embedding-4B plus task-specific contrastive losses, it produces 1024-dim embeddings that stay robust under truncation (Matryoshka dims: 32–1024) and binary quantization.

vLLM support landed in PR #39575 via the JinaEmbeddingsV5Model architecture and the --runner pooling API.
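The overview notes that embeddings stay robust under binary quantization. A minimal sketch of what that means in practice, assuming simple sign-threshold binarization and Hamming-distance scoring (this is an illustration, not Jina AI's official quantization code):

```python
import numpy as np

def binarize(emb: np.ndarray) -> np.ndarray:
    """Map each dimension to one bit (>= 0 -> 1, < 0 -> 0), packed into bytes."""
    bits = (emb >= 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)  # 1024 floats -> 128 bytes

def hamming_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity = 1 - normalized Hamming distance between packed bit codes."""
    dist = np.unpackbits(a ^ b).sum()
    return 1.0 - dist / (a.size * 8)

rng = np.random.default_rng(0)
q = rng.standard_normal(1024).astype(np.float32)          # stand-in embedding
d = q + 0.1 * rng.standard_normal(1024).astype(np.float32)  # near-duplicate
print(hamming_sim(binarize(q), binarize(d)))  # high similarity for near-duplicates
```

At 1 bit per dimension, a 1024-dim float32 embedding shrinks from 4 KB to 128 bytes, and XOR + popcount scoring is cheap enough for brute-force search over millions of vectors.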

Task-specific adapters

v5 ships four LoRA adapters — one per supported task. For each task, Jina AI publishes a sibling repo with that adapter pre-merged into the base weights; these pre-merged repos are the simplest path for vLLM and are what this recipe serves. Pick the repo that matches your task:

Task            Pre-merged repo
Retrieval       jinaai/jina-embeddings-v5-text-small-retrieval
Text-matching   jinaai/jina-embeddings-v5-text-small-text-matching
Classification  jinaai/jina-embeddings-v5-text-small-classification
Clustering      jinaai/jina-embeddings-v5-text-small-clustering

Prerequisites

  • Hardware: any single GPU with ≥ 2 GB VRAM (T4 / L4 / A10 / A100 / H100 / H200 / B200 / MI300X all fine — bf16 weights are ~1.4 GB).

  • vLLM: requires a build that includes PR #39575. Use the nightly wheel until the next stable release ships:

    uv pip install -U vllm --pre \
      --extra-index-url https://wheels.vllm.ai/nightly \
      --index-strategy unsafe-best-match
    

Launch command

Serve the pre-merged repo for your chosen task. The full command for retrieval looks like:

vllm serve jinaai/jina-embeddings-v5-text-small-retrieval \
  --trust-remote-code \
  --runner pooling \
  --host 0.0.0.0 --port 8000

Swap the model id for -text-matching, -classification, or -clustering to serve the corresponding task.

Alternative: base repo + --hf-overrides

If you'd rather download a single checkpoint and switch tasks at startup, serve the base jinaai/jina-embeddings-v5-text-small repo and pass the task via --hf-overrides:

vllm serve jinaai/jina-embeddings-v5-text-small \
  --trust-remote-code \
  --runner pooling \
  --hf-overrides '{"jina_task": "retrieval"}'

Allowed values: retrieval, text-matching, classification, clustering. This loads the base weights and merges the requested adapter at startup.

Client usage

Embeddings (/v1/embeddings)

curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jinaai/jina-embeddings-v5-text-small-retrieval",
    "input": ["Query: What is climate change?"]
  }' | python3 -m json.tool

Python (OpenAI SDK)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v5-text-small-retrieval",
    input=[
        "Climate change has led to rising sea levels.",
        "Overview of climate change impacts on coastal cities.",
    ],
)
for d in resp.data:
    print(d.index, len(d.embedding))
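Once embeddings come back, retrieval boils down to ranking documents by cosine similarity against the query embedding. A generic numpy sketch — the small stand-in vectors below take the place of the 1024-dim vectors in `resp.data[i].embedding`:

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Return (indices sorted best-first, similarity scores) for docs vs. a query."""
    q = np.asarray(query_vec, dtype=np.float32)
    d = np.asarray(doc_vecs, dtype=np.float32)
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims), sims

query = [0.2, 0.9, 0.1]
docs = [[0.1, 0.8, 0.2],   # close to the query
        [0.9, 0.0, 0.1]]   # unrelated
order, sims = cosine_rank(query, docs)
print(order)  # doc 0 ranks first
```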

Multilingual text-matching example

Serve the -text-matching repo and embed semantically equivalent strings across languages:

texts = [
    "غروب جميل على الشاطئ",
    "海滩上美丽的日落",
    "A beautiful sunset over the beach",
    "Un beau coucher de soleil sur la plage",
    "浜辺に沈む美しい夕日",
]
resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v5-text-small-text-matching",
    input=texts,
)
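Text-matching similarity is symmetric, so a quick way to check cross-lingual agreement is an all-pairs cosine matrix. A sketch with small stand-in vectors in place of the real 1024-dim embeddings from `resp.data`:

```python
import numpy as np

def pairwise_cosine(vecs):
    """All-pairs cosine similarity: normalize rows, then one matmul."""
    v = np.asarray(vecs, dtype=np.float32)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return v @ v.T

# Stand-ins for resp.data[i].embedding; rows 0 and 1 play the role of
# translations of the same sentence, row 2 an unrelated text.
vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
sim = pairwise_cosine(vecs)
```

For well-matched translations you'd expect the off-diagonal entries between equivalent sentences to sit well above those against unrelated texts.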

Configuration tips

  • Pick the task that matches your workload — retrieval prompts (query vs. document) are baked into the retrieval adapter, so the wrong task measurably degrades recall.
  • Matryoshka truncation: embeddings stay useful when truncated to 32 / 64 / 128 / 256 / 512 / 768 dims — keep the prefix and renormalize.
  • Throughput: with TP=1 on a single small GPU, the bottleneck is usually tokenization — batch your inputs (input: [...] with up to a few hundred short docs per request).
  • bf16 vs fp16: the model README recommends bf16 on modern GPUs, while PR #39575's test used fp16. Either dtype works; bf16 is more numerically stable on Hopper / Blackwell / MI300X-class hardware.
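The Matryoshka tip above can be sketched in a few lines: keep the leading dimensions and renormalize to unit length (stand-in random vector; real embeddings are the 1024-dim vectors the server returns):

```python
import numpy as np

def truncate_matryoshka(emb, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and renormalize to unit length."""
    head = np.asarray(emb, dtype=np.float32)[:dims]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.standard_normal(1024).astype(np.float32)  # stand-in embedding
small = truncate_matryoshka(full, 256)               # 4x smaller index footprint
```

Renormalizing matters: downstream cosine-similarity code typically assumes unit vectors, and the truncated prefix alone is no longer unit-length.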

References