jinaai/jina-embeddings-v5-text-small
Jina AI's fifth-gen multilingual text embedding model (677M, Qwen3-0.6B-Base) with task-specific LoRA adapters for retrieval, text-matching, classification, and clustering.
71.7 MTEB English v2 / 67.7 MMTEB at <1B params, 119+ languages, 32K context
Overview
jina-embeddings-v5-text-small is the fifth generation of Jina AI's multilingual text embedding family, released
February 18, 2026. It scores 71.7 on MTEB English v2 and 67.7 on MMTEB with
only 677M parameters — the highest among multilingual embedding models under 1B —
and supports 119+ languages with up to 32K-token context. Built on
Qwen3-0.6B-Base and trained by distilling Qwen3-Embedding-4B plus task-specific
contrastive losses, it produces 1024-dim embeddings that stay robust under
truncation (Matryoshka dims: 32–1024) and binary quantization.
vLLM support landed in PR #39575 via the JinaEmbeddingsV5Model architecture and the pooling runner (--runner pooling).
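The binary-quantization claim is straightforward to exercise client-side. Below is a minimal numpy sketch of the standard sign-quantize-and-Hamming recipe (a generic technique, not an API of this model; the 1024-dim size follows the overview above):

import numpy as np

def binarize(vec):
    """Sign-quantize a float embedding into a packed bit vector."""
    bits = np.asarray(vec) > 0   # one bit per dimension
    return np.packbits(bits)     # 1024 floats -> 128 bytes

def hamming(a, b):
    """Hamming distance between packed bit vectors (lower = more similar)."""
    return int(np.unpackbits(a ^ b).sum())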
Task-specific adapters
v5 ships four LoRA adapters — one per supported task. For each task, Jina AI publishes a sibling repo with that adapter pre-merged into the base weights; these are the simplest path for vLLM and what this recipe serves. Pick a task above; the recipe swaps the model id accordingly:
| Task | Pre-merged repo |
|---|---|
| Retrieval | jinaai/jina-embeddings-v5-text-small-retrieval |
| Text-matching | jinaai/jina-embeddings-v5-text-small-text-matching |
| Classification | jinaai/jina-embeddings-v5-text-small-classification |
| Clustering | jinaai/jina-embeddings-v5-text-small-clustering |
Prerequisites
- Hardware: any single GPU with ≥ 2 GB VRAM (T4 / L4 / A10 / A100 / H100 / H200 / B200 / MI300X all fine — bf16 weights are ~1.4 GB).
- vLLM: requires a build that includes PR #39575. Use the nightly wheel until the next stable release ships:

uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly \
    --index-strategy unsafe-best-match
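To confirm the installed wheel is recent enough, print the version (the exact nightly version string will vary):

python -c "import vllm; print(vllm.__version__)"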
Launch command
Use the launch command above (it points at the chosen task's pre-merged repo). The full form looks like:
vllm serve jinaai/jina-embeddings-v5-text-small-retrieval \
--trust-remote-code \
--runner pooling \
--host 0.0.0.0 --port 8000
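Once the server is up, the OpenAI-compatible /v1/models endpoint confirms which variant is being served:

curl -s http://localhost:8000/v1/models | python3 -m json.tool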
Swap the model id for -text-matching, -classification, or -clustering to
serve the corresponding task — or just toggle the Variant pill above.
Alternative: base repo + --hf-overrides
If you'd rather download a single checkpoint and switch tasks at startup, serve
the base jinaai/jina-embeddings-v5-text-small repo and pass the task via
--hf-overrides:
vllm serve jinaai/jina-embeddings-v5-text-small \
--trust-remote-code \
--runner pooling \
--hf-overrides '{"jina_task": "retrieval"}'
Allowed values: retrieval, text-matching, classification, clustering.
This loads the base weights and merges the requested adapter at startup.
Client usage
Embeddings (/v1/embeddings)
curl -s http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "jinaai/jina-embeddings-v5-text-small-retrieval",
"input": ["Query: What is climate change?"]
}' | python3 -m json.tool
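The response follows the OpenAI embeddings schema. Abridged (the values and token counts below are illustrative, and the 1024-dim vector is truncated):

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "index": 0,
            "embedding": [0.0132, -0.0271, ...]
        }
    ],
    "model": "jinaai/jina-embeddings-v5-text-small-retrieval",
    "usage": {"prompt_tokens": 8, "total_tokens": 8}
}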
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
model="jinaai/jina-embeddings-v5-text-small-retrieval",
input=[
"Climate change has led to rising sea levels.",
"Overview of climate change impacts on coastal cities.",
],
)
for d in resp.data:
print(d.index, len(d.embedding))
Multilingual text-matching example
Switch the Variant pill to text-matching (or serve the
-text-matching repo) and embed semantically equivalent strings across
languages:
# The same sentence, "A beautiful sunset over the beach", in five languages:
texts = [
    "غروب جميل على الشاطئ",                    # Arabic
    "海滩上美丽的日落",                          # Chinese
    "A beautiful sunset over the beach",        # English
    "Un beau coucher de soleil sur la plage",   # French
    "浜辺に沈む美しい夕日",                      # Japanese
]
resp = client.embeddings.create(
model="jinaai/jina-embeddings-v5-text-small-text-matching",
input=texts,
)
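To see the cross-lingual alignment, compute pairwise cosine similarities over the returned vectors. A minimal numpy sketch (the renormalization is defensive, in case the vectors are not unit length):

import numpy as np

# Stack the returned vectors and L2-normalize so a dot product is cosine.
emb = np.array([d.embedding for d in resp.data])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# All five sentences mean the same thing, so every off-diagonal
# language pair should score high.
print(np.round(emb @ emb.T, 3))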
Configuration tips
- Pick the task that matches your workload — retrieval prompts (query vs. document) are baked into the retrieval adapter, so the wrong task measurably degrades recall.
- Matryoshka truncation: embeddings stay useful when truncated to 32 / 64 / 128 / 256 / 512 / 768 dims — keep the prefix and renormalize (see the sketch after this list).
- Throughput: with TP=1 on a single small GPU, the bottleneck is usually tokenization — batch your inputs (input: [...] with up to a few hundred short docs per request).
- bf16 vs fp16: the README recommends bf16 on modern GPUs; PR #39575's test used fp16. Either dtype works; bf16 is more numerically stable on Hopper / Blackwell / MI300X+.
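A minimal sketch of the Matryoshka tip above, assuming a full 1024-dim vector from the API (dim=256 is just an illustrative choice):

import numpy as np

def truncate_embedding(vec, dim=256):
    """Keep the leading `dim` components and renormalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Reusing `resp` from the client example above: shrink before indexing.
small = truncate_embedding(resp.data[0].embedding, dim=256)
print(small.shape)  # (256,)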