High-quality, LLM-distilled training data + Qwen2-0.5B yields top multilingual embeddings under 0.5B params

January 2, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

3

Authors

Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, Min Zhang

Why It Matters For Business

You can get strong multilingual retrieval and RAG performance with a compact 0.5B model by improving training data quality and using LLM-distilled synthetic examples, lowering cost vs larger models while keeping competitive accuracy.

Summary TLDR

KaLM-Embedding adapts a 0.5B decoder-only model (Qwen2-0.5B) into a multilingual embedding model. The paper's core claim is that cleaner, more diverse training data and a few data-focused techniques beat architectural complexity and scaling for small models. Key ingredients: 550k persona-based synthetic examples distilled from an LLM, ranking-consistency filtering (top-k=50), task instruction prefixes, and Matryoshka multi-dimension training. On MTEB, the released KaLM-embedding-mini-instruct reaches an average of about 62.3, with per-language scores of zh 64.13, en 64.94, fr 63.08, and pl 57.05, a new high among models under 1B parameters. Code and model weights are released on Hugging Face and GitHub.

Problem Statement

Existing general embedding models focus on scale and architecture but often ignore training-data quality. False negatives and low domain diversity reduce retrieval and downstream performance. The paper asks: can better curated and LLM-distilled training data plus targeted filtering produce a stronger compact embedding model?

Main Contribution

KaLM-Embedding: a multilingual embedding model built from Qwen2-0.5B and trained with data-first methods

Persona-based synthetic data (550k examples) to increase domain and instruction diversity

Ranking consistency filtering (top-k) and many small fine-tuning datasets (70+) to reduce noise and false negatives

Instruction-prefix training and Matryoshka Representation Learning for multi-dimension embeddings

Open-source release: model and code available on Hugging Face and GitHub

Key Findings

KaLM-embedding-mini-instruct is state-of-the-art for multilingual embeddings under 1B parameters on MTEB.

Numbers: MTEB avg 62.3; zh 64.13; en 64.94; fr 63.08; pl 57.05

LLM-distilled persona-based synthetic data was sizable and central to data diversity.

Numbers: 550k synthetic examples from Qwen2-72B-Instruct

Instruction prefixes materially improve performance.

Numbers: Ablation: w/o instructions zh −2.56, en −3.81 MTEB points

Ranking consistency filtering had only a small and mixed impact on the final model.

Numbers: Ablation: zh +0.12, en −0.75 MTEB points when removed

Matryoshka Representation Learning helps low‑dim embeddings but has minimal effect on full-dimension performance as configured.

Numbers: Ablation: w/o MRL zh −0.06, en +0.06 MTEB points
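The Matryoshka finding above concerns low-dimension use. A minimal sketch of how a Matryoshka-style embedding is typically shortened at inference time: keep the leading components and rescale to unit length. The function names and the toy 8-dimensional vectors are made up for illustration and are not taken from the paper or the released code.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


# Toy vectors standing in for full-size model outputs (values are invented).
full_a = [0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
full_b = [0.45, 0.42, 0.25, 0.22, -0.1, 0.04, -0.02, 0.0]

small_a = truncate_and_normalize(full_a, 4)
small_b = truncate_and_normalize(full_b, 4)

print("full-dim cosine:", cosine(full_a, full_b))
print("4-dim cosine:   ", cosine(small_a, small_b))
```

Comparing retrieval quality at several truncation sizes on your own data is one way to decide how small a vector you can afford to store.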

Results

MTEB avg (multilingual, model <1B)

Value: 62.3

Baseline: competing <1B models listed (e.g., gte 60.53, bge-m3 59.95)

MTEB (Chinese)

Value: 64.13

Baseline: gte-multilingual-base 62.72

MTEB (English)

Value: 64.94

Baseline: jina-embeddings-v3 65.51; among larger models, e5-mistral-7b 66.63

Who Should Care

What To Try In 7 Days

Test KaLM-embedding-mini-instruct from Hugging Face in your RAG stack for multilingual retrieval.

Create ~10–100k persona-conditioned synthetic pairs from an LLM and add them to a small fine-tune set.

Add clear instruction prefixes to your embedding queries and re-evaluate retrieval accuracy.
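The last step above can be sketched as a small before/after check. The prefix template (`Instruct: ...` followed by `Query: ...`) is an assumption for illustration, so check the model card for the exact format the released model expects. The rankings are hand-written stand-ins for two retrieval runs, not measured results.

```python
def with_instruction(task, query):
    """Prepend a task instruction to a query (template is an assumed format)."""
    return f"Instruct: {task}\nQuery: {query}"


def recall_at_k(rankings, relevant, k):
    """Mean over queries of (relevant docs found in the top k) / (relevant docs)."""
    scores = []
    for query_id, ranked_doc_ids in rankings.items():
        gold = relevant.get(query_id, set())
        if not gold:
            continue  # skip queries without labeled relevant documents
        hits = len(set(ranked_doc_ids[:k]) & gold)
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels and two hypothetical retrieval runs over the same queries.
relevant = {"q1": {"d1"}, "q2": {"d4"}}
run_a = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4", "d6"]}
run_b = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5", "d6"]}

print(with_instruction("Given a question, retrieve passages that answer it",
                       "What is Matryoshka Representation Learning?"))
for name, run in (("run_a", run_a), ("run_b", run_b)):
    print(name, "recall@1 =", recall_at_k(run, relevant, 1))
```

In practice, `run_a` and `run_b` would come from embedding the same labeled queries with and without the prefix and ranking your corpus by similarity.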

Optimization Features

Infra Optimization

  • Trained on Ascend 910B NPUs (6 nodes × 8 NPUs for pretrain; 3 nodes for fine-tune)

Training Optimization

  • Ranking consistency filtering (top-k=50)
  • Semi-homogeneous task batching (analyzed but not used in final model)
  • Matryoshka Representation Learning for multi-dim vectors
  • Instruction-prefix training
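One plausible reading of ranking consistency filtering with top-k is sketched below: a training pair is kept only if its labeled positive document ranks within the top k results for its query under some scoring model. This is an interpretation of the summary, not the authors' implementation. The sample layout, the function names, and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def rank_of_positive(query_vec, positive_id, corpus):
    """1-based rank of the positive doc when the corpus is sorted by similarity."""
    ranked = sorted(corpus,
                    key=lambda doc_id: cosine(query_vec, corpus[doc_id]),
                    reverse=True)
    return ranked.index(positive_id) + 1


def filter_by_ranking_consistency(samples, corpus, top_k=50):
    """Keep samples whose labeled positive appears in the top_k ranked docs."""
    return [s for s in samples
            if rank_of_positive(s["query_vec"], s["positive_id"], corpus) <= top_k]


# Toy corpus and samples; a real run would use embeddings from a scoring model.
corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
samples = [
    {"query_vec": [1.0, 0.1], "positive_id": "a"},  # positive ranks 1st
    {"query_vec": [1.0, 0.0], "positive_id": "b"},  # positive ranks 3rd
]

kept = filter_by_ranking_consistency(samples, corpus, top_k=2)
print([s["positive_id"] for s in kept])
```

A small `top_k` is used here only because the toy corpus has three documents; the summary reports top-k=50.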

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Weaker Polish performance (MTEB pl 57.05) due to low Polish presence in training data
  • Semi-homogeneous batching was analyzed but not used in final release
  • Single-vector long-text encoding remains problematic; authors recommend multi-vector strategies
  • Pre-training and fine-tuning used only 1 epoch each, so effects of longer training are unexplored

When Not To Use

  • When you need best-in-class large-model embeddings (models >1B may still beat KaLM on some English STS and summarization tasks)
  • When your use-case requires single-vector long-text embedding for very long documents
  • If your target language has little or no coverage in the fine-tune/synthetic data (e.g., Polish)

Failure Modes

  • False negatives from in-batch or hard negatives cause degraded training, especially with large homogeneous batches
  • Overreliance on synthetic instructions might bias behavior if synthetic data diverges from real queries
  • Merged models or naive parameter averaging across task types can fail badly (authors observed unusable merged models)
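To illustrate the false-negative failure mode, here is a sketch using a generic InfoNCE-style contrastive loss. The summary does not name the training objective, so the loss form, the temperature of 0.05, and the similarity scores are all assumptions chosen for illustration.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive over all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Clean batch: both negatives are clearly less similar than the positive.
clean = info_nce(0.8, [0.1, 0.2])

# Noisy batch: one "negative" is really a relevant document (a false negative),
# so it scores as high as the positive and the loss penalizes a correct match.
noisy = info_nce(0.8, [0.1, 0.8])

print("loss without false negative:", clean)
print("loss with false negative:   ", noisy)
```

When a false negative ties the positive, the loss cannot fall below log 2 however good the embeddings are, which is why filtering or masking such pairs matters.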

Core Entities

Models

  • KaLM-embedding-mini-instruct (Qwen2-0.5B adapted)
  • Qwen2-0.5B
  • multilingual-e5-large
  • bge-m3
  • gte-multilingual-base
  • paraphrase-multilingual-mpnet-base-v2
  • jina-embeddings-v3
  • Cohere-embed-multilingual-v3.0

Metrics

  • MTEB avg
  • MTEB (zh)
  • MTEB (en)
  • MTEB (fr)
  • MTEB (pl)

Datasets

  • MTEB (Massive Text Embedding Benchmark)
  • MSMARCO
  • SQuAD 2.0
  • Natural Questions
  • ArXiv QA
  • Wikipedia
  • CC-News

Benchmarks

  • MTEB
  • C-MTEB (Chinese MTEB subset)
  • PL-MTEB (Polish)