Automated candidate labeling plus query-only PEFT lets you adapt search per tenant without re-indexing

January 8, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

Links

Abstract / PDF

Why It Matters For Business

You can adapt search models per tenant without re-indexing documents, cutting compute, operational risk, and per-tenant storage by orders of magnitude while keeping retrieval quality close to joint fine-tuning.

Summary TLDR

This paper builds DevRev-Search, an automatically labeled enterprise retrieval benchmark, and shows that you can adapt search per tenant by fine-tuning only the query encoder. Labels are produced by taking the union of seven retrievers' candidates and filtering them with an LLM judge. Freezing the document index (no re-indexing) and applying parameter-efficient tuning (LoRA, small FFNs, or a few unfrozen layers) matches or closely approaches full bi-encoder tuning on DevRev-Search, SciFact, and FiQA-2018. Key concrete wins: a high-coverage candidate pool (union recall measured over up to 420 candidates per query), LoRA at practical ranks (32–64) giving near-full performance with ~1–3% of model parameters trainable, and asynchronous ANCE hard‑

Problem Statement

Enterprise search has two bottlenecks: 1) tenants have lots of unlabeled proprietary 'dark data' so supervised adaptation is hard; 2) jointly fine-tuning query+document encoders forces re-embedding and re-indexing the whole corpus, which is computationally and operationally prohibitive for thousands of tenants.

Main Contribution

DevRev-Search: a public, automatically labeled enterprise passage retrieval benchmark built from production support queries and segmented docs.

A fully automated pipeline: union of seven diverse retrievers to maximize candidate coverage, followed by an LLM-as-a-judge filter to improve precision.
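The union-then-filter idea can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the `Retriever` and `Judge` signatures, the stub functions, and the default `k` are all assumptions made for the example, and a real judge would wrap an LLM call.

```python
from typing import Callable, Iterable, List

Retriever = Callable[[str, int], List[str]]  # (query, k) -> ranked chunk ids
Judge = Callable[[str, str], bool]           # (query, chunk_id) -> is it relevant?


def union_candidates(query: str, retrievers: Iterable[Retriever], k: int) -> List[str]:
    """Union the top-k results of every retriever, keeping first-seen order."""
    seen = {}
    for retrieve in retrievers:
        for chunk_id in retrieve(query, k):
            seen.setdefault(chunk_id, None)
    return list(seen)


def label_query(query: str, retrievers: Iterable[Retriever], judge: Judge, k: int = 10) -> List[str]:
    """Keep only the pooled candidates that the judge accepts."""
    return [c for c in union_candidates(query, retrievers, k) if judge(query, c)]


# Hypothetical stand-ins; real retrievers and a real LLM judge would replace these.
def lexical_stub(query: str, k: int) -> List[str]:
    return ["c1", "c2", "c3"][:k]


def dense_stub(query: str, k: int) -> List[str]:
    return ["c3", "c4"][:k]


def judge_stub(query: str, chunk_id: str) -> bool:
    return chunk_id in {"c2", "c4"}


golden = label_query("reset password", [lexical_stub, dense_stub], judge_stub, k=3)
print(golden)  # ['c2', 'c4']
```

The union step trades precision for coverage; the judge is what restores precision, so its error rate bounds label quality.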

Index-preserving Query-Only Adaptation: freeze the document encoder and index and fine-tune only the query encoder, combined with PEFT (LoRA, linear/FFN heads, partial unfreezing) for cheap per-tenant adaptation.
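To make the index-preserving property concrete, here is a toy sketch in which document vectors are never touched and only a per-tenant query-side head changes the ranking. The residual linear form `q + Wq`, the class name `QueryAdapter`, and the hand-set weight standing in for a learned update are assumptions for illustration; the paper's heads may be parameterized differently.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


class QueryAdapter:
    """Hypothetical residual linear head: q' = normalize(q + W q); W starts at zero."""

    def __init__(self, dim: int) -> None:
        self.W = [[0.0] * dim for _ in range(dim)]

    def __call__(self, q: Vector) -> Vector:
        delta = [dot(row, q) for row in self.W]
        return normalize([qi + di for qi, di in zip(q, delta)])


def search(index: Dict[str, Vector], q: Vector, adapter: QueryAdapter, k: int) -> List[str]:
    """Score the frozen document vectors against the adapted query."""
    q_adapted = adapter(q)
    ranked = sorted(index.items(), key=lambda item: dot(item[1], q_adapted), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


index = {"a": [1.0, 0.0], "b": [0.0, 1.0]}  # frozen document vectors, never re-embedded
query = [0.8, 0.6]
adapter = QueryAdapter(dim=2)
print(search(index, query, adapter, k=1))  # ['a']  (zero-initialized head = base model)

adapter.W[1][0] = 1.0  # stand-in for a learned per-tenant update
print(search(index, query, adapter, k=1))  # ['b']
```

Starting the head at zero means an untrained tenant behaves exactly like the base model, which makes rollout and rollback simple.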

Key Findings

Automated benchmark with dense relevance per query

Numbers: 291 train / 92 test queries; mean 13.61 golden chunks/query (σ=21.41)

Query-only fine-tuning matches joint (query+doc) tuning in many cases

Numbers: qwen3-4b on SciFact Recall@10: Q (query-only) = 0.953 vs QD (query+doc) = 0.949

LoRA and small heads match or exceed full query-only finetuning

Numbers: arctic-l-v2 LoRA (r=32) Recall@10 0.309 vs Full 0.296; trainable params ≈ 2.5%
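For intuition on why LoRA's trainable share is small: a rank-r adapter on a d_out × d_in weight matrix adds r·(d_in + d_out) trainable parameters. The sketch below computes that count for a single matrix; the 1024-wide layer is a hypothetical size, and the whole-model percentage quoted above depends on which layers are adapted, so this does not attempt to reproduce it.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable share relative to the one dense matrix being adapted."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical 1024 x 1024 projection matrix.
for rank in (8, 32, 64):
    count = lora_trainable_params(1024, 1024, rank)
    share = lora_fraction(1024, 1024, rank)
    print(f"rank={rank}: {count} trainable params, {share:.2%} of the matrix")
```

Because adapters are usually attached to only some matrices and the embeddings stay frozen, the model-wide share ends up lower than the per-matrix share.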

Multi-retriever fusion increases candidate coverage substantially

Numbers: Best single retriever Recall@420: 82.48; leave-one-out union recall: 93.25–97.13
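A leave-one-out union recall of this kind can be computed as sketched below: pool the top-k of every retriever except one and measure how much of the golden set the pool covers. The retriever names and the toy runs are invented for illustration.

```python
from typing import Dict, List, Optional, Set


def recall(pool: Set[str], golden: Set[str]) -> float:
    return len(pool & golden) / len(golden) if golden else 0.0


def union_recall(runs: Dict[str, List[str]], golden: Set[str], k: int,
                 leave_out: Optional[str] = None) -> float:
    """Recall of the pooled top-k lists, optionally skipping one retriever."""
    pool: Set[str] = set()
    for name, ranked in runs.items():
        if name != leave_out:
            pool.update(ranked[:k])
    return recall(pool, golden)


# Toy runs for one query (hypothetical retriever names and chunk ids).
runs = {
    "lexical": ["c1", "c2"],
    "dense_a": ["c2", "c3"],
    "dense_b": ["c4"],
}
golden = {"c1", "c3", "c4", "c5"}

print(union_recall(runs, golden, k=2))                       # 0.75
print(union_recall(runs, golden, k=2, leave_out="lexical"))  # 0.5
print(recall(set(runs["lexical"][:2]), golden))              # 0.25
```

A large drop when one retriever is left out indicates that it contributes candidates the others miss, which is the argument for keeping the pool diverse.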

Representation collapse without dynamic negatives is a training risk

Numbers: Async ANCE updates every 200 steps improved stability and metrics (reported in Table 6 / Fig. 2)

Results

recall@10 (DevRev-Search)

Value: Base 0.256 → Q 0.296 → QD 0.314

Baseline: Base

recall@10 (SciFact)

Value: QD 0.949 → Q 0.953 (qwen3-4b)

Baseline: QD

recall@10 (DevRev-Search) — PEFT vs Full

Value: LoRA (r=32) 0.309 vs Full 0.296

Baseline: Full query-only

Who Should Care

What To Try In 7 Days

Run union of 3–7 diverse retrievers on recent query logs, then filter top candidates with an LLM-as-judge prompt.

Fine-tune only the query encoder using LoRA (r=32) or a small FFN head; evaluate recall@10 and ndcg@10 on held-out queries.
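For the evaluation step, a minimal sketch of both metrics with binary relevance follows. It assumes each ranked list contains no duplicate ids and that recall is normalized by the full golden-set size; some benchmarks cap the denominator at k instead, so check which convention your baseline uses.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], golden: Set[str], k: int = 10) -> float:
    """Share of golden chunks that appear in the top k."""
    if not golden:
        return 0.0
    return len(set(ranked[:k]) & golden) / len(golden)


def ndcg_at_k(ranked: List[str], golden: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, doc_id in enumerate(ranked[:k]) if doc_id in golden)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(golden), k)))
    return dcg / ideal if ideal else 0.0


ranked = ["a", "x", "b"]
golden = {"a", "b"}
print(recall_at_k(ranked, golden))          # 1.0
print(round(ndcg_at_k(ranked, golden), 3))  # 0.92
```

Recall rewards finding the golden chunks at all, while nDCG also penalizes placing them lower in the list, so tracking both shows whether tuning improves coverage, ordering, or both.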

Add asynchronous ANCE hard-negative updates (every ~200 steps) to stabilize InfoNCE training.
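The sketch below shows the two pieces involved: an InfoNCE loss over one positive and several negatives, and a miner that picks the highest-scoring non-golden documents under the current query representation. It is a simplification under stated assumptions: the temperature value, the function names, and the toy vectors are illustrative, and the refresh here is a plain every-200-steps schedule rather than a truly asynchronous background job.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def info_nce(pos_score: float, neg_scores: List[float], temperature: float = 0.05) -> float:
    """-log softmax of the positive among [positive] + negatives (log-sum-exp for stability)."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def mine_hard_negatives(query_vec: Vector, index: Dict[str, Vector],
                        golden: Set[str], n: int) -> List[str]:
    """Top-n non-golden documents by dot product with the current query vector."""
    scored = [(sum(a * b for a, b in zip(query_vec, vec)), doc_id)
              for doc_id, vec in index.items() if doc_id not in golden]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


REFRESH_EVERY = 200  # refresh interval taken from the text above

index = {"pos": [1.0, 0.0], "hard": [0.9, 0.1], "easy": [0.0, 1.0]}
print(mine_hard_negatives([1.0, 0.0], index, golden={"pos"}, n=1))  # ['hard']

# A hard negative yields a larger loss (stronger training signal) than an easy one.
print(info_nce(1.0, [0.9]) > info_nce(1.0, [0.0]))  # True

refresh_steps = [step for step in range(1, 1001) if step % REFRESH_EVERY == 0]
print(refresh_steps)  # [200, 400, 600, 800, 1000]
```

Re-mining matters because negatives that were hard for an earlier checkpoint become easy as the query encoder moves, at which point the loss stops providing useful gradient.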

Agent Features

Tool Use

  • LLM-as-a-Judge
  • LoRA
  • ANCE
  • BM25
  • LangChain

Frameworks

  • InfoNCE
  • AdamW
  • HNSW

Optimization Features

Infra Optimization

  • reduced storage and update cost for per-tenant models

Model Optimization

  • LoRA
  • linear embedding projection
  • small FFN head on embeddings

System Optimization

  • freeze document encoder to avoid re-indexing
  • small per-tenant parameter checkpoints (0.4–3% of model)
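A back-of-the-envelope calculation shows what the 0.4–3% checkpoint range means at fleet scale. The model size, precision, and tenant count below are hypothetical inputs chosen for round numbers, and the estimate ignores optimizer state and the single shared copy of the base model.

```python
def fleet_storage_gb(model_params: int, bytes_per_param: int, tenants: int,
                     trainable_fraction: float = 1.0) -> float:
    """Total per-tenant checkpoint storage in decimal gigabytes."""
    per_tenant_bytes = model_params * bytes_per_param * trainable_fraction
    return per_tenant_bytes * tenants / 1e9


PARAMS = 500_000_000  # hypothetical encoder size
BYTES = 2             # assumes 16-bit weights
TENANTS = 1_000       # hypothetical fleet size

full = fleet_storage_gb(PARAMS, BYTES, TENANTS)
low = fleet_storage_gb(PARAMS, BYTES, TENANTS, trainable_fraction=0.004)
high = fleet_storage_gb(PARAMS, BYTES, TENANTS, trainable_fraction=0.03)

print(f"full copies per tenant: {full:.1f} GB")
print(f"adapter checkpoints:    {low:.1f} to {high:.1f} GB")
```

The larger saving is usually not the checkpoints themselves but the document embeddings, which stay shared and untouched because the document encoder is frozen.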

Training Optimization

  • asynchronous ANCE hard-negative mining
  • InfoNCE with mined hard negatives

Inference Optimization

  • index-preserving query-only adaptation (no document re-embedding)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLM-as-judge can introduce bias and errors despite 10% human validation.
  • DevRev-Search is small (291 train queries), which may limit generalization to other enterprise domains.
  • Relies on proprietary embedding/LLM models for candidate generation and filtering.
  • Paper evaluates passage retrieval only; full reranking or cross-encoder steps are not covered.

When Not To Use

  • When you must update document encoder representations (you will need to re-index).
  • When you have abundant, high-quality human labels and can afford joint encoder updates.
  • For tasks that require cross-encoder interaction or fine-grained passage reranking beyond bi-encoder retrieval.

Failure Modes

  • Representation collapse if hard negatives are not refreshed; ANCE was required to stabilize training.
  • Overfitting at very high LoRA ranks for some models (peaked profiles reported).
  • LLM judge may accept superficially similar but non-answering chunks or reject partial-but-useful chunks.

Core Entities

Models

  • gemini-embedding-001
  • text-embedding-3-large
  • embed-english-v3
  • Qwen3-Embedding-8B
  • GTE-Qwen2-7B-Instruct
  • SFR-Embedding-Mistral
  • BM25
  • arctic-l-v2
  • snowflake-arctic-embed-l-v2.0
  • qwen3-4b
  • LoRA

Metrics

  • recall@10
  • ndcg@10
  • recall@420

Datasets

  • DevRev-Search
  • SciFact
  • FiQA-2018
  • MS-MARCO

Benchmarks

  • DevRev-Search