Automated candidate labeling plus query-only PEFT lets you adapt search per tenant without re-indexing

January 8, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The methods are practical and validated on three datasets. The main limitations are dataset size and reliance on LLM judgments, but the results show clear cost-quality tradeoffs for multi-tenant deployment.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

Why It Matters For Business

You can adapt search models per tenant without re-indexing documents, cutting compute, operational risk, and per-tenant storage by orders of magnitude while keeping retrieval quality close to joint fine-tuning.

Who Should Care

Summary TLDR

This paper builds DevRev-Search, an automatically labeled enterprise retrieval benchmark, and shows that you can adapt search per tenant by fine-tuning only the query encoder. The labels are produced by taking the union of candidates from seven retrievers and filtering them with an LLM judge. Freezing the document index (no re-indexing) and using parameter-efficient tuning (LoRA, small FFNs, or a few unfrozen layers) matches or closely approaches full bi-encoder tuning on DevRev-Search, SciFact, and FiQA-2018. Key concrete wins: a high-coverage candidate pool (the retriever union yields up to 420 candidates per query), LoRA at practical ranks (32–64) gives near-full performance while using ~1–3% of model params, and asynchronous ANCE hard‑

Problem Statement

Enterprise search has two bottlenecks: 1) tenants have lots of unlabeled proprietary 'dark data' so supervised adaptation is hard; 2) jointly fine-tuning query+document encoders forces re-embedding and re-indexing the whole corpus, which is computationally and operationally prohibitive for thousands of tenants.

Main Contribution

DevRev-Search: a public, automatically labeled enterprise passage retrieval benchmark built from production support queries and segmented docs.

A fully automated pipeline: union of seven diverse retrievers to maximize candidate coverage, followed by an LLM-as-a-judge filter to improve precision.
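
The union-then-judge pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Retriever` and `Judge` interfaces, the toy retrievers, and the `top_k` value are all assumptions, and `toy_judge` stands in for a real LLM call.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical interfaces: a retriever maps a query to ranked chunk ids,
# and a judge stands in for an LLM call that returns True for relevant pairs.
Retriever = Callable[[str], List[str]]
Judge = Callable[[str, str], bool]


def union_candidates(query: str, retrievers: Iterable[Retriever], top_k: int) -> List[str]:
    """Merge each retriever's top_k results, dropping duplicates in first-seen order."""
    seen: Set[str] = set()
    pool: List[str] = []
    for retrieve in retrievers:
        for chunk_id in retrieve(query)[:top_k]:
            if chunk_id not in seen:
                seen.add(chunk_id)
                pool.append(chunk_id)
    return pool


def label_query(query: str, retrievers: Iterable[Retriever], judge: Judge, top_k: int) -> List[str]:
    """Keep only the pooled candidates that the judge marks as relevant."""
    return [c for c in union_candidates(query, retrievers, top_k) if judge(query, c)]


# Toy stand-ins so the sketch runs without any external service.
def lexical_retriever(query: str) -> List[str]:
    return ["d1", "d2", "d3"]


def dense_retriever(query: str) -> List[str]:
    return ["d2", "d4"]


def toy_judge(query: str, chunk_id: str) -> bool:
    return chunk_id in {"d2", "d4"}


pool = union_candidates("reset password", [lexical_retriever, dense_retriever], top_k=2)
golden = label_query("reset password", [lexical_retriever, dense_retriever], toy_judge, top_k=2)
print(pool, golden)
```

The union step trades precision for coverage, and the judge step restores precision, so the two stages are kept as separate functions here to make each easy to swap or audit.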

Key Findings

Automated benchmark with dense relevance per query

Numbers: 291 train / 92 test queries; mean 13.61 golden chunks/query (σ=21.41)

Practical Use: You get a small, high-density training set for enterprise-style queries to tune retrievers without manual labeling.

Evidence Ref: Appendix A.1

Query-only fine-tuning matches joint (query+doc) tuning in many cases

Numbers: qwen3-4b on SciFact Recall@10: Q=0.953 vs QD=0.949

Practical Use: Freeze document embeddings to avoid re-indexing and still get nearly the same retrieval quality on evaluated datasets.

Evidence Ref: Table 1
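
To make the "no re-indexing" point concrete, here is a small sketch of query-only adaptation over a frozen index. It is an illustration under stated assumptions: the residual linear adapter, the two-dimensional vectors, and the names `FROZEN_INDEX` and `TENANT_DELTA` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def adapt_query(q: Vector, delta: List[List[float]]) -> List[float]:
    """Residual linear adapter: q' = q + delta @ q. Only delta is stored per tenant."""
    return [qi + dot(row, q) for qi, row in zip(q, delta)]


def search(q: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Rank documents by dot product against the stored document vectors."""
    ranked = sorted(index, key=lambda doc_id: dot(q, index[doc_id]), reverse=True)
    return ranked[:k]


# Document vectors are computed once and never touched again.
FROZEN_INDEX: Dict[str, Vector] = {"doc_a": (1.0, 0.0), "doc_b": (0.0, 1.0)}

# Hypothetical tenant-specific adapter weights (would come from fine-tuning).
TENANT_DELTA = [[0.0, 0.0], [2.0, 0.0]]

query = (1.0, 0.2)
base_top = search(query, FROZEN_INDEX, k=1)
tenant_top = search(adapt_query(query, TENANT_DELTA), FROZEN_INDEX, k=1)
print(base_top, tenant_top)
```

The ranking changes per tenant while `FROZEN_INDEX` stays byte-for-byte identical, which is the property that removes the re-embedding step.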

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
recall@10 (DevRev-Search) | Base 0.256 → Q 0.296 → QD 0.314 | Base | Q vs QD: -0.018 | DevRev-Search | Query-only (Q) substantially improves vs base and is competitive with query+doc (QD) | Table 1
recall@10 (SciFact) | QD 0.949 → Q 0.953 (qwen3-4b) | QD | Q vs QD: +0.004 | SciFact | Query-only marginally outperforms joint tuning for qwen3-4b on SciFact | Table 1
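
For readers who want to reproduce the metric on their own held-out queries, a minimal recall@k sketch follows. It assumes the common definition (relevant items found in the top k, divided by all relevant items) and unique ids in the ranking; the paper's exact evaluation script is not shown here. The last two lines recompute the Q vs QD deltas from the values in the table.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant ids that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


# Toy example: one of three relevant chunks is retrieved in the top 3.
example = recall_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}, k=3)

# Deltas recomputed from the table values (Q minus QD).
devrev_delta = round(0.296 - 0.314, 3)
scifact_delta = round(0.953 - 0.949, 3)
print(example, devrev_delta, scifact_delta)
```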

What To Try In 7 Days

Run union of 3–7 diverse retrievers on recent query logs, then filter top candidates with an LLM-as-judge prompt.

Fine-tune only the query encoder using LoRA (r=32) or a small FFN head; evaluate recall@10 and ndcg@10 on held-out queries.

Add asynchronous ANCE hard-negative updates (every ~200 steps) to stabilize InfoNCE training.
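
The training step in the last item can be sketched as below. This is a simplified illustration, not the paper's training loop: the temperature value and function names are assumptions, and the refresh here is a plain step counter, whereas an asynchronous ANCE setup would re-mine negatives in a background worker using a recent checkpoint.

```python
import math
from typing import List, Sequence, Set


def info_nce(pos_score: float, neg_scores: Sequence[float], temperature: float = 0.05) -> float:
    """InfoNCE loss for one query: -log softmax of the positive over positive + negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


def mine_hard_negatives(ranked_ids: Sequence[str], positives: Set[str], n: int) -> List[str]:
    """Hard negatives: the highest-ranked results that are not labeled relevant."""
    return [doc_id for doc_id in ranked_ids if doc_id not in positives][:n]


def should_refresh(step: int, every: int = 200) -> bool:
    """True on the steps where the hard-negative pool should be re-mined."""
    return step > 0 and step % every == 0


easy = info_nce(0.9, [0.1])   # negative is far from the query
hard = info_nce(0.9, [0.8])   # negative is almost as close as the positive
negatives = mine_hard_negatives(["d3", "d1", "d7", "d2"], {"d1"}, n=2)
refresh_steps = [s for s in range(1, 601) if should_refresh(s)]
print(easy, hard, negatives, refresh_steps)
```

Hard negatives produce a larger loss than easy ones, which is why stale negatives stop providing a useful training signal once the query encoder has learned to separate them.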

Agent Features

Tool Use
LLM-as-a-Judge, LoRA, ANCE, BM25, LangChain
Frameworks
InfoNCE, AdamW, HNSW

Optimization Features

Infra Optimization
reduced storage and update cost for per-tenant models
Model Optimization
LoRA, linear embedding projection, small FFN head on embeddings
System Optimization
freeze document encoder to avoid re-indexing, small per-tenant parameter checkpoints (0.4–3% of model)
Training Optimization
asynchronous ANCE hard-negative mining, InfoNCE with mined hard negatives
Inference Optimization
index-preserving query-only adaptation (no document re-embedding)
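
Why LoRA checkpoints stay small can be seen from the standard low-rank update, y = W x + (alpha / r) · B (A x), where W is frozen and only A and B are trained. The sketch below is illustrative only: the matrix sizes are hypothetical, and the resulting fraction is for a single weight matrix, so it does not reproduce the per-model percentages quoted above.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Sequence[float], alpha: float) -> List[float]:
    """y = W x + (alpha / r) * B (A x); A is r x d_in, B is d_out x r, W is frozen."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_fraction(d_in: int, d_out: int, r: int) -> float:
    """Trainable LoRA parameters relative to the frozen d_out x d_in matrix."""
    return r * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]                      # rank r = 1
x = [2.0, 4.0]

y_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=1.0)   # B starts at zero
y_tuned = lora_forward(W, A, [[1.0], [0.0]], x, alpha=1.0)  # hypothetical trained B

fraction = lora_param_fraction(d_in=4096, d_out=4096, r=32)  # hypothetical sizes
print(y_init, y_tuned, fraction)
```

With B initialized to zero the adapted encoder starts out identical to the base encoder, so a tenant without a checkpoint can fall back to the shared model at no cost.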

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

LLM-as-judge can introduce bias and errors despite 10% human validation.

DevRev-Search is small (291 train queries), which may limit generalization to other enterprise domains.

When Not To Use

When you must update document encoder representations (you will need to re-index).

When you have abundant, high-quality human labels and can afford joint encoder updates.

Failure Modes

Representation collapse if hard negatives are not refreshed; ANCE was required to stabilize training.

Overfitting at very high LoRA ranks for some models (peaked profiles reported).

Core Entities

Models

gemini-embedding-001, text-embedding-3-large, embed-english-v3, Qwen3-Embedding-8B, GTE-Qwen2-7B-Instruct, SFR-Embedding-Mistral, BM25, arctic-l-v2, snowflake-arctic-embed-l-v2.0, qwen3-4b, LoRA

Metrics

recall@10, ndcg@10, recall@420

Datasets

DevRev-Search, SciFact, FiQA-2018, MS-MARCO

Benchmarks

DevRev-Search