Automated candidate labeling plus query-only PEFT lets you adapt search per tenant without re-indexing

Overview

Decision Snapshot: Ready For Pilot

The methods are practical and validated on three datasets; main limitations are dataset size and reliance on LLM judgments, but results show clear cost-quality tradeoffs for multi-tenant deployment.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

Links

Abstract / PDF / Data

Why It Matters For Business

You can adapt search models per tenant without re-indexing documents, cutting compute, operational risk, and per-tenant storage by orders of magnitude while keeping retrieval quality close to joint fine-tuning.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper builds DevRev-Search, an automatically labeled enterprise retrieval benchmark, and shows you can adapt search per-tenant by fine-tuning only the query encoder. Data is produced by unioning seven retrievers and filtering candidates with an LLM judge. Freezing the document index (no re-indexing) and using parameter-efficient tuning (LoRA, small FFNs, or few unfrozen layers) matches or closely approaches full bi-encoder tuning on DevRev-Search, SciFact, and FiQA-2018. Key concrete wins: a high-coverage candidate pool (union recall per query up to 420 candidates), LoRA at practical ranks (32–64) gives near-full performance while using ~1–3% of model params, and asynchronous ANCE hard‑

Problem Statement

Enterprise search has two bottlenecks: 1) tenants have lots of unlabeled proprietary 'dark data' so supervised adaptation is hard; 2) jointly fine-tuning query+document encoders forces re-embedding and re-indexing the whole corpus, which is computationally and operationally prohibitive for thousands of tenants.

Main Contribution

DevRev-Search: a public, automatically labeled enterprise passage retrieval benchmark built from production support queries and segmented docs.

A fully automated pipeline: union of seven diverse retrievers to maximize candidate coverage, followed by an LLM-as-a-judge filter to improve precision.
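The two-stage labeling pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Retriever` and `Judge` callables, the toy retrievers, and the `toy_judge` function are all hypothetical stand-ins, and `k=60` is an assumption chosen only because seven retrievers at 60 results each would give a pool of at most 420 candidates.

```python
from typing import Callable, List, Set

# Hypothetical interfaces: a retriever maps (query, k) to ranked passage ids,
# and a judge maps (query, passage_id) to True when the passage is relevant.
Retriever = Callable[[str, int], List[str]]
Judge = Callable[[str, str], bool]


def union_candidates(query: str, retrievers: List[Retriever], k: int) -> List[str]:
    """Merge the top-k lists of all retrievers, keeping first-seen order."""
    seen: Set[str] = set()
    pool: List[str] = []
    for retrieve in retrievers:
        for pid in retrieve(query, k):
            if pid not in seen:
                seen.add(pid)
                pool.append(pid)
    return pool


def label_query(query: str, retrievers: List[Retriever], judge: Judge, k: int = 60) -> List[str]:
    """Stage 1 maximizes coverage (union); stage 2 restores precision (judge)."""
    pool = union_candidates(query, retrievers, k)
    return [pid for pid in pool if judge(query, pid)]


# Toy stand-ins for a lexical retriever, a dense retriever, and an LLM judge.
def lexical(query: str, k: int) -> List[str]:
    return ["p1", "p2", "p3"][:k]


def dense(query: str, k: int) -> List[str]:
    return ["p3", "p4"][:k]


def toy_judge(query: str, pid: str) -> bool:
    return pid in {"p2", "p4"}


golden = label_query("reset password", [lexical, dense], toy_judge, k=3)
print(golden)  # ['p2', 'p4']
```

In a real setup the judge would wrap an LLM call with a relevance prompt, and its decisions would be spot-checked against human labels.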

Key Findings

Automated benchmark with dense relevance per query

Numbers: 291 train / 92 test queries; mean 13.61 golden chunks/query (σ=21.41)

Practical Use: You get a small, high-density training set for enterprise-style queries to tune retrievers without manual labeling.

Evidence Ref: Appendix A.1

Query-only fine-tuning matches joint (query+doc) tuning in many cases

Numbers: qwen3-4b on SciFact Recall@10: Q=0.953 vs QD=0.949

Practical Use: Freeze document embeddings to avoid re-indexing and still get nearly the same retrieval quality on evaluated datasets.

Evidence Ref: Table 1
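To make the "freeze the documents, adapt only the query side" idea concrete, here is a toy sketch. It assumes the simplest of the adaptation options listed in this summary, a linear projection on the query embedding; the vectors, the `TENANT_ADAPTER` weights, and the function names are invented for illustration and do not come from the paper.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def apply_adapter(query_vec: Vector, weight: List[Vector]) -> Vector:
    """Per-tenant linear projection applied to the query embedding only."""
    return [dot(row, query_vec) for row in weight]


def search(query_vec: Vector, index: Dict[str, Vector], top_k: int) -> List[str]:
    """Rank the frozen document vectors by inner product with the query."""
    ranked = sorted(index, key=lambda pid: dot(query_vec, index[pid]), reverse=True)
    return ranked[:top_k]


# Frozen document index: embedded once, never rewritten during adaptation.
INDEX: Dict[str, Vector] = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]        # base model, no adaptation
TENANT_ADAPTER = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical learned weights

query = [0.9, 0.1]
base_hits = search(apply_adapter(query, IDENTITY), INDEX, top_k=1)
tenant_hits = search(apply_adapter(query, TENANT_ADAPTER), INDEX, top_k=1)
print(base_hits, tenant_hits)  # ['doc_a'] ['doc_b']
```

The ranking changes per tenant while `INDEX` stays byte-for-byte identical, which is why no re-embedding or re-indexing is needed.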

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
recall@10 (DevRev-Search)	Base 0.256 → Q 0.296 → QD 0.314	Base	Q vs QD: -0.018	DevRev-Search	Query-only (Q) substantially improves vs base and is competitive with query+doc (QD)	Table 1
recall@10 (SciFact)	QD 0.949 → Q 0.953 (qwen3-4b)	QD	Q vs QD: +0.004	SciFact	Query-only marginally outperforms joint tuning for qwen3-4b on SciFact	Table 1
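The deltas in the table can be re-derived with a few lines of arithmetic. The snippet below also includes a plain recall@k helper; it uses the common uncapped definition (hits divided by the number of relevant items), which is an assumption, since this summary does not say which variant the paper uses.

```python
from typing import Iterable, List


def recall_at_k(ranked: List[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k of the ranking."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(ranked[:k]) & relevant_set)
    return hits / len(relevant_set)


# DevRev-Search recall@10 values as reported in the table above.
base, q_only, q_and_doc = 0.256, 0.296, 0.314

q_gain = round(q_only - base, 3)           # gain from query-only tuning
qd_gain = round(q_and_doc - base, 3)       # gain from joint tuning
gap = round(q_only - q_and_doc, 3)         # Q vs QD delta
share = round(q_gain / qd_gain, 2)         # share of the joint gain kept by Q

print(q_gain, qd_gain, gap, share)  # 0.04 0.058 -0.018 0.69
```

On these reported numbers, query-only tuning keeps roughly 69% of the improvement that joint tuning achieves over the base model on DevRev-Search.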

What To Try In 7 Days

Run union of 3–7 diverse retrievers on recent query logs, then filter top candidates with an LLM-as-judge prompt.

Fine-tune only the query encoder using LoRA (r=32) or a small FFN head; evaluate recall@10 and ndcg@10 on held-out queries.

Add asynchronous ANCE hard-negative updates (every ~200 steps) to stabilize InfoNCE training.
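A rough sketch of the training-side pieces in the steps above: an InfoNCE loss for one query, hard-negative mining against the frozen index, and a refresh schedule. It is deliberately simplified and synchronous (ANCE proper refreshes negatives in a background process from a lagging checkpoint), and the temperature of 0.05, the toy vectors, and all function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Set

Vector = List[float]
REFRESH_EVERY = 200  # re-mine hard negatives roughly every 200 steps


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(query: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.05) -> float:
    """InfoNCE for one query: negative log-softmax of the positive's score."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def mine_hard_negatives(query: Vector, index: Dict[str, Vector],
                        positive_ids: Set[str], n: int) -> List[str]:
    """Top-scoring passages under the current query vector that are not positives."""
    ranked = sorted(index, key=lambda pid: dot(query, index[pid]), reverse=True)
    return [pid for pid in ranked if pid not in positive_ids][:n]


def should_refresh(step: int) -> bool:
    return step % REFRESH_EVERY == 0


index = {"pos": [1.0, 0.0], "hard": [0.8, 0.6], "easy": [-1.0, 0.0]}
query = [1.0, 0.0]

hard_ids = mine_hard_negatives(query, index, {"pos"}, n=1)
loss_hard = info_nce(query, index["pos"], [index[pid] for pid in hard_ids])
loss_easy = info_nce(query, index["pos"], [index["easy"]])
print(hard_ids, loss_hard > loss_easy)  # ['hard'] True
```

The comparison at the end shows why refreshing matters: an easy negative contributes almost no loss, so training signal dries up unless negatives are re-mined with the current query encoder.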

Agent Features

Tool Use

LLM-as-a-Judge, LoRA, ANCE, BM25, LangChain

Frameworks

InfoNCE, AdamW, HNSW

Optimization Features

Infra Optimization

reduced storage and update cost for per-tenant models

Model Optimization

LoRA, linear embedding projection, small FFN head on embeddings

System Optimization

freeze document encoder to avoid re-indexing

small per-tenant parameter checkpoints (0.4–3% of model)
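For intuition on why per-tenant checkpoints stay small, the arithmetic below counts LoRA parameters for a single adapted weight matrix. A rank-r adapter on a d_out × d_in matrix adds r·(d_in + d_out) trainable parameters. The 4096-wide square matrix is a hypothetical size, not a figure from the paper, and the whole-model fraction depends on which matrices are adapted.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the frozen weight matrix it modifies."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical 4096 x 4096 projection matrix at the ranks the summary calls practical.
for rank in (32, 64):
    print(rank, lora_params(4096, 4096, rank), lora_fraction(4096, 4096, rank))
# 32 262144 0.015625
# 64 524288 0.03125
```

Under these assumed dimensions the adapter is about 1.6% to 3.1% of the matrix it adapts, the same order of magnitude as the per-tenant checkpoint sizes quoted above.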

Training Optimization

asynchronous ANCE hard-negative mining

InfoNCE with mined hard negatives

Inference Optimization

index-preserving query-only adaptation (no document re-embedding)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://huggingface.co/datasets/devrev/search

Risks & Boundaries

Limitations

LLM-as-judge can introduce bias and errors despite 10% human validation.

DevRev-Search is small (291 train queries), which may limit generalization to other enterprise domains.

When Not To Use

When you must update document encoder representations (you will need to re-index).

When you have abundant, high-quality human labels and can afford joint encoder updates.

Failure Modes

Representation collapse if hard negatives are not refreshed; ANCE was required to stabilize training.

Overfitting at very high LoRA ranks for some models (peaked profiles reported).

Core Entities

Models

gemini-embedding-001, text-embedding-3-large, embed-english-v3, Qwen3-Embedding-8B, GTE-Qwen2-7B-Instruct, SFR-Embedding-Mistral, BM25, arctic-l-v2, snowflake-arctic-embed-l-v2.0, qwen3-4b, LoRA

Metrics

recall@10, ndcg@10, recall@420

Datasets

DevRev-Search, SciFact, FiQA-2018, MS-MARCO

Benchmarks

DevRev-Search

