Overview
The methods are practical and validated on three datasets (DevRev-Search, SciFact, and FiQA-2018). The main limitations are dataset size and reliance on LLM judgments, but the results show clear cost-quality tradeoffs for multi-tenant deployment.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
You can adapt search models per tenant without re-indexing documents, cutting compute, operational risk, and per-tenant storage by orders of magnitude while keeping retrieval quality close to joint fine-tuning.
Summary TLDR
This paper builds DevRev-Search, an automatically labeled enterprise retrieval benchmark, and shows that search can be adapted per tenant by fine-tuning only the query encoder. Labels are produced by taking the union of seven retrievers' results and filtering the candidates with an LLM judge. Freezing the document index (no re-indexing) and using parameter-efficient tuning (LoRA, small FFNs, or a few unfrozen layers) matches or closely approaches full bi-encoder tuning on DevRev-Search, SciFact, and FiQA-2018. Key concrete wins: a high-coverage candidate pool (the retriever union reaches up to 420 candidates per query), LoRA at practical ranks (32–64) that gives near-full performance while using ~1–3% of model parameters, and asynchronous ANCE hard‑
Problem Statement
Enterprise search has two bottlenecks: 1) tenants have lots of unlabeled proprietary 'dark data' so supervised adaptation is hard; 2) jointly fine-tuning query+document encoders forces re-embedding and re-indexing the whole corpus, which is computationally and operationally prohibitive for thousands of tenants.
Main Contribution
DevRev-Search: a public, automatically labeled enterprise passage retrieval benchmark built from production support queries and segmented docs.
A fully automated pipeline: union of seven diverse retrievers to maximize candidate coverage, followed by an LLM-as-a-judge filter to improve precision.
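The union-then-filter pipeline can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Retriever` and `Judge` callables, the `top_k` cutoff, and the toy retrievers are all hypothetical stand-ins. In practice the judge would wrap an LLM call that returns a relevance verdict.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical interfaces (assumptions, not from the paper):
# a retriever maps a query to a ranked list of passage ids,
# a judge returns True when a passage is relevant to the query.
Retriever = Callable[[str], List[str]]
Judge = Callable[[str, str], bool]


def union_candidates(query: str, retrievers: Iterable[Retriever], top_k: int) -> List[str]:
    """Union the top_k results of every retriever, keeping first-seen order."""
    seen: Dict[str, None] = {}
    for retrieve in retrievers:
        for passage_id in retrieve(query)[:top_k]:
            seen.setdefault(passage_id, None)
    return list(seen)


def label_query(query: str, retrievers: Iterable[Retriever], judge: Judge, top_k: int = 10) -> List[str]:
    """Keep only the union candidates that the judge accepts as relevant."""
    candidates = union_candidates(query, retrievers, top_k)
    return [pid for pid in candidates if judge(query, pid)]


# Toy stand-ins for two retrievers and an LLM judge.
def lexical(query: str) -> List[str]:
    return ["p1", "p2", "p3"]


def dense(query: str) -> List[str]:
    return ["p2", "p4"]


def toy_judge(query: str, passage_id: str) -> bool:
    return passage_id in {"p2", "p4"}


labels = label_query("reset api token", [lexical, dense], toy_judge, top_k=2)
print(labels)
```

The union step favors recall (any retriever can surface a passage), and the judge step restores precision by discarding candidates it rejects.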
Key Findings
Automated benchmark with dense relevance per query
Query-only fine-tuning matches joint (query+doc) tuning in many cases
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| recall@10 (DevRev-Search) | Base 0.256 → Q 0.296 → QD 0.314 | Base | Q vs QD: -0.018 | DevRev-Search | Query-only (Q) substantially improves vs base and is competitive with query+doc (QD) | Table 1 |
| recall@10 (SciFact) | QD 0.949 → Q 0.953 (qwen3-4b) | QD | Q vs QD: +0.004 | SciFact | Query-only marginally outperforms joint tuning for qwen3-4b on SciFact | Table 1 |
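For reference, the metric and the Delta column in the table above can be reproduced with a few lines. This sketch assumes the common definition of recall@k (relevant items found in the top k, divided by all relevant items); the paper may use a variant, so treat the function as illustrative. The delta values are taken directly from the table.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int = 10) -> float:
    """Fraction of relevant passages that appear in the top k of the ranking."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(ranked[:k]) & relevant_set)
    return hits / len(relevant_set)


# Deltas from the table: query-only (Q) minus joint query+doc (QD) tuning.
delta_devrev = round(0.296 - 0.314, 3)   # DevRev-Search
delta_scifact = round(0.953 - 0.949, 3)  # SciFact, qwen3-4b
print(delta_devrev, delta_scifact)
```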
What To Try In 7 Days
Run union of 3–7 diverse retrievers on recent query logs, then filter top candidates with an LLM-as-judge prompt.
Fine-tune only the query encoder using LoRA (r=32) or a small FFN head; evaluate recall@10 and ndcg@10 on held-out queries.
Add asynchronous ANCE hard-negative updates (every ~200 steps) to stabilize InfoNCE training.
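The training objective behind the last two steps can be sketched in plain Python. This is a minimal illustration of an InfoNCE loss over one positive and a set of hard negatives, plus a step-based refresh check. The temperature of 0.05, the function names, and the use of raw dot products are assumptions; only the ~200-step refresh interval comes from the text above. A real run would use a tensor library and re-mine negatives from the frozen document index with the current query encoder.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce(query: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], temperature: float = 0.05) -> float:
    """InfoNCE: negative log-probability of the positive among all candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def should_refresh_negatives(step: int, every: int = 200) -> bool:
    """True on the steps where hard negatives would be re-mined."""
    return step > 0 and step % every == 0


q, pos = [1.0, 0.0], [1.0, 0.0]
easy_loss = info_nce(q, pos, [[0.0, 1.0]])
hard_loss = info_nce(q, pos, [[0.9, 0.1]])
print(easy_loss < hard_loss)
```

Negatives that score close to the positive contribute most of the loss, which is why stale negatives (ones the model already separates easily) give little training signal.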
Risks & Boundaries
Limitations
LLM-as-judge can introduce bias and errors despite 10% human validation.
DevRev-Search is small (291 train queries), which may limit generalization to other enterprise domains.
When Not To Use
When you must update document encoder representations (you will need to re-index).
When you have abundant, high-quality human labels and can afford joint encoder updates.
Failure Modes
Representation collapse if hard negatives are not refreshed; ANCE was required to stabilize training.
Overfitting at very high LoRA ranks for some models (peaked profiles reported).
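To get a feel for how LoRA rank scales the number of trainable parameters, the standard count for one adapted weight matrix is r × (d_in + d_out), since LoRA adds an r × d_in matrix and a d_out × r matrix. The sketch below applies that formula to a hypothetical 2560 × 2560 projection. The dimension is an illustrative assumption, and the result is a per-matrix fraction, not the whole-model percentage reported in the paper, which depends on which matrices are adapted.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA parameters relative to the full matrix they adapt."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical square projection of width 2560 (assumed for illustration).
for rank in (32, 64):
    print(rank, lora_params(2560, 2560, rank), lora_fraction(2560, 2560, rank))
```

The count grows linearly with rank, so doubling the rank doubles the adapter's capacity to fit a small training set.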

