Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can adapt search models per tenant without re-indexing documents, cutting compute, operational risk, and per-tenant storage by orders of magnitude while keeping retrieval quality close to joint fine-tuning.
Summary TLDR
This paper builds DevRev-Search, an automatically labeled enterprise retrieval benchmark, and shows that search can be adapted per tenant by fine-tuning only the query encoder. Labels are produced by unioning the results of seven retrievers and filtering the candidates with an LLM judge. Freezing the document index (no re-indexing) and using parameter-efficient tuning (LoRA, small FFNs, or a few unfrozen layers) matches or closely approaches full bi-encoder tuning on DevRev-Search, SciFact, and FiQA-2018. Key concrete wins: a high-coverage candidate pool (the union yields up to 420 candidates per query), LoRA at practical ranks (32–64) gives near-full performance while using ~1–3% of model params, and asynchronous ANCE hard‑
Problem Statement
Enterprise search has two bottlenecks: (1) tenants hold large amounts of unlabeled proprietary "dark data", so supervised adaptation is hard; (2) jointly fine-tuning the query and document encoders forces re-embedding and re-indexing the whole corpus, which is computationally and operationally prohibitive across thousands of tenants.
Main Contribution
DevRev-Search: a public, automatically labeled enterprise passage retrieval benchmark built from production support queries and segmented docs.
A fully automated pipeline: union of seven diverse retrievers to maximize candidate coverage, followed by an LLM-as-a-judge filter to improve precision.
Index-preserving Query-Only Adaptation: freeze the document encoder and index and fine-tune only the query encoder, combined with PEFT (LoRA, linear/FFN heads, partial unfreezing) to enable cheap per-tenant adaptation.
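The labeling pipeline described above (pool candidates from several retrievers, then keep only what a judge accepts) can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the `Retriever` and `Judge` interfaces, the `top_k` value, and the toy retrievers and judge are all hypothetical stand-ins. In practice the judge would wrap an LLM call.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical interfaces for illustration only (not from the paper).
Retriever = Callable[[str], List[str]]  # query -> ranked chunk ids
Judge = Callable[[str, str], bool]      # (query, chunk id) -> judged relevant?


def union_candidates(query: str, retrievers: Iterable[Retriever], top_k: int) -> List[str]:
    """Pool the top_k results of every retriever, deduplicated, in first-seen order."""
    pool: Dict[str, None] = {}  # a dict preserves insertion order
    for retrieve in retrievers:
        for chunk_id in retrieve(query)[:top_k]:
            pool.setdefault(chunk_id, None)
    return list(pool)


def label_query(query: str, retrievers: Iterable[Retriever], judge: Judge, top_k: int = 3) -> List[str]:
    """Return the pooled candidates that the judge accepts as relevant."""
    candidates = union_candidates(query, retrievers, top_k)
    return [chunk_id for chunk_id in candidates if judge(query, chunk_id)]


# Toy stand-ins so the sketch runs end to end.
def lexical(query: str) -> List[str]:
    return ["c1", "c2", "c3"]


def dense(query: str) -> List[str]:
    return ["c3", "c4", "c1"]


def toy_judge(query: str, chunk_id: str) -> bool:
    return chunk_id in {"c1", "c4"}


labels = label_query("reset api token", [lexical, dense], toy_judge)
print(labels)  # ['c1', 'c4']
```

The union step trades precision for coverage, which is why the judge filter follows it: a chunk only one retriever finds still enters the pool, and the judge decides whether it stays.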
Key Findings
Automated benchmark with dense relevance per query
Query-only fine-tuning matches joint (query+doc) tuning in many cases
LoRA and small heads match or exceed full query-only fine-tuning
Multi-retriever fusion increases candidate coverage substantially
Representation collapse without dynamic negatives is a training risk
Results
recall@10 (DevRev-Search)
recall@10 (SciFact)
recall@10 (DevRev-Search) — PEFT vs Full
Who Should Care
What To Try In 7 Days
Run a union of 3–7 diverse retrievers on recent query logs, then filter the top candidates with an LLM-as-judge prompt.
Fine-tune only the query encoder using LoRA (r=32) or a small FFN head; evaluate recall@10 and ndcg@10 on held-out queries.
Add asynchronous ANCE hard-negative updates (every ~200 steps) to stabilize InfoNCE training.
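The training step in the last item can be illustrated with a small, dependency-free sketch of InfoNCE over mined hard negatives. Several things here are assumptions rather than details from the paper: the temperature value, the helper names, and the use of a simple synchronous refresh check in place of ANCE's asynchronous background re-mining.

```python
import math
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(query: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.05) -> float:
    """-log(exp(s_pos/t) / (exp(s_pos/t) + sum_j exp(s_neg_j/t))), computed stably."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def mine_hard_negatives(query: Vector, doc_vectors: Dict[str, Vector],
                        positive_ids: Set[str], k: int) -> List[str]:
    """Ids of the k highest-scoring non-positive documents for the current query vector."""
    scored = sorted(
        ((dot(query, vec), doc_id) for doc_id, vec in doc_vectors.items()
         if doc_id not in positive_ids),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


def should_refresh(step: int, every: int = 200) -> bool:
    """Synchronous stand-in for the periodic negative refresh (assumed interval)."""
    return step > 0 and step % every == 0


docs = {"pos": [1.0, 0.0], "hard": [0.9, 0.1], "easy": [0.0, 1.0]}
q = [1.0, 0.0]
print(mine_hard_negatives(q, docs, {"pos"}, k=1))  # ['hard']
```

Because the document vectors are frozen, re-mining only needs the current query encoder: the negatives change as the query vectors move, while the index they are drawn from stays fixed.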
Agent Features
Tool Use
- LLM-as-a-Judge
- LoRA
- ANCE
- BM25
- LangChain
Frameworks
- InfoNCE
- AdamW
- HNSW
Optimization Features
Infra Optimization
- reduced storage and update cost for per-tenant models
Model Optimization
- LoRA
- linear embedding projection
- small FFN head on embeddings
System Optimization
- freeze document encoder to avoid re-indexing
- small per-tenant parameter checkpoints (0.4–3% of model)
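To see why per-tenant checkpoints stay small, it helps to count LoRA parameters directly. For one weight matrix of shape d_out x d_in, LoRA at rank r trains two factors with r * (d_in + d_out) parameters in total. The encoder dimensions below are hypothetical and chosen only to make the arithmetic concrete; they do not describe any model in the paper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight.

    A has shape rank x d_in and B has shape d_out x rank.
    """
    return rank * (d_in + d_out)


def adapter_fraction(base_params: int, hidden: int, layers: int,
                     matrices_per_layer: int, rank: int) -> float:
    """Adapter size as a fraction of the base model, for square hidden x hidden weights."""
    adapter = layers * matrices_per_layer * lora_params(hidden, hidden, rank)
    return adapter / base_params


# Hypothetical encoder: 500M parameters, 24 layers, hidden size 1024,
# with LoRA applied to two projection matrices per layer at rank 32.
fraction = adapter_fraction(base_params=500_000_000, hidden=1024,
                            layers=24, matrices_per_layer=2, rank=32)
print(f"{fraction:.2%}")  # 0.63%
```

The fraction grows linearly with rank and with the number of adapted matrices, so the per-tenant storage cost can be estimated before training anything.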
Training Optimization
- asynchronous ANCE hard-negative mining
- InfoNCE with mined hard negatives
Inference Optimization
- index-preserving query-only adaptation (no document re-embedding)
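A toy sketch of what "index-preserving" means in practice: the document vectors are never touched, and only the query vector is transformed before search. The low-rank residual head used here (q' = q + scale * B(Aq)) is an illustrative assumption in the spirit of the small heads listed above, not the paper's exact architecture, and the brute-force `search` function stands in for an ANN index.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def adapt_query(q: Vector, a: Matrix, b: Matrix, scale: float = 1.0) -> Vector:
    """Low-rank residual on the query embedding: q + scale * B(Aq)."""
    low = matvec(a, q)      # project down to rank r
    delta = matvec(b, low)  # project back up to the embedding size
    return [x + scale * dx for x, dx in zip(q, delta)]


def search(q: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Brute-force inner-product search over a frozen index."""
    ranked = sorted(index.items(),
                    key=lambda item: -sum(x * y for x, y in zip(q, item[1])))
    return [doc_id for doc_id, _ in ranked[:k]]


INDEX = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}  # frozen document vectors
A = [[1.0, 0.0]]                               # rank 1, embedding size 2
B_ZERO = [[0.0], [0.0]]                        # zero init: adapter is a no-op
B_TUNED = [[-1.0], [2.0]]                      # hypothetical tenant-specific weights

q = [1.0, 0.2]
print(search(adapt_query(q, A, B_ZERO), INDEX, k=2))   # ['d1', 'd2']
print(search(adapt_query(q, A, B_TUNED), INDEX, k=2))  # ['d2', 'd1']
```

With B initialized to zero the adapted query equals the original, so retrieval starts from the base model's behavior. Swapping tenants means swapping only A and B, while the shared index is reused as is.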
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLM-as-judge can introduce bias and errors despite 10% human validation.
- DevRev-Search is small (291 train queries), which may limit generalization to other enterprise domains.
- Relies on proprietary embedding/LLM models for candidate generation and filtering.
- Paper evaluates passage retrieval only; full reranking or cross-encoder steps are not covered.
When Not To Use
- When you must update document encoder representations (you will need to re-index).
- When you have abundant, high-quality human labels and can afford joint encoder updates.
- For tasks that require cross-encoder interaction or fine-grained passage reranking beyond bi-encoder retrieval.
Failure Modes
- Representation collapse if hard negatives are not refreshed; ANCE was required to stabilize training.
- Overfitting at very high LoRA ranks for some models (peaked profiles reported).
- LLM judge may accept superficially similar but non-answering chunks or reject partial-but-useful chunks.
Core Entities
Models
- gemini-embedding-001
- text-embedding-3-large
- embed-english-v3
- Qwen3-Embedding-8B
- GTE-Qwen2-7B-Instruct
- SFR-Embedding-Mistral
- BM25
- arctic-l-v2
- snowflake-arctic-embed-l-v2.0
- qwen3-4b
- LoRA
Metrics
- recall@10
- ndcg@10
- recall@420
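For reference, the metrics above can be computed as in the following sketch. It assumes binary relevance labels and a ranked list without duplicate ids; the function names are illustrative. recall@420 is the same recall function evaluated with k=420.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c"]
relevant = {"a", "c"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5
print(round(ndcg_at_k(ranked, relevant, k=3), 4))
```

Recall ignores position within the top k, while nDCG rewards placing relevant chunks earlier, so reporting both shows whether an adaptation improves coverage, ordering, or both.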
Datasets
- DevRev-Search
- SciFact
- FiQA-2018
- MS-MARCO
Benchmarks
- DevRev-Search

