Dense Retrieval Papers — Parsed & Scored for Practitioners

Open-source toolkit, benchmark, and a retrieval-augmented LLM that proves Lean theorems after one GPU-week of training

0.60

0.65

0.70

38

LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.

Key finding

Retrieval improves end-to-end proving rates.

Numbers: ReProver Pass@1 51.2% vs non-retrieval baseline 47.6% (random split)

Practical survey of retrieval-augmented generation (RAG): how retrievers, fusion methods, training and benchmarks fit together

0.80

0.25

0.65

13

RAG lets you add up-to-date, domain-specific facts without costly model retraining and reduces hallucinations by grounding outputs in external knowledge.

Key finding

Latent fusion (pretrained retrieval-enhanced models) can match much larger LLMs by using retrieval databases.

Numbers: RETRO: 2T-token DB, performance comparable to GPT-3 with ~25x fewer parameters

Survey: How to add, update, and use external knowledge with large language models

0.60

0.50

0.60

8

Keeping LLMs accurate protects user trust and reduces legal risk: use prompt/input edits for cheap, fast fixes, model editing for durable updates, and retrieval for up-to-date answers when models show low confidence.

Key finding

Most knowledge-editing evaluations focus on triple-fact QA benchmarks like ZsRE and CounterFact.

Numbers: ZsRE: 182,282; CounterFact: 21,919

Domain-specific RAG cuts hallucinated citations in ophthalmology long-form answers

0.60

6

RAG with a focused domain corpus can substantially reduce fabricated citations, improving traceability for consumer health features, but it may require extra tuning to avoid small drops in perceived answer quality.

Key finding

RAG greatly increased the share of correct references in LLM outputs.

Numbers: Correct refs: 20.6% → 54.5% (252 vs 277 total refs)

Neural retrievers prefer LLM-generated text — datasets, causes, and a plug-in fix

0.60

0.50

5

If search or recommendation systems prefer LLM-generated content, human creators may lose visibility and ranking can be manipulated; businesses must audit source bias to protect content quality and trust.

Key finding

Neural retrievers prefer LLM-generated documents over semantically equivalent human text.

Numbers: ANCE Relative Δ NDCG@1 = -47.0% (SciFact+AIGC), Contriever Relative Δ NDCG@1 = -25.5%

Bailicai: a medical RAG system that gates retrieval, decomposes tasks with DAGs, and fine-tunes on curated medical data

0.70

5

Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.

Key finding

Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.

Numbers: Average = 71.82% (Table V)

DomainRAG: a Chinese benchmark testing how RAG helps LLMs solve college-enrollment questions

0.60

0.45

0.60

4

For domain or expert apps, adding domain documents to an LLM pipeline is essential: verified references can move a model from near-random to high accuracy on factual Q&A.

Key finding

Providing golden domain references massively improves exact-match (EM) accuracy on extractive questions.

Numbers: GPT-3.5 EM: closed-book 0.1929 → golden 0.9233 (extractive, Table 2)

PNCExtract: a full-paper benchmark and LLM prompts to pull polymer nanocomposite samples

0.30

0.50

0.45

3

Automating extraction of polymer nanocomposite compositions speeds dataset creation for materials discovery but zero-shot LLMs still miss many entries, so expect a hybrid workflow with LLM-assisted triage plus human validation.

Key finding

The benchmark covers 193 full-length papers with 1,052 ground-truth samples.

Numbers: 193 papers; 1,052 ground-truth samples

Cross-encoder re-ranking boosts faithfulness of RAG for CDC policy Q&A

0.60

0.40

0.60

2

Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.

Key finding

Advanced RAG produced the highest grounding quality.

Numbers: Avg faithfulness: Vanilla 0.35 → Basic 0.62 → Advanced 0.80 (Table I)
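
The two-stage pattern behind this entry can be sketched as follows: a cheap first stage pulls candidates, then a pairwise scorer reorders them. This is a minimal illustration, not the paper's pipeline. The function names, the toy documents, and `toy_pair_score` (a Jaccard-overlap stand-in for a real cross-encoder) are all assumptions made for the example.

```python
from typing import Callable, List


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenizer; a stand-in for a real analyzer."""
    return set(text.lower().split())


def first_stage(query: str, docs: List[str], k: int) -> List[str]:
    """Cheap candidate retrieval: rank by raw token overlap with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: Jaccard overlap of the pair."""
    q, d = tokenize(query), tokenize(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], top_n: int) -> List[str]:
    """Second stage: rescore each (query, candidate) pair and keep the best."""
    scored = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return scored[:top_n]


# Toy corpus (invented for illustration only).
docs = [
    "masks are sold indoors and gloves are recommended for cleaning staff",
    "masks recommended indoors",
    "vaccination schedule for adults",
]
query = "are masks recommended indoors"

candidates = first_stage(query, docs, k=2)   # long doc ranks first on raw overlap
best = rerank(query, candidates, toy_pair_score, top_n=1)  # short doc is promoted
```

In a real system the pairwise scorer would be a trained cross-encoder, which is where the extra compute mentioned in the summary comes from: it runs once per (query, candidate) pair rather than once per document at index time.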

Dual-source iterative retrieval (EHR + corpus) that pulls facts into RAG to boost medical QA on long clinical notes

0.60

2

RGAR boosts answer accuracy on tasks involving long clinical notes while keeping inference cost lower than heavier iterative RAG pipelines, so teams can get clinically stronger retrieval without scaling model size.

Key finding

RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.

Numbers: Avg accuracy +11.91% over Custom baseline

Practical recipe and baseline for multilingual RAG across 13 languages

0.70

0.40

0.60

2

Multilingual RAG lets products answer factual questions in many languages by combining strong multilingual retrieval and tuned prompts, expanding reach and reducing wrong answers for non-English users.

Key finding

RAG substantially increases answer recall vs no retrieval on evaluated QA sets.

Numbers: MKQA English recall 58.4 → 70.2; Arabic 26.4 → 45.9 (Table 1)

RAG helps up to ~10–15 context snippets; model and retriever choice strongly shape results

0.60

0.40

0.45

1

When building a RAG product, supplying ~10–15 curated snippets gives the best return: more context adds cost and can add noise. Retrieval quality and reader model must be tuned to your domain.

Key finding

Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.

Numbers: Mixtral BioASQ entailment: 0→10 snippets 29.4%→50.7% (+21.3pp); open-retrieval stalls after 15–20 snippets.
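
The "+21.3pp" figure is a percentage-point difference, which is easy to confuse with a relative gain. A small sketch of both calculations, using the two values quoted above; the helper names are made up for this example.

```python
def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points (pp)."""
    return round(after_pct - before_pct, 1)


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return round(100 * (after_pct - before_pct) / before_pct, 1)


# Values quoted in the entry above: 29.4% at 0 snippets, 50.7% at 10 snippets.
pp = point_gain(29.4, 50.7)
rel = relative_gain(29.4, 50.7)
```

With these inputs the point gain matches the quoted +21.3pp, while the relative gain is roughly 72%, so it matters which of the two a report is citing.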

Clinical-note Q&A by RAG: Wizard Vicuna gives high accuracy; quantization cuts latency ~48x

0.45

0.40

0.65

1

RAG lets teams extract factual details from clinical notes without costly model re-training; quantization makes high-capacity models usable in production by cutting latency and GPU cost.

Key finding

Wizard Vicuna (13B) + SentenceTransformers reached the top single-document accuracy.

Numbers: 80% accuracy (single-doc eval, 5 QA pairs)

FeB4RAG — a federated-search dataset built for modern RAG pipelines

0.60

0.50

1

When you feed retrieval results to an LLM, which resources you query and how you merge results materially affects answer quality, cost and latency.

Key finding

FeB4RAG contains 790 user requests across 16 simulated search engines derived from BEIR.

Numbers: 790 requests; 16 datasets; collection size 36.9M docs

Customize RAG for EDA docs: domain-tuned retriever, reranker, generator + ORD-QA benchmark

0.70

0.60

1

Specialized RAG reduces wrong answers on complex EDA docs, improving self-serve support and lowering costly human support for tooling documentation.

Key finding

Domain-finetuned embedding improves dense retrieval recall.

Numbers: recall@20: 0.733 (ours) vs 0.66 (bge-large) and 0.634 (text-embedding-ada-002)
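
For readers comparing recall@20 figures like these, the metric can be sketched as below. This assumes the common definition (share of a query's relevant documents that appear in the top k, averaged over queries); the listing does not say which variant the paper uses, and the document IDs are invented.

```python
from typing import Iterable, List, Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Iterable[str]]], k: int) -> float:
    """Average recall@k over (ranking, relevant set) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical single query: two of the three relevant docs were retrieved at all.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
r_at_2 = recall_at_k(ranked, relevant, k=2)
r_at_4 = recall_at_k(ranked, relevant, k=4)
```

Some papers instead report a hit rate (1 if any relevant document is in the top k), which gives higher numbers on the same rankings, so figures are only comparable when the definition matches.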

MultiFuzz: dense-retrieval + multi-agent LLMs to push RTSP fuzzing deeper

0.50

0.60

0.50

0

MultiFuzz finds modest but consistent extra code paths and protocol states in stateful services by using indexed protocol docs and cooperating LLM agents, which can reveal hard-to-reach bugs in production network stacks.

Key finding

MultiFuzz reached average branch coverage of 2940 branches on Live555 RTSP.

Numbers: avg branches=2940 (Table I)

IndicRAGSuite: a 13-language retrieval benchmark plus ~14M synthetic QA triplets for Indian-language RAG

0.60

0.50

0.60

0

If you build search, QA, or assistant features for Indian users, IndicRAGSuite provides both a standard test (IndicMSMARCO) and large training data (~14M triplets) to reduce development time and improve retrieval in many Indian languages.

Key finding

IndicMSMARCO provides a high-quality multilingual benchmark of real queries.

Numbers: 1000 queries; 13 languages

BRIEFME: a SCOTUS-briefs benchmark testing summarization, completion, and case retrieval

0.60

0.40

0.60

0

Automating headings and guided completion can speed legal drafting and document navigation; however, retrieval and placement are not reliable enough to omit expert review.

Key finding

Large LLMs already produce high-quality brief headings for summarization and guided completion.

Numbers: GPT-4o judge rating 4.3/5 vs human headings ~3.4/5 (summarization)

Probabilistic federated RAG that routes across product domains to boost multi-product QA

0.60

0

If product support queries span multiple products, probabilistic federated retrieval increases correct-document retrieval and improves answer quality without per-product LLM finetuning.

Key finding

MKP-QA outperforms baselines on retrieval and response quality.

Turn images and product text into millions of well‑matched SEO landing pages using VLM + LLM + CLIP

0.90

0.60

0.70

0

Automate landing-page generation from content to expand topic coverage, improve collection relevance, and increase organic search indexing with less manual curation.

Key finding

Attribute retrieval reaches very high recall on the public Fashion200K benchmark.

Numbers: Recall@10 = 99.7% on Fashion200K (Table 2, Sec 4.2.1)

Train first-stage dense retrievers from LLM search traces so they find theorems and code by reasoning, not keyword overlap.

0.60

0.70

0.60

0

If your product relies on retrieving concept-level knowledge or supporting LLM reasoning, switching to a reasoning-trained first-stage retriever can raise answer quality and be much more data-efficient than collecting large labeled datasets.

Key finding

RaDeR achieves top average on BRIGHT (nDCG@10 25.5) and beats strong baselines by ≥2 points.

Numbers: BRIGHT avg nDCG@10 = 25.5; ≥2 points over baselines
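
As background for the nDCG@10 figure, here is a minimal sketch of the metric using linear gains and a log2 position discount. Two simplifications are assumptions of this example: the ideal ordering is computed from the same judged list rather than from all judged documents, and some tools use an exponential gain (2^rel - 1) instead. Benchmarks often report the per-query mean multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the same gains sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the returned documents, in ranked order.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
swapped = ndcg_at_k([3, 2, 0, 1], k=4)  # last two results in the wrong order
```

A perfectly ordered list scores 1.0, and swapping two low-ranked results costs only a little, which reflects how heavily the metric weights the top positions.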

Multi-layer trainable pooling + bidirectional attention helps similarity and retrieval; trade-offs exist for clustering/classification

0.60

0.45

0.30

0

Small architecture changes in pooling and attention shift embedding quality between search/STS and clustering/classification. Choose pooling+attention by task: multi-layer pooling + bidirectional for search, simpler EOS-last + causal for clustering or classification.

Key finding

Multi-Layers Trainable Pooling + bidirectional attention (Model 5) gives the best STS and retrieval scores on Mistral-7B.

Numbers: STS +0.0166; Retrieval +0.0226 vs EOS-last+causal (Table 4)

Zero-shot LLM policy replaces lengthy RL training for controllable dialog planning and improves success in simulation and a user study

0.70

0.55

0.65

0

You can get higher dialog success without RL training and adapt instantly when domain graphs change. This reduces time-to-market for new dialog domains, cuts training costs, and lets you run local, controllable agents to avoid hallucinations in sensitive domains.

Key finding

CTS-LLM (GPT-4o-mini) raises dialog success over RL across three domains in simulation.

Numbers: REIMBURSE: 84.20% vs 73.86%; DIAGNOSE: 98.80% vs 76.31%; ONBOARD: 96.00% vs 73.61% (500 sims each)

Retrieve similar in-project proofs at each step to boost automated Coq proof synthesis

0.60

0.70

0

Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.

Key finding

Rango proves a larger share of benchmark theorems than prior tools.

Numbers: 3,325 / 10,396 = 32.0% theorems proven on CoqStoq benchmark