Retrieval Optimization Papers — Parsed & Scored for Practitioners

Practical survey of RAG: paradigms, core components, benchmarks, and engineering gaps

0.70

0.30

0.60

612

RAG lets you keep LLMs current and auditable by fetching external facts at inference time; this reduces hallucinations and speeds updates without retraining the base model.

Key finding

Surveyed RAG work covers a broad task and dataset space.

Numbers: 26 tasks; ~50 datasets

Use RAG + PCST to let LLMs 'chat' with very large textual graphs

0.60

0.70

22

If you need natural-language queries over large text-rich graphs, G-Retriever scales to huge graphs, speeds training and inference dramatically, and reduces wrong citations by returning the exact subgraph used to answer.

Key finding

G-Retriever lifts WebQSP Hit@1 from 57.05% (GraphToken) to 70.49% with frozen LLM prompt tuning and to 73.79% with LoRA tuning.

Numbers: WebQSP: GraphToken 57.05% → G-Retriever 70.49% → G-Retriever+LoRA 73.79%
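
Hit@1 is the share of questions whose top-ranked prediction matches a gold answer. A minimal sketch of that metric follows, assuming case-insensitive exact string matching; the function name `hit_at_k` and the sample data are illustrative and not taken from the paper.

```python
def hit_at_k(predictions, gold_answers, k=1):
    """Fraction of questions with at least one gold answer in the top-k predictions.

    predictions: list of ranked answer lists, one per question.
    gold_answers: list of sets of acceptable answers, one per question.
    Matching rule (an assumption): case-insensitive exact string match.
    """
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for ranked, gold in zip(predictions, gold_answers):
        gold_norm = {g.strip().lower() for g in gold}
        top_k = [p.strip().lower() for p in ranked[:k]]
        if any(p in gold_norm for p in top_k):
            hits += 1
    return hits / len(predictions)


# Illustrative data only: three questions with ranked predictions.
preds = [["Paris", "Lyon"], ["Berlin", "Munich"], ["Rome"]]
gold = [{"paris"}, {"munich"}, {"Milan"}]
hit1 = hit_at_k(preds, gold, k=1)  # only the first question is hit at rank 1
hit2 = hit_at_k(preds, gold, k=2)  # the second question is hit at rank 2
```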

A public dataset and baseline results showing RAG struggles on multi-hop questions that need evidence from multiple documents

0.40

0.60

0.30

13

Multi-hop questions (e.g., cross-document finance or product research) are common and current RAG systems often miss key evidence; improving retrieval and reranking yields bigger gains than swapping LLMs alone.

Key finding

Dataset size and mix: 2,556 multi-hop queries drawn from 609 news articles.

Numbers: 2,556 queries; 609 articles; avg 2,046 tokens/article

Practical survey of retrieval-augmented generation (RAG): how retrievers, fusion methods, training and benchmarks fit together

0.80

0.25

0.65

13

RAG lets you add up-to-date, domain-specific facts without costly model retraining and reduces hallucinations by grounding outputs in external knowledge.

Key finding

Latent fusion (pretrained retrieval-enhanced models) can match much larger LLMs by using retrieval databases.

Numbers: RETRO: 2T-token DB, performance comparable to GPT-3 with ~25x fewer parameters

Jointly train retriever and medical LLM to improve accuracy, reduce hallucinations, and cut training cost

0.60

0.80

11

Joint retriever+LLM fine-tuning yields better medical QA accuracy and explanations while cutting training compute by orders of magnitude versus large-scale domain pretraining, making domain-specialized models cheaper and faster to build.

Key finding

JMLR-13B achieves the highest reported average accuracy across evaluated medical QA sets.

Numbers: Avg accuracy 70.5% (JMLR-13B) vs 68.9% (Meditron-70B)

RAG-grounded LLMs improve agent reply suggestions vs BERT and expose a retrieval/latency trade-off

0.70

0.40

0.60

5

RAG-grounded LLMs give agents more accurate, relevant replies than a BERT pair-matching system, cutting agent search time and likely reducing handling time.

Key finding

RAG responses scored much higher on human-evaluated accuracy than BERT.

Numbers: Accuracy +45.69% (human eval, Table 4)

Bailicai: a medical RAG system that gates retrieval, decomposes tasks with DAGs, and fine-tunes on curated medical data

0.70

5

Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.

Key finding

Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.

Numbers: Average = 71.82% (Table V)

Cross-encoder re-ranking boosts faithfulness of RAG for CDC policy Q&A

0.60

0.40

0.60

2

Two-stage retrieval reduces factually incorrect outputs in policy applications, increasing trustworthiness at the cost of extra compute and complexity.

Key finding

Advanced RAG produced the highest grounding quality.

Numbers: Avg faithfulness: Vanilla 0.35 → Basic 0.62 → Advanced 0.80 (Table I)
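
The two-stage pattern behind this result is: retrieve a candidate set with a cheap scorer, then re-order it with a more expensive scorer before passing context to the LLM. The sketch below shows only that control flow. Both scoring functions are toy stand-ins written for illustration (token overlap in place of a bi-encoder, coverage plus a phrase bonus in place of a cross-encoder), and the corpus is made up.

```python
def tokenize(text):
    return text.lower().split()


def cheap_score(query, doc):
    """Stage 1 stand-in for a bi-encoder: number of shared unique tokens."""
    return len(set(tokenize(query)) & set(tokenize(doc)))


def expensive_score(query, doc):
    """Stage 2 stand-in for a cross-encoder: query coverage plus a phrase bonus."""
    q = tokenize(query)
    d = tokenize(doc)
    if not q:
        return 0.0
    coverage = sum(1 for t in q if t in d) / len(q)
    phrase_bonus = 1.0 if " ".join(q) in " ".join(d) else 0.0
    return coverage + phrase_bonus


def retrieve_then_rerank(query, corpus, first_k=3, final_k=1):
    # Stage 1: cheap scoring over the whole corpus, keep first_k candidates.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:first_k]
    # Stage 2: expensive scoring over the short candidate list only.
    reranked = sorted(candidates, key=lambda doc: expensive_score(query, doc), reverse=True)
    return reranked[:final_k]


# Made-up corpus for illustration.
corpus = [
    "schools guidance on mask cleaning and mask storage for schools",
    "updated mask guidance schools should follow this fall",
    "travel guidance for international flights",
    "vaccination schedule for adults",
]
top = retrieve_then_rerank("mask guidance schools", corpus)
```

The first two documents tie in stage 1; the re-ranker is what moves the second one to the top. The extra cost scales with `first_k`, which is the compute trade-off noted above.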

Dual-source iterative retrieval (EHR + corpus) that pulls factual knowledge into RAG to boost medical QA on long clinical notes

0.60

2

RGAR boosts answer accuracy on tasks involving long clinical notes while keeping inference cost lower than heavier iterative RAGs, so teams can get clinically stronger retrieval without scaling model size.

Key finding

RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.

Numbers: Avg accuracy +11.91% over Custom baseline

Not all retrieval noise is bad: some noise types consistently help LLMs, others break them

0.60

0.70

0.50

2

Retrieval noise can both harm and help RAG systems, so blindly filtering everything labeled 'noise' can discard useful signal. Quick checks for counterfactual and prior-error noise prevent the biggest failures, while controlled use of some noisy signals can boost accuracy by several points.

Key finding

RAG noise falls into two practical groups: beneficial and harmful.

Practical survey: a five‑phase Query Optimization Lifecycle and taxonomy for LLM-based RAG systems

0.60

0.50

0.70

2

Better queries reduce hallucination and improve downstream answer quality; matching optimization to query types saves API cost and improves customer trust.

Key finding

Query optimization is essential: retrieval quality strongly determines final answer quality in RAG.

Marathon: a multiple-choice benchmark that stresses LLMs with very long documents (up to ~260K chars)

0.60

2

If you publish or productize long-document QA, use retrieval with document embeddings: it gives consistent accuracy gains over naively feeding very long text and helps make outputs traceable.

Key finding

Embedding-based RAG improves average accuracy versus vanilla baseline.

Numbers: Vanilla avg 41.76% → OpenAI RAG avg 50.46% → Jina RAG avg 53.10%

Reuse multimodal KV caches at any position to cut first-token latency and double serving throughput

0.60

0.70

1

For multimodal services that reuse the same images or files, MPIC can cut prefill latency roughly in half and double serving throughput, lowering per-request compute and improving capacity without changing model weights.

Key finding

MPIC-32 reduces Time-to-First-Token (TTFT) by up to 54.1% versus prefix caching.

Numbers: TTFT reduced up to 54.1% (Fig.9; §5.2)

HyCE: run validated HPC commands inside RAG so an LLM answers user-specific cluster questions

0.60

0.50

1

HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.

Key finding

Adding HyCE to a baseline RAG raised the automatic evaluation score.

Numbers: 77.67% → 82.33% (Δ +4.66 percentage points)

Use multiple LLM agents to filter noisy retrieved documents and improve RAG accuracy without any training

0.60

0.65

0.55

1

MAIN-RAG adds a low-cost layer to existing RAG systems that reduces noisy context and often raises answer accuracy without model retraining, lowering compute waste and speeding deployment.

Key finding

MAIN-RAG improves QA accuracy over training-free baselines on evaluated datasets.

Numbers: 2–11% overall improvement; up to +6.1% (Mistral7B) and +12.0% (Llama3-8B) reported
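
A training-free filter of this kind can be sketched as: score each retrieved document with a judge, then keep the documents that clear a threshold derived from the score distribution. The rule below (keep scores at or above mean minus `n` standard deviations) is an assumption made for illustration, and the paper's exact scoring and threshold rule may differ. The scores are hard-coded here and would come from LLM judge agents in a real pipeline.

```python
from statistics import mean, pstdev


def filter_documents(docs, judge_scores, n=0.0):
    """Keep documents whose judge score is >= mean - n * std, best first.

    docs: list of document identifiers or texts.
    judge_scores: one relevance score per document (assumed to come from a judge).
    n: slack in standard deviations; larger n keeps more documents.
    """
    if len(docs) != len(judge_scores):
        raise ValueError("docs and judge_scores must have the same length")
    if not docs:
        return []
    threshold = mean(judge_scores) - n * pstdev(judge_scores)
    kept = [(d, s) for d, s in zip(docs, judge_scores) if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [d for d, _ in kept]


# Illustrative scores only.
docs = ["a", "b", "c", "d"]
scores = [0.9, 0.2, 0.7, -0.4]
strict = filter_documents(docs, scores, n=0.0)   # threshold is the mean, 0.35
lenient = filter_documents(docs, scores, n=1.0)  # threshold drops below 0.2
```

Because the threshold adapts to each query's score distribution, no fixed cutoff has to be tuned per dataset.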

FeB4RAG — a federated-search dataset built for modern RAG pipelines

0.60

0.50

1

When you feed retrieval results to an LLM, which resources you query and how you merge results materially affect answer quality, cost, and latency.

Key finding

FeB4RAG contains 790 user requests across 16 simulated search engines derived from BEIR.

Numbers: 790 requests; 16 datasets; collection size 36.9M docs
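
To make the "how you merge results" step concrete, here is a minimal sketch of one common merging method, reciprocal rank fusion (RRF). RRF is swapped in here as a familiar example and is not claimed to be the method evaluated in FeB4RAG; the engine names and document IDs are made up, and `k=60` is the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three search engines for one request.
engine_a = ["d1", "d2", "d3"]
engine_b = ["d2", "d4", "d1"]
engine_c = ["d2", "d1"]
fused = reciprocal_rank_fusion([engine_a, engine_b, engine_c])
```

Documents returned by several engines rise to the top, which is why the choice of which engines to query changes the final context as much as the merge rule does.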

Use mutual-information retrieval + entropy pruning to edit LLMs for multi-hop QA

0.70

0.60

1

RAE lets you update LLM answers on multi-step questions quickly and cheaply by changing context instead of model weights, reducing API cost and avoiding retraining.

Key finding

MI-based retrieval substantially raises multi-hop retrieval precision compared to embedding or probability baselines.

Numbers: P@1 up to 84.0% (RAE Llama2) vs 78.3% (SR Llama2) and 52.7% (embedding, 2-hop)

Break event extraction into detect+extract and add schema-aware retrieval to cut hallucination and raise F1

0.60

0.40

1

Decomposed, retrieval-enhanced prompting gives more accurate structured events without fine-tuning, reducing manual labeling and improving downstream dashboards and knowledge graphs in days rather than months.

Key finding

Retrieval-augmented examples (RAE) plus decomposition raise ACE05-EN F1 for GPT-4.

Numbers: +5.18 Trig-C, +6.29 Arg-C (GPT-4, 5-shot → 5-shot+RAE on ACE05-EN)

Compress KV cache per layer with a pyramid-shaped budget to cut memory while keeping long‑context performance

0.70

0.65

0.70

1

PyramidKV reduces GPU memory for long-context inference by large factors while keeping retrieval and QA performance, enabling RAG and few‑shot workflows on cheaper hardware.

Key finding

PyramidKV can match full‑KV accuracy in needle-in‑a‑haystack retrieval with tiny caches.

Numbers: LLaMA-3-70B, 8k context, KV=128: FullKV 100.0% vs PyramidKV 100.0%
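
A pyramid-shaped budget gives lower layers more KV cache entries and higher layers fewer, while holding the average per-layer budget fixed. The sketch below assumes a linear (arithmetic) decrease from the first to the last layer. The function name, its parameters, and the example values are illustrative, and the paper's exact allocation rule may differ.

```python
def pyramid_budgets(num_layers, avg_budget, min_budget):
    """Linearly decreasing per-layer KV budgets whose mean equals avg_budget.

    num_layers: number of transformer layers.
    avg_budget: average number of cached KV entries per layer.
    min_budget: budget of the last (top) layer; must not exceed avg_budget.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0 < min_budget <= avg_budget:
        raise ValueError("require 0 < min_budget <= avg_budget")
    if num_layers == 1:
        return [avg_budget]
    # Choose the first layer's budget so that (max + min) / 2 == avg.
    max_budget = 2 * avg_budget - min_budget
    step = (max_budget - min_budget) / (num_layers - 1)
    # Rounding can shift the total by a few entries when step is not an integer.
    return [round(max_budget - i * step) for i in range(num_layers)]


# Illustrative values: 5 layers, 128 entries per layer on average.
budgets = pyramid_budgets(num_layers=5, avg_budget=128, min_budget=32)
```

The total memory matches a uniform budget of 128 per layer; only the distribution across layers changes.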

UDA: a 2,965-document benchmark to stress-test RAG on messy PDFs, tables and numeric queries

0.65

0.55

0.45

1

If your product answers questions over real PDFs or financial reports, parsing and retrieval choices can change accuracy by tens of points; invest in indexing and retrieval before scaling model size.

Key finding

UDA contains 2,965 raw documents and 29,590 expert-annotated Q&A pairs.

Numbers: 2,965 documents; 29,590 Q&A (paper §3, Table 2)

Survey: how machine learning, LLMs, and agents are reshaping operating systems and the OS stack

0.45

0.55

0.60

1

AI techniques can reduce tail latency, improve throughput, lower storage errors, and cut datacenter costs, but require guardrails and staged deployment to avoid regressions and privacy risks.

Key finding

Lightweight ML in the kernel can sharply improve I/O predictability and throughput.

Numbers: LinnOS: up to 40% lower I/O latency; up to 3× throughput under contention

Store KV cache as compact PQ embeddings to fetch only relevant keys for long-context LLMs

0.70

0.60

0.70

0

PQCache reduces GPU memory needs for long-context LLMs while keeping or improving accuracy, lowering hardware cost and enabling longer-context features without costly GPU scaling.

Key finding

PQCache improves aggregate InfiniteBench scores versus prior selective-attention methods.

Numbers: +4.60% avg score vs baselines on InfiniteBench
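
Product quantization (PQ) stores each key as a few small centroid indices and estimates query-key scores from a lookup table, so only the most promising keys need to be fetched in full. The sketch below is a pure-Python illustration of that idea with tiny, hand-written codebooks. Real systems learn codebooks (typically with k-means) and work on tensors; every name and value here is an assumption for illustration, not the PQCache implementation.

```python
def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(key, codebooks):
    """Return one nearest-centroid index per sub-vector of the key."""
    parts = split(key, len(codebooks))
    return [
        min(range(len(cb)), key=lambda j: sq_dist(part, cb[j]))
        for part, cb in zip(parts, codebooks)
    ]


def approx_scores(query, codes, codebooks):
    """Approximate query-key inner products using per-sub-space lookup tables."""
    q_parts = split(query, len(codebooks))
    table = [[dot(qp, c) for c in cb] for qp, cb in zip(q_parts, codebooks)]
    return [sum(table[m][idx] for m, idx in enumerate(code)) for code in codes]


def top_k_keys(query, codes, codebooks, k):
    """Indices of the k keys with the highest approximate score."""
    scores = approx_scores(query, codes, codebooks)
    return sorted(range(len(codes)), key=lambda i: -scores[i])[:k]


# Hand-written codebooks: 2 sub-spaces, 2 centroids each (illustrative only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
keys = [
    [0.1, -0.1, 0.9, 0.1],
    [0.9, 1.1, 0.1, 0.8],
    [1.0, 0.9, 1.1, 0.0],
]
codes = [encode(key, codebooks) for key in keys]
selected = top_k_keys([1.0, 1.0, 0.0, 1.0], codes, codebooks, k=2)
```

Each 4-dimensional key is stored as two small integers, and scoring a query touches only the lookup table rather than the full keys.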

Reduce long-context LLM latency and keep accuracy past model input limits by reusing and re-positioning cached KV-contexts

0.60

0.70

0

CacheFocus lowers inference latency and keeps answer quality when LLMs must use many retrieved documents, saving compute cost on long-context production queries without retraining.

Key finding

CacheFocus cuts total 100-token generation time on long inputs compared to a naive baseline.

Numbers: 4K input (20 docs): total 4.611s → 3.162s (−31.4%)
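
The reported percentage is a relative reduction in total generation time, which can be checked from the two timings quoted above:

```latex
\[
\text{reduction}
= \frac{t_{\text{baseline}} - t_{\text{CacheFocus}}}{t_{\text{baseline}}}
= \frac{4.611 - 3.162}{4.611}
= \frac{1.449}{4.611}
\approx 0.314 \quad (31.4\%)
\]
```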

Use one-shot retrieval + light ML to run cheap, reliable UI tests at scale

0.80

0.50

0.80

0

You can automate most feature-level UI tests cheaply by combining a single retrieved example with ML-based element matching and calling LLMs only for ambiguous cases, cutting LLM spend while keeping test quality.

Key finding

CAT automates 90% of test tasks.

Numbers: 90% completion (Table 3, CAT)