Context Selection Papers — Parsed & Scored for Practitioners

Practical survey of RAG: paradigms, core components, benchmarks, and engineering gaps

0.70

0.30

0.60

612

RAG lets you keep LLMs current and auditable by fetching external facts at inference time; this reduces hallucinations and speeds updates without retraining the base model.

Key finding

Surveyed RAG work covers a broad task and dataset space.

Numbers: 26 tasks; ~50 datasets

RGB: a bilingual benchmark diagnosing how LLMs fail when using retrieved evidence

0.40

0.30

0.25

52

RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, which erodes user trust and creates legal and brand exposure in production.

Key finding

Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.

Numbers: ChatGPT accuracy 96.33% → 76.00% (noise ratio 0→0.8)

Have models write short reading notes per retrieved doc to ignore noise and say “unknown” when needed.

0.60

0.50

9

CON reduces incorrect answers caused by irrelevant retrieval and helps systems safely abstain on out-of-date or unknown queries, improving reliability in search and customer-facing QA products.

Key finding

CON improves average Exact Match over standard retrieve-then-read models when fine-tuning LLaMA-2 7B.

Numbers: EM +1.97 avg across NQ/TriviaQA/WebQ (Table 2)
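The note-taking idea in this entry can be approximated at the prompt level. The sketch below is only an illustration: the entry reports results from fine-tuning, not prompting, and the function name `build_note_prompt` and the instruction wording are assumptions rather than the paper's template.

```python
from typing import List


def build_note_prompt(question: str, docs: List[str]) -> str:
    """Assemble a prompt that asks for one short note per document before answering.

    The instruction wording is hypothetical, not taken from the paper.
    """
    lines = [f"Question: {question}", ""]
    for i, doc in enumerate(docs, start=1):
        lines.append(f"Document {i}: {doc}")
    lines.append("")
    lines.append(
        "For each document, write a one-sentence note on whether it helps "
        "answer the question."
    )
    lines.append(
        'Then give the final answer. If no document is relevant and you do '
        'not know the answer, reply "unknown".'
    )
    return "\n".join(lines)


prompt = build_note_prompt(
    "Who wrote Dune?",
    ["Dune is a 1965 novel by Frank Herbert.", "Sand dunes are shaped by wind."],
)
print(prompt)
```

Numbering the documents makes it easy to check that the model produced one note per document before trusting its final answer.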

Survey: How to add, update, and use external knowledge with large language models

0.60

0.50

0.60

8

Keeping LLMs accurate protects user trust and reduces legal risk: use prompt/input edits for cheap, fast fixes, model editing for durable updates, and retrieval for up-to-date answers when models show low confidence.

Key finding

Most knowledge-editing evaluations focus on triple-fact QA benchmarks like ZsRE and CounterFact.

Numbers: ZsRE: 182,282; CounterFact: 21,919
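The "retrieve when the model shows low confidence" advice in this entry can be sketched as a simple gate. Everything here is an assumption for illustration: the `generate` and `retrieve` callables, the 0.6 threshold, and the toy stand-ins are hypothetical, and how a real system estimates confidence is left open.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: a generator returns (answer, confidence in [0, 1]).
Generator = Callable[[str, Optional[List[str]]], Tuple[str, float]]
Retriever = Callable[[str], List[str]]


def answer_with_gate(
    question: str,
    generate: Generator,
    retrieve: Retriever,
    threshold: float = 0.6,  # assumed cut-off, to be tuned per application
) -> Tuple[str, bool]:
    """Answer directly when confident; otherwise retrieve and answer again.

    Returns the answer and a flag saying whether retrieval was used.
    """
    draft, confidence = generate(question, None)
    if confidence >= threshold:
        return draft, False
    docs = retrieve(question)
    grounded, _ = generate(question, docs)
    return grounded, True


# Toy stand-ins so the sketch runs without a model or an index.
def toy_generate(question: str, docs: Optional[List[str]]) -> Tuple[str, float]:
    if docs:
        return f"Based on {len(docs)} document(s): {docs[0]}", 0.9
    known = {"capital of France?": ("Paris", 0.95)}
    return known.get(question, ("not sure", 0.2))


def toy_retrieve(question: str) -> List[str]:
    return ["The release notes list version 3.2 as current."]


print(answer_with_gate("capital of France?", toy_generate, toy_retrieve))
print(answer_with_gate("current version?", toy_generate, toy_retrieve))
```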

RAG-grounded LLMs improve agent reply suggestions vs BERT and expose a retrieval/latency trade-off

0.70

0.40

0.60

5

RAG-grounded LLMs give agents more accurate, relevant replies than a BERT pair-matching system, cutting agent search time and likely reducing handling time.

Key finding

RAG responses scored much higher on human-evaluated accuracy than BERT.

Numbers: Accuracy +45.69% (human eval, Table 4)

Use past successful episodes as memory to boost LLM agent planning in text and vision tasks

0.60

0.70

0.50

5

RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.

Key finding

With GPT-3.5, RAP raises ALFWorld success from 52.2% (ReAct baseline) to 85.8%.

Numbers: 52.2% → 85.8% (ALFWorld, Table 1)

Three practical tools for making LLMs more factual in finance: a benchmark, an injection framework, and a retrieval QA system

0.60

5

You can improve finance-specific LLM outputs quickly and cheaply by combining retrieval-based context with compact instruction fine-tuning, giving better factual answers and sourceable outputs without full model re-pretraining.

Key finding

GPT-4 leads on IDEA-FinBench across subjects.

Numbers: CFA-L1 accuracy 84.26%; CPA-SA 62.38%

Bailicai: a medical RAG system that gates retrieval, decomposes tasks with DAGs, and fine-tunes on curated medical data

0.70

5

Bailicai shows you can run an 8B open model locally with curated medical fine-tuning and selective retrieval to match or exceed ChatGPT-3.5 on medical QA, reducing API costs and privacy risk.

Key finding

Bailicai (8B) obtains a 71.82% average accuracy across five medical QA benchmarks.

Numbers: Average = 71.82% (Table V)

Hierarchical Agentic RAG: small LMs + prompt pools to boost forecasting, anomaly detection, and imputation

0.50

0.60

0.50

4

A modular Agentic-RAG can reduce forecasting errors and improve anomaly detection on operational time-series (traffic, industrial telemetry), enabling better planning and faster incident detection while allowing independent updates to sub-modules.

Key finding

Agentic-RAG reduces forecasting error on traffic benchmarks.

Numbers: PEMS-BAY Horizon@3 RMSE 1.62 vs DGCRN 2.69 (Table 4)

Survey of retrieval-augmented language models: architectures, retrievers, enhancements, and benchmarks

0.60

0.30

0.55

4

Retrieval augmentation makes LMs more factual and updatable by combining model memory with external, searchable knowledge, improving performance on knowledge-heavy tasks while enabling incremental updates without full model retraining.

Key finding

There are three high-level ways a retriever and LM interact: sequential single, sequential multiple (iterative), and parallel.

Numbers: 3 interaction modes (Section 2)

Use LLMs to generate extra context that improves specialized entity linking models

0.75

0.60

0.65

4

LLMaEL boosts entity-linking accuracy—especially for rare entities—without costly LLM fine-tuning, so teams can improve downstream QA, search, and recommendation systems with modest LLM usage.

Key finding

Lightweight fine-tuning with LLM-augmented data yields measurable gains over the original EL model.

Numbers: ReFinED avg acc 85.46% → LLMaEL × ReFinEDFT 86.67% (+1.21 pp)

Hybrid RAG that cuts hallucinations and improves multi‑step reasoning via chunking, tables, tool math, and KG

0.60

0.40

2

This practical pipeline trades some answer coverage for far fewer incorrect claims, a trade that is often preferable in QA products where wrong answers are costly.

Key finding

The full system flipped the Task 1 public score from strongly negative to positive.

Numbers: Score: Our 15.8% vs Official RAG -46.6% (Table 1)

Teach retrieval-augmented LMs to read and weigh sources by credibility so outputs stay correct under noisy or outdated retrievals

0.60

0.62

0.50

2

RAG-powered apps break when retrievers return noisy, outdated, or fake content; training models to use simple credibility labels raises accuracy and resilience without discarding documents.

Key finding

CAG-7B raises exact-match (EM) on HotpotQA compared to a LLaMA-2-7B retrieval baseline.

Numbers: HotpotQA EM: LLaMA-2-7B 0.280 -> CAG-7B 0.509 (+0.229)
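One way to picture the credibility labels described in this entry is as a tag prefixed to each retrieved document before it reaches the model. This is a minimal sketch under stated assumptions: the `Doc` type, the numeric score, the 0.7 and 0.4 thresholds, and the three label names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    text: str
    credibility: float  # hypothetical score in [0, 1] from some upstream scorer


def label(score: float) -> str:
    """Map a numeric score to a coarse label; thresholds are assumptions."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


def format_context(docs: List[Doc]) -> str:
    """Keep every document, but tag each one with its credibility label."""
    return "\n".join(f"[credibility: {label(d.credibility)}] {d.text}" for d in docs)


docs = [
    Doc("Official 2024 filing: revenue rose 12%.", 0.9),
    Doc("Forum post: revenue probably fell.", 0.2),
]
print(format_context(docs))
```

Note that nothing is filtered out, which matches the entry's point that accuracy improves without discarding documents.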

Practical recipe and baseline for multilingual RAG across 13 languages

0.70

0.40

0.60

2

Multilingual RAG lets products answer factual questions in many languages by combining strong multilingual retrieval and tuned prompts, expanding reach and reducing wrong answers for non-English users.

Key finding

RAG substantially increases answer recall vs no retrieval on evaluated QA sets.

Numbers: MKQA English recall 58.4 -> 70.2; Arabic 26.4 -> 45.9 (Table 1).

Cut prompt cost by up to ~68% by keeping only query-relevant sentences and lightly compressing the rest

0.60

0.80

2

LeanContext lowers pay-per-use LLM input tokens so small teams can run domain QA faster and cheaper while keeping similar answer quality.

Key finding

Adaptive LeanContext reduces prompt tokens and saves cost with little accuracy loss.

Numbers: ArXiv N=4: prompt tokens 321->521, cost savings 37.29%, ROUGE-1 drop 0.3985->0.3844 (-0.0141)
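The "keep only query-relevant sentences" step in this entry can be sketched with a deliberately simple relevance score. This is not LeanContext's method: word overlap stands in for whatever scorer the paper uses, the choice of `k` is fixed by hand here, and the light compression of the remaining sentences is omitted. The function names are hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a real relevance model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_sentences(context: str, query: str, k: int) -> List[str]:
    """Return the k sentences sharing the most words with the query,
    in their original document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    query_words = tokenize(query)
    # Stable sort: ties keep document order, so earlier sentences win.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: len(query_words & tokenize(sentences[i])),
        reverse=True,
    )
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]


context = (
    "The cache uses LRU eviction. Lunch is at noon. "
    "Eviction runs when the cache is full."
)
kept = select_sentences(context, "How does cache eviction work?", k=2)
print(kept)
```

Restoring document order after ranking keeps the shortened context readable, which matters when the kept sentences refer to each other.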

Picking the right paper sections (not the whole paper) improves LLM leaderboard extraction and cuts hallucinations

0.50

0.60

0.40

1

Feeding models only the right paper sections speeds up extraction, reduces hallucinations, and improves accuracy for leaderboard curation—lowering manual review and infrastructure cost.

Key finding

Targeted short context (DocTAET) yields the best paper-level leaderboard detection and structured-summary scores.

Numbers: Mistral-7B DocTAET: General Accuracy ≈ 89% (few-shot), 95% (zero-shot); ROUGE-1 ≈ 57.2 (few-shot).

ASTRID: three automated, scalable metrics (CF, RA, CR) to evaluate RAG clinical QA

0.70

0.40

0.60

1

ASTRID gives an automated, clinically validated way to detect ungrounded, out-of-scope, or irrelevant answers; this reduces expensive clinician review and speeds safe iterative development of RAG-based clinical agents.

Key finding

Conversational Faithfulness (CF) matches human perceived faithfulness much better than statement-level faithfulness (RF).

Numbers: AUC CF=0.98 vs RF=0.83; Pearson CF vs PF=0.90, RF vs PF=0.57

RAG helps up to ~10–15 context snippets; model and retriever choice strongly shape results

0.60

0.40

0.45

1

When building a RAG product, supplying ~10–15 curated snippets gives the best return: more context adds cost and can add noise. Retrieval quality and reader model must be tuned to your domain.

Key finding

Adding more context boosts QA performance until about 10–15 snippets, then gains stop or reverse.

Numbers: Mixtral BioASQ entailment: 0→10 snippets 29.4%→50.7% (+21.3pp); open-retrieval stalls after 15–20 snippets.
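The practical upshot of this entry, capping the context at roughly 10 to 15 snippets, amounts to a ranked cut-off. The sketch below assumes each snippet already carries a retrieval score; the function name and the default of 10 are illustrative choices and should be tuned per domain, as the entry itself advises.

```python
from typing import List, Tuple


def cap_snippets(scored: List[Tuple[str, float]], max_snippets: int = 10) -> List[str]:
    """Keep the highest-scoring snippets, up to max_snippets.

    `scored` pairs each snippet with a retrieval score (higher is better).
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:max_snippets]]


candidates = [(f"snippet {i}", 1.0 / (i + 1)) for i in range(30)]
context = cap_snippets(candidates, max_snippets=10)
print(len(context), context[0])
```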

ARC-JSD: a fast, training-free JSD method to find which retrieved sentences make a RAG answer

0.70

0.60

0.70

1

ARC-JSD gives a cheap, plug-in way to show which retrieved sentences actually caused an LLM answer, cutting compute costs and reducing hallucinations—useful for product trust, compliance, and debugging.

Key finding

ARC-JSD improves top-1 sentence attribution accuracy versus prior training-free baselines.

Numbers: ≈10.7% average accuracy gain (MuSiQue summary; §4.2, Fig.2)
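A rough picture of JSD-based attribution: compare the model's answer distribution given the full context with the distribution obtained when one sentence is removed, and attribute the answer to the sentence whose removal shifts the distribution most. The ablation setup is an assumption about how the method works, and the probability vectors below are toy values rather than model outputs.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by 1 bit."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def top_attributed(full: Sequence[float], ablated: List[Sequence[float]]) -> int:
    """Index of the sentence whose removal changes the answer distribution most."""
    scores = [jsd(full, a) for a in ablated]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy distributions over three candidate answer tokens (hypothetical values).
full_context = [0.7, 0.2, 0.1]
without_sentence = [
    [0.7, 0.2, 0.1],  # removing sentence 0 changes nothing
    [0.1, 0.2, 0.7],  # removing sentence 1 flips the answer
]
print(top_attributed(full_context, without_sentence))
```

Because each score needs only forward passes with a sentence removed, this style of attribution requires no extra training, which is consistent with the "training-free" framing of the entry.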

HyCE: run validated HPC commands inside RAG so an LLM answers user-specific cluster questions

0.60

0.50

1

HyCE reduces user confusion and support load by letting an LLM provide live, user-specific cluster answers without expensive model fine-tuning.

Key finding

Adding HyCE to a baseline RAG raised the automatic evaluation score.

Numbers: 77.67% → 82.33% (Δ +4.66 pp)

CorpusLM: unify generative retrieval and continuous RAG into one model

0.70

0.65

0.70

1

CorpusLM can replace a heavy index+reader stack with a single model that reduces storage and latency while improving factual retrieval and downstream answers on wiki-like corpora, lowering hosting and inference costs for knowledge-driven products.

Key finding

CorpusLM improves passage retrieval R-Precision on KILT FEVER over a strong dense baseline (MT-DPR).

Numbers: FEVER R-Precision: CorpusLM (T5) 75.64 vs MT-DPR 64.05 (Δ+11.59 pp)

Learn offline 'cheat-sheets' so a 4k LLaMA2 handles 128k tokens, cutting tokens and latency

0.75

0.60

0.80

1

LLoCO cuts token processing and GPU costs for long-document QA while improving accuracy and latency, letting teams serve very long documents without buying larger models or more GPUs.

Key finding

LLoCO raises average QA performance vs base LLaMA2-7B on evaluated long-doc tasks.

Numbers: Avg score 23.44 -> 30.67 (Table 1; +7.23 pts)

Compress long videos into timestamped event graphs so LLMs can answer long-horizon questions cheaply

0.45

0.70

0.72

0

SEG cuts LLM token costs by ~10× for long-video QA while keeping accuracy, letting companies add long-horizon video reasoning without expensive model or GPU scaling.

Key finding

SEG cuts token input by 91.4% on average.

Numbers: Tokens: Full Log 40.39k → TSG 3.47k (91.4% reduction)

ProMem: iterative self-questioning to recover missing facts and cut downstream errors

0.60

0.50

0

Improving what an agent saves (more complete, grounded memories) raises answer quality and reduces long-term error costs; pay once for extraction, benefit many reads.

Key finding

ProMem raises memory integrity on HaluMem to 73.80%, outperforming common summary baselines.

Numbers: Memory Integrity: ProMem 73.80% vs Mem0/Supermemory ~42%