Retrieval Augmented Generation (RAG) Papers — Parsed & Scored for Practitioners

Practical survey of RAG: paradigms, core components, benchmarks, and engineering gaps

0.70

0.30

0.60

612

RAG lets you keep LLMs current and auditable by fetching external facts at inference time; this reduces hallucinations and speeds updates without retraining the base model.

Key finding

Surveyed RAG work covers a broad task and dataset space.

Numbers: 26 tasks; ~50 datasets

Augment ChatGPT with retrieved evidence and automated feedback to cut hallucinations

0.60

0.55

0.45

144

You can keep using a black-box LLM while reducing harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback. This improves factuality with modest engineering instead of costly fine-tuning.

Key finding

Retrieving consolidated evidence raises Knowledge F1 (KF1), a measure of knowledge grounding, by about 10 points on news dialog.

Numbers: KF1: 26.71 -> 36.41 (ChatGPT -> LLM-AUGMENTER, News Chat, Table 1)

RGB: a bilingual benchmark diagnosing how LLMs fail when using retrieved evidence

0.40

0.30

0.25

52

RAG can improve factuality, but retrieved noise and false facts cause wrong outputs and missed refusals, risking user trust and legal/brand exposure in production.

Key finding

Adding noisy retrieved documents lowers answer accuracy for all tested LLMs.

Numbers: ChatGPT accuracy 96.33% → 76.00% (noise ratio 0→0.8)
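
To make the noise ratio above concrete, here is a minimal sketch of how a retrieved context with a fixed share of noisy documents could be assembled. It is not the benchmark's own construction code: the function name `build_context`, its parameters, and the sample documents are all assumptions for illustration.

```python
import random
from typing import List


def build_context(relevant: List[str], noisy: List[str],
                  noise_ratio: float, total: int, seed: int = 0) -> List[str]:
    """Return `total` documents, of which round(noise_ratio * total) are noise."""
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be between 0 and 1")
    n_noise = round(noise_ratio * total)
    n_relevant = total - n_noise
    if n_noise > len(noisy) or n_relevant > len(relevant):
        raise ValueError("not enough documents to build the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    docs = rng.sample(noisy, n_noise) + rng.sample(relevant, n_relevant)
    rng.shuffle(docs)  # avoid always placing noise first
    return docs


# Hypothetical documents: at noise ratio 0.8, four of five are noise.
relevant_docs = ["relevant-1", "relevant-2"]
noisy_docs = ["noise-1", "noise-2", "noise-3", "noise-4", "noise-5"]
context = build_context(relevant_docs, noisy_docs, noise_ratio=0.8, total=5)
```

Sweeping `noise_ratio` from 0 to 0.8 while holding `total` fixed gives the kind of robustness curve the numbers above summarize.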

Practical survey of what makes LLMs factual, how we test it, and how to fix it

0.60

0.40

0.60

52

LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advice, legal, medical, or financial workflows.

Key finding

Off-the-shelf LLMs often have low factual precision on long-form biographical text.

Numbers: FActScore range 42%–71% for commercial LLMs on biographies
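
As a rough illustration of what a factual-precision score measures: FActScore is the fraction of atomic facts in a generation that a knowledge source supports. The sketch below assumes the facts have already been extracted and judged; the function name `factual_precision` and the sample judgements are hypothetical, and the extraction and verification steps are not shown.

```python
from typing import List


def factual_precision(supported: List[bool]) -> float:
    """Share of atomic facts judged as supported by the knowledge source."""
    if not supported:
        raise ValueError("no atomic facts to score")
    return sum(supported) / len(supported)


# Hypothetical judgements for five atomic facts from one biography.
judgements = [True, True, False, True, False]
score = factual_precision(judgements)  # 3 supported out of 5
```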

PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

0.70

0.55

0.80

51

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Key finding

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

Find and fix contradictions in an LLM's own text without web lookups

0.70

0.60

46

Automate contradiction checks to catch internal hallucinations that retrieval misses, improving trust in long-form outputs and answers at modest extra cost.

Key finding

Self-contradictions are common in open-domain generations.

Numbers: 17.7% of sentences for ChatGPT (MainTestSet)

Open-source toolkit, benchmark, and a retrieval-augmented LLM prover for Lean theorems that needs only one GPU-week

0.60

0.65

0.70

38

LeanDojo lowers the entry cost for ML research on formal proofs: open data and code let teams reproduce and iterate on provers with a single GPU-week instead of thousands of GPU-days.

Key finding

Retrieval improves end-to-end proving rates.

Numbers: ReProver Pass@1 51.2% vs non-retrieval baseline 47.6% (random split)

RoG: Ground LLM plans on knowledge‑graph relation paths for faithful, interpretable KGQA

0.60

0.50

38

RoG reduces hallucinations by grounding LLM reasoning in KG facts and provides traceable, human-readable paths. This improves accuracy and trust on KG-backed QA without retraining every LLM.

Key finding

RoG sets new best scores on standard KGQA benchmarks.

Numbers: WebQSP Hits@1 85.7; F1 70.8. CWQ Hits@1 62.6; F1 56.2.

Top legal AI tools still hallucinate: 17–33% of answers are false or misleading

0.45

0.25

0.30

32

Major legal AI products still produce false or misleading legal claims often enough that lawyers must verify outputs, which affects liability, trust, and the realized efficiency gains.

Key finding

Lexis+ AI provided accurate (correct + grounded) answers for 65% of queries.

Numbers: 65% accuracy (Figure 4; Section 6.1)

Survey: five practical ways LLMs are used to plan agent behavior

0.40

0.60

29

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.

Key finding

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 -> Reflexion 0.71; EX($): 152.18 -> 220.17 (Table 2).

A practical map of how knowledge graphs and multimodal AI fit together today and where to push next

0.60

0.50

0.60

28

Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.

Key finding

The survey covers more than 300 related papers.

Numbers: ‘over 300 articles’ (abstract)

Fine-tune a Chinese 13B LLM with legal syllogism data plus retrieval to build a practical legal assistant and benchmark

0.50

24

Fine-tuning a mid-size Chinese LLM with focused legal instruction data and a small retrieval KB yields measurable gains in legal QA and advice; this reduces manual review and makes legal tools more practical.

Key finding

The authors built a large, law-specific supervised fine-tuning (SFT) dataset for training.

Numbers: DISC-Law-SFT total size 403K samples

Survey: using graph structure to make RAG more precise, concise, and context-aware

0.40

0.60

22

GraphRAG injects relational facts into LLM outputs, reducing hallucination and shortening input prompts; this improves accuracy for QA, search, and domain workflows while leveraging existing graph databases.

Key finding

GraphRAG workflow decomposes into three repeatable stages: Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation.

Numbers: 3 stages
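
The three stages named above can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not any surveyed system's implementation: the triples, the function names, the substring-based entity matching, and the prompt layout are all hypothetical, and the final LLM call is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def index_graph(triples: List[Triple]) -> Dict[str, List[Triple]]:
    """Stage 1, graph-based indexing: map each entity to its triples."""
    index: Dict[str, List[Triple]] = defaultdict(list)
    for head, relation, tail in triples:
        index[head.lower()].append((head, relation, tail))
        index[tail.lower()].append((head, relation, tail))
    return index


def retrieve(index: Dict[str, List[Triple]], query: str) -> List[Triple]:
    """Stage 2, graph-guided retrieval: triples of entities named in the query."""
    query_lower = query.lower()
    found: List[Triple] = []
    for entity, entity_triples in index.items():
        if entity in query_lower:  # naive substring match stands in for entity linking
            for triple in entity_triples:
                if triple not in found:
                    found.append(triple)
    return found


def build_prompt(query: str, facts: List[Triple]) -> str:
    """Stage 3, graph-enhanced generation: put the facts in front of the question."""
    lines = [f"{h} --{r}--> {t}" for h, r, t in facts]
    return "Facts:\n" + "\n".join(lines) + f"\nQuestion: {query}\nAnswer:"


triples = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Paris", "capital_of", "France"),
]
question = "Where was Marie Curie born?"
index = index_graph(triples)
facts = retrieve(index, question)
prompt = build_prompt(question, facts)  # would be sent to an LLM of your choice
```

Only the matching triple reaches the prompt, which is how graph retrieval can shorten inputs compared with passing whole documents.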

Use RAG + PCST to let LLMs 'chat' with very large textual graphs

0.60

0.70

22

If you need natural-language queries over large text-rich graphs, G-Retriever scales to huge graphs, speeds training and inference dramatically, and reduces wrong citations by returning the exact subgraph used to answer.

Key finding

G-Retriever lifts WebQSP Hit@1 from 57.05% (GraphToken) to 70.49% with frozen LLM prompt tuning and to 73.79% with LoRA tuning.

Numbers: WebQSP: GraphToken 57.05% → G-Retriever 70.49% → G-Retriever+LoRA 73.79%

Hierarchical ReAct agents ground LLMs to Materials Project data and run language-driven simulations with near-zero hallucination

0.70

0.55

0.60

21

Grounding LLMs to authoritative databases and tools reduces dangerous hallucinations and lets teams automate reproducible workflows (data fetch → simulation → analysis) without model fine-tuning, cutting verification time and accelerating materials R&D.

Key finding

LLaMP reduces bulk-modulus prediction error compared to web-augmented GPT-4 and other baselines.

Numbers: Bulk modulus MAE = 14.57 GPa (LLaMP) vs ~41 GPa (GPT-4/GPT-4+Serp) on evaluated set

Using a targeted RAG pipeline and curated CMU dataset to reduce LLM hallucinations on domain queries

0.30

0.40

0.50

19

Connecting an LLM to a curated domain knowledge base (RAG) gives measurable factual gains and is a practical first step before costly generator fine-tuning.

Key finding

Adding RAG boosts retrieval and answer quality over the baseline LLM.

Numbers: Recall 0.361 -> 0.409; F1 0.186 -> 0.289
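
For readers unfamiliar with answer-level recall and F1, the sketch below shows one common way to compute them from token overlap between a predicted and a reference answer. It is an assumption that the paper scores answers this way; the function name `token_scores` and the example strings are illustrative only.

```python
from collections import Counter
from typing import Tuple


def token_scores(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from whitespace-token overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of three predicted tokens match; both reference tokens are covered.
precision, recall, f1 = token_scores("the cat sat", "the cat")
```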

LLM4Vuln + UniVul: separate an LLM's reasoning from retrieved knowledge, context, and prompts to measure real vulnerability-detection skill

0.60

0.70

0.50

17

LLM4Vuln helps teams know whether an LLM truly reasons about vulnerabilities or just repeats retrieved knowledge; this prevents wasted engineering on useless retrievals and guides model+tool choices for auditing code and triage.

Key finding

Knowledge retrieval helps foundation models on logic-heavy Solidity but not uniformly elsewhere.

Numbers: F1 for traditional foundation models on Solidity nearly doubled on average with knowledge

Survey: Can knowledge graphs reduce hallucinations in large language models?

0.60

0.50

0.70

16

Adding knowledge graphs to LLMs can cut factual errors quickly, especially for small models and domain tasks, improving trustworthiness without full model retraining.

Key finding

KG-augmented retrieval can dramatically improve QA correctness for small models.

Numbers: reported >80% answer correctness gain on QA (Baek et al.; Sen et al.; Wu et al.)

Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

0.60

0.50

16

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Key finding

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)

ReWOO separates planning from fetching evidence to cut repeated prompt tokens and run smaller models

0.70

0.60

0.80

15

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Key finding

ReWOO reduces token use on HotpotQA by about 5× compared to ReAct, an observation-dependent augmented language model (ALM).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)
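
The idea of planning once and then running tools without re-prompting the model can be sketched as below. This is a loose illustration, not ReWOO's code: the step format, the `#E1`-style placeholders as used here, the toy tools, and the hand-written plan (which a planner LLM would normally produce) are assumptions.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy tools; real ones would call a search API, a calculator, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
    "upper": lambda s: s.upper(),
}

# Each step is (evidence_id, tool_name, argument). Arguments may refer to
# earlier evidence through placeholders such as "#E1".
Plan = List[Tuple[str, str, str]]


def execute_plan(plan: Plan) -> Dict[str, str]:
    """Run every step in order, substituting earlier evidence into arguments."""
    evidence: Dict[str, str] = {}
    for evidence_id, tool_name, argument in plan:
        resolved = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], argument)
        evidence[evidence_id] = TOOLS[tool_name](resolved)
    return evidence


# The plan is fixed up front, so no model call happens between tool calls.
plan: Plan = [
    ("#E1", "lookup", "capital of France"),
    ("#E2", "upper", "#E1"),
]
evidence = execute_plan(plan)
```

Because tool outputs are filled in by plain substitution, the prompt and earlier observations are not resent at every step, which is where the token savings reported above would come from.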

A public dataset and baseline results showing RAG struggles on multi-hop questions that need evidence from multiple documents

0.40

0.60

0.30

13

Multi-hop questions (e.g., cross-document finance or product research) are common and current RAG systems often miss key evidence; improving retrieval and reranking yields bigger gains than swapping LLMs alone.

Key finding

Dataset size and mix: 2,556 multi-hop queries drawn from 609 news articles.

Numbers: 2,556 queries; 609 articles; avg 2,046 tokens/article

Practical survey of retrieval-augmented generation (RAG): how retrievers, fusion methods, training and benchmarks fit together

0.80

0.25

0.65

13

RAG lets you add up-to-date, domain-specific facts without costly model retraining and reduces hallucinations by grounding outputs in external knowledge.

Key finding

Latent fusion (pretrained retrieval-enhanced models) can match much larger LLMs by using retrieval databases.

Numbers: RETRO: 2T-token DB, performance comparable to GPT-3 with ~25x fewer parameters

KG-Agent: a tool-augmented autonomous 7B LLM that reasons step-by-step over knowledge graphs

0.60

0.65

0.70

12

You can get KG-backed, multi-hop reasoning without expensive closed LLM APIs by fine-tuning a 7B open model on ~10K program-like instructions, cutting cost and improving cross-domain use of external KGs.

Key finding

Instruction-tuned KG-Agent (LLaMA2-7B) improves KGQA F1 over prior baselines on in-domain tests.

Numbers: F1 gains: WebQSP +1.7%, CWQ +7.5%, GrailQA +2.7% (Sec 5.2, Table 2)

FINANCEBENCH: 10,231 open-book financial QA cases to stress-test LLMs

0.30

0.40

0.25

11

Out-of-the-box LLMs often fail on firm-specific financial questions. Firms must validate retrieval, prompt order, and verification steps before trusting outputs in decisions.

Key finding

FINANCEBENCH contains 10,231 curated QA-evidence triplets.

Numbers: 10,231 cases; 360 documents; 40 companies