Iterative RAG Papers — Parsed & Scored for Practitioners

Augment ChatGPT with retrieved evidence and automated feedback to cut hallucinations

0.60

0.55

0.45

144

You can keep using a black-box LLM while reducing harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback—improving factuality with modest engineering instead of costly fine-tuning.

Key finding

Retrieving consolidated evidence raises knowledge grounding (KF1) by about 10 points on news dialog.

Numbers: KF1: 26.71 -> 36.41 (ChatGPT -> LLM-AUGMENTER, News Chat, Table 1)
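To make the KF1 number concrete, here is a minimal sketch of a token-overlap F1 between a response and its grounding knowledge. It assumes lowercase whitespace tokenization, which may differ from the paper's exact scoring script; `token_f1` and the example strings are hypothetical.

```python
from collections import Counter


def token_f1(response: str, knowledge: str) -> float:
    """Token-overlap F1 between a response and its grounding knowledge.

    Assumption: lowercase whitespace tokenization. Real evaluation
    scripts usually add punctuation and article normalization.
    """
    resp = response.lower().split()
    know = knowledge.lower().split()
    if not resp or not know:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(resp) & Counter(know)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(know)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: every response token is grounded, but the
# response covers only 4 of the 7 knowledge tokens.
score = token_f1("the match ended 2-1", "the final match ended 2-1 on sunday")

# Gain implied by the reported scores (ChatGPT -> LLM-AUGMENTER).
delta = round(36.41 - 26.71, 2)
```

The reported gain works out to 9.70 points, which the card rounds to "about 10".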

REFEED: refine LLM outputs by retrieving documents about the model's own answers

0.60

0.50

0.75

10

You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.

Key finding

REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.

Numbers: about +6% overall zero-shot improvement (as reported)
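The retrieval-feedback idea can be sketched as a two-pass loop: draft an answer, retrieve with the question plus the draft, then regenerate with the retrieved documents. This is an illustration under stated assumptions, not the paper's implementation: the word-overlap retriever, `toy_generate`, and the corpus are hypothetical stand-ins for a real retriever and LLM.

```python
import re
from typing import Callable, List


def tokens(text: str) -> set:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(corpus, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def refine_with_feedback(
    question: str,
    generate: Callable[[str, List[str]], str],
    corpus: List[str],
    k: int = 2,
) -> str:
    """Draft an answer, retrieve on question + draft, then regenerate."""
    draft = generate(question, [])                      # closed-book first pass
    docs = retrieve(f"{question} {draft}", corpus, k)   # answer-conditioned query
    return generate(question, docs)                     # refined second pass


def toy_generate(question: str, docs: List[str]) -> str:
    """Hypothetical stand-in for an LLM call."""
    for doc in docs:
        if "canberra" in doc.lower():
            return "Canberra"
    return "Sydney"  # plausible but wrong closed-book guess


corpus = [
    "Sydney is the largest city in Australia.",
    "Canberra is the capital of Australia.",
    "Paris is the capital of France.",
]
question = "What is the capital of Australia?"
final_answer = refine_with_feedback(question, toy_generate, corpus)
```

In this toy run the closed-book draft is wrong, but the documents retrieved with the draft included let the second pass correct it.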

Have models write short reading notes per retrieved doc to ignore noise and say “unknown” when needed.

0.60

0.50

9

Chain-of-Note (CON) reduces incorrect answers caused by irrelevant retrieval and helps systems safely abstain on out-of-date or unknown queries, improving reliability in search and customer-facing QA products.

Key finding

CON improves average Exact Match over standard retrieve-then-read models when fine-tuning LLaMA-2 7B.

Numbers: EM +1.97 avg across NQ/TriviaQA/WebQ (Table 2)
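The note-then-answer pattern can be sketched as follows: write one short note per retrieved document, drop the notes marked irrelevant, and abstain when nothing relevant remains. In the paper a fine-tuned model writes the notes; here `toy_write_note` and `toy_answer` are hypothetical rule-based stand-ins so the control flow can run.

```python
from dataclasses import dataclass
from typing import Callable, List

UNKNOWN = "unknown"


@dataclass
class Note:
    doc: str
    relevant: bool
    summary: str


def answer_with_notes(
    question: str,
    docs: List[str],
    write_note: Callable[[str, str], Note],
    answer_from_notes: Callable[[str, List[Note]], str],
) -> str:
    """Write a note per document, then answer only from relevant notes."""
    notes = [write_note(question, doc) for doc in docs]
    relevant = [note for note in notes if note.relevant]
    if not relevant:
        return UNKNOWN  # abstain instead of guessing from noisy retrieval
    return answer_from_notes(question, relevant)


def toy_write_note(question: str, doc: str) -> Note:
    """Hypothetical stand-in: a doc is relevant if it mentions the topic word."""
    topic = question.rstrip("?").split()[-1].lower()
    relevant = topic in doc.lower()
    summary = doc if relevant else "not relevant to the question"
    return Note(doc, relevant, summary)


def toy_answer(question: str, notes: List[Note]) -> str:
    """Hypothetical stand-in: answer with the first relevant note."""
    return notes[0].summary


docs = ["Dune was written by Frank Herbert.", "The Sahara is a desert in Africa."]
grounded = answer_with_notes("Who wrote Dune?", docs, toy_write_note, toy_answer)
abstained = answer_with_notes("Who wrote Dune?", docs[1:], toy_write_note, toy_answer)
```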

Survey: hybrid LLM architectures (RAG, agents, verifiers) for complex question answering

0.70

0.50

0.80

6

For real-world complex Q&A, LLMs must be combined with retrieval, tools, verifiers and human feedback to get accuracy, auditability and privacy—this reduces risk and improves trust but raises cost and latency.

Key finding

Best-practice stacks now couple agentic controllers, retrieval-grounding, and verifier/PRM loops to answer complex questions.

Survey of retrieval-augmented language models: architectures, retrievers, enhancements, and benchmarks

0.60

0.30

0.55

4

Retrieval augmentation makes LMs more factual and updatable by combining model memory with external, searchable knowledge, improving performance on knowledge-heavy tasks while enabling incremental updates without full model retraining.

Key finding

There are three high-level ways a retriever and LM interact: sequential single, sequential multiple (iterative), and parallel.

Numbers: 3 interaction modes (Section 2)

AutoSurvey: use retrieval and parallel LLMs to auto-write long, citation-backed surveys

0.70

0.60

0.80

4

AutoSurvey turns long, costly survey writing into fast, repeatable drafts that are almost human-quality for coverage and relevance, letting teams scan and document literature rapidly and cheaply.

Key finding

AutoSurvey is far faster than humans and naive RAG for long surveys.

Numbers: 64k-token speed: AutoSurvey 73.59 vs human 0.07 and naive RAG 12.56 (surveys/hour)

MTRAG: a human-made benchmark of multi-turn RAG conversations that stresses retrieval, unanswerables, and later-turn context.

0.60

0.50

0.40

2

Multi-turn customer or assistant flows need better retrievers and grounded generators; MTRAG shows current systems miss later-turn context and unanswerables, causing wrong or misleading answers that harm trust.

Key finding

MTRAG contains 110 human-made conversations averaging 7.7 turns, for 842 tasks in total.

Numbers: 110 conversations; 842 tasks; avg 7.7 turns; 16.9 unique passages/conversation

Dual-source iterative retrieval (EHR + corpus) that pulls factual knowledge into RAG to boost medical QA on long clinical notes

0.60

2

RGAR boosts answer accuracy on tasks involving long clinical notes while keeping inference cost lower than heavier iterative RAGs, so teams can get clinically stronger retrieval without scaling model size.

Key finding

RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.

Numbers: Avg accuracy +11.91% over Custom baseline

Automating IEEE BioCompute Object creation from papers using RAG and LLMs

0.60

0.50

0.60

2

Automating BCO creation cuts manual work for documenting legacy bioinformatics workflows and speeds evaluation, handoff, and regulatory review when human verification is applied.

Key finding

RAG plus LLMs can produce domain-specific BCO text from papers and repos.

Practical survey: a five‑phase Query Optimization Lifecycle and taxonomy for LLM-based RAG systems

0.60

0.50

0.70

2

Better queries reduce hallucination and improve downstream answer quality; matching optimization to query types saves API cost and improves customer trust.

Key finding

Query optimization is essential: retrieval quality strongly determines final answer quality in RAG.

InsQABench: a Chinese insurance QA benchmark plus SQL-ReAct and RAG-ReAct methods

0.60

0.70

1

Insurance products are described across short FAQs, structured product databases, and long legal clauses. A unified benchmark plus task-specific pipelines (SQL-ReAct, RAG-ReAct) cut mistakes in automated answers and speed up building customer-facing QA tools.

Key finding

Supervised fine-tuning (LoRA) raises commonsense QA accuracy for GLM4-9B from 64.40 to 70.26.

Numbers: ACC +5.86 (64.40 → 70.26)

Tree-structured TCM knowledge + self-reflective retrieval boosts GPT-4 exam accuracy by ~20 percentage points

0.60

0.50

0.60

1

Structured, hybrid retrieval plus an iterate-and-verify loop can raise domain QA accuracy and expert trust without expensive model retraining, lowering deployment risk for regulated or knowledge-heavy applications.

Key finding

TOSRR with GPT-4 raised accuracy on the TCM Medical Licensing Examination dataset.

Numbers: GPT-4 55.83% -> TOSRR 75.67% (+19.84 pp)

ISARA: iteratively self-align an LLM using retrieval-augmented in-context learning and <100 seed examples

0.60

0.70

1

You can improve model safety and truthfulness in new domains with very small labeled seeds and no extra human rules or reward models, cutting annotation cost and speeding deployment.

Key finding

ISARA can sharply reduce harmful outputs on safety prompts.

Numbers: LLaMA-7B harmful rate on discrimination prompts: 37.6% → 1.2% (pretrained → ISARA)

Fine-tune a small planning LLM on KG‑derived plans to improve retrieval-augmented QA

0.60

0.50

1

You can make cheaper, smaller LLMs better at multi-step, retrieval-based QA by generating plan labels from an existing knowledge graph and fine-tuning a compact planner; this improves answer accuracy without relying on large teacher models.

Key finding

Fine-tuned planner (Llama3-8B) boosts Exact Match on HotpotQA versus ReAct.

Numbers: HotpotQA EM: 0.376 vs ReAct 0.211 (+0.165)
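For readers unfamiliar with the metric, here is a minimal sketch of Exact Match as it is commonly computed for open-domain QA. It assumes SQuAD-style normalization (lowercase, strip punctuation and English articles); the paper's exact script may differ, and the predictions below are hypothetical.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, golds: List[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    return any(normalize(prediction) == normalize(gold) for gold in golds)


def em_score(predictions: List[str], golds: List[List[str]]) -> float:
    """Fraction of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(predictions)


# Hypothetical predictions and gold answers.
predictions = ["The Eiffel Tower", "paris", "1999"]
golds = [["Eiffel Tower"], ["Paris, France", "Paris"], ["2000"]]
em = em_score(predictions, golds)  # two of three match

# Gap between the reported scores (fine-tuned planner vs ReAct).
delta = round(0.376 - 0.211, 3)
```

EM is strict: "1999" versus "2000" scores zero even though it is close, so a gap of this size on EM means many more answers matched exactly.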

Reduce LVLM hallucinations by retrieving targeted image-text pairs only when the model is uncertain

0.60

1

ARA reduces factually incorrect image answers without costly retraining, so products that must avoid visual misinformation (e.g., medical imaging assistants, robotics, visual search) can improve trust with modest engineering work.

Key finding

Active retrieval (ARA) improves object-presence detection on POPE for LLaVA-1.5.

Numbers: Accuracy 86.50% → 89.43% (Random setting, Table 1)
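The "retrieve only when uncertain" idea can be sketched as a gate on the model's answer distribution. This sketch uses Shannon entropy with a fixed threshold as the uncertainty signal, which is an assumption and may not be the measure ARA uses; `toy_probs`, `toy_retrieve`, and the threshold value are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def answer_with_active_retrieval(
    question: str,
    probs_fn: Callable[[str, Optional[str]], Dict[str, float]],
    retrieve_fn: Callable[[str], str],
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Return (answer, retrieved?) and call the retriever only when uncertain."""
    probs = probs_fn(question, None)
    if entropy(list(probs.values())) <= threshold:
        return max(probs, key=probs.get), False   # confident: skip retrieval
    evidence = retrieve_fn(question)
    probs = probs_fn(question, evidence)          # re-answer with evidence
    return max(probs, key=probs.get), True


def toy_probs(question: str, evidence: Optional[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a vision-language model's yes/no probabilities."""
    if "dog" in question:
        return {"yes": 0.97, "no": 0.03}          # confident case
    if evidence is None:
        return {"yes": 0.55, "no": 0.45}          # uncertain without evidence
    return {"yes": 0.10, "no": 0.90}


def toy_retrieve(question: str) -> str:
    """Hypothetical stand-in for fetching a matching image-text pair."""
    return "caption: an empty park bench"


confident = answer_with_active_retrieval("Is there a dog?", toy_probs, toy_retrieve)
uncertain = answer_with_active_retrieval("Is there a cat?", toy_probs, toy_retrieve)
```

The gate is the design point: confident queries pay no retrieval cost, and only uncertain ones trigger the extra lookup.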

Short, guided retrieval loops that unify text, tables and KGs for faster, auditable multi-hop QA

0.60

0.70

0.60

0

RELOOP reduces wasted retrieval work and provides explicit provenance. This yields more accurate multi-step answers across mixed data formats while keeping latency and token/tool costs predictable.

Key finding

RELOOP yields higher QA accuracy than strong baselines across heterogeneous benchmarks.

Numbers: HybridQA acc 66.4 / F1 72.1; TAT-QA acc 75.7 / F1 83.5; HotpotQA acc 56.3 / F1 58.6 (Table 2)

Add causal graphs and what-if checks to RAG to reduce hallucinations and improve causal answers

0.40

0.70

0.30

0

If your product needs trustworthy causal answers (for diagnostics, policy, medical reasoning, or financial analysis), adding causal graphs plus counterfactual checks can cut incorrect causal claims and improve interpretability. Expect higher compute and latency costs.

Key finding

Causal-Counterfactual RAG yields substantially higher precision than Regular RAG on evaluated benchmarks.

Numbers: Precision: 80.57 vs 60.13 (Regular RAG)

Atomic fact-checking for medical RAG LLMs boosts factuality and traceability

0.60

0.50

0

Per-claim fact checking makes medical LLM outputs more trustworthy and traceable, reducing clinical risk and improving compliance with guideline-driven standards while enabling on-prem deployment with smaller models.

Key finding

Atomic fact-checking improved 40% of final answers in the hardest (tumor-board) set.

Numbers: Overall improved answers: validation 20%, test 12%, tumor-board 40%
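Per-claim checking can be sketched as: split the answer into atomic claims, verify each against the retrieved passages, and flag the unsupported ones for revision. The sentence splitter and the word-containment verifier below are hypothetical simplifications (a real system would use an LLM or entailment model for both steps), and the example text is made up for illustration.

```python
import re
from typing import Callable, List, Tuple


def split_claims(answer: str) -> List[str]:
    """Naive claim splitter: one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", answer)
    return [part.strip() for part in parts if part.strip()]


def check_claims(
    answer: str,
    evidence: List[str],
    is_supported: Callable[[str, str], bool],
) -> List[Tuple[str, bool]]:
    """Pair each claim with whether any evidence passage supports it."""
    return [
        (claim, any(is_supported(claim, passage) for passage in evidence))
        for claim in split_claims(answer)
    ]


def toy_is_supported(claim: str, passage: str) -> bool:
    """Hypothetical verifier: every claim word must appear in the passage."""
    claim_words = set(re.findall(r"\w+", claim.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return claim_words <= passage_words


# Made-up example text, not taken from any guideline.
answer = "Drug A is given daily. Drug A cures all tumors."
evidence = ["Guideline: drug A is given once daily with food."]
results = check_claims(answer, evidence, toy_is_supported)
unsupported = [claim for claim, ok in results if not ok]
```

Keeping the claim-to-passage pairing is what provides traceability: each retained statement can point back to the passage that supports it.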

PIKE-RAG: make RAG work on industrial, domain-specific queries using 'atomic' knowledge and rationale-aware decomposition

0.70

0.60

0

PIKE-RAG turns heterogeneous, domain-specific documents into a structured KB and iteratively reasons with atomized facts; this reduces incorrect answers in legal, medical, and engineering QA and speeds production deployment of RAG-powered tools.

Key finding

PIKE-RAG improves multi-hop QA accuracy over baselines on HotpotQA.

Numbers: Accuracy 87.6% (PIKE-RAG) vs 82.6% (Naive RAG w/ R)

Use RAG + rewindable alignment (RAIN / MultiRAIN) to make privacy Q&A answers more precise and readable

0.25

0.60

0.40

0

Automated privacy Q&A can be made measurably more accurate and readable by adding alignment modules, but current methods are not yet human-level and are costly to run in real time.

Key finding

Alignment modules (RAIN or MultiRAIN) improved results over a Vanilla RAG baseline on most evaluation metrics.

Numbers: 18/21 metrics favored alignment-enabled systems

bRAGgen: a self-updating RAG system that pulls real-time medical evidence for bariatric surgery Q&A

0.70

0.60

0

bRAGgen reduces the risk of outdated or inaccurate patient guidance by auto-fetching and integrating authoritative evidence, improving answer quality while keeping runtime latency low.

Key finding

bRAGgen with Llama3-8B achieved the highest expert average score across factuality, clinical relevance, and comprehensiveness.

Numbers: Expert avg 4.51 (bRAGgen Llama3-8B) vs 4.05 (best baseline MedGraphRAG); Δ+0.46

Learned router that decides when and where to fetch facts from multiple KBs during stepwise multimodal reasoning

0.60

0.70

0.60

0

Adaptive routing reduces unnecessary retrievals and raises answer quality for mixed text/image/table queries, cutting retrieval cost and improving accuracy for knowledge-heavy apps.

Key finding

R1-Router raises average F1-Recall across evaluated QA benchmarks.

Numbers: Avg F1-Recall 55.93 vs 48.29 (IterRetGen), +7.64 pts

RAGLAB — an open, modular toolkit to reproduce, compare and develop RAG algorithms fairly

0.70

0.60

0

RAGLAB speeds RAG development and fair benchmarking, helping teams pick the right RAG variant and avoid wasted engineering time when reproducing papers.

Key finding

Self-RAG outperforms other reproduced RAG algorithms when paired with a 70B fine-tuned generator.

Numbers: PopQA ACC 48.8 (Self-RAG adaptive) vs 39.6 (Naive RAG) on Llama3-70B

Retrieve similar in-project proofs at each step to boost automated Coq proof synthesis

0.60

0.70

0

Rango automates more Coq proofs than prior tools, lowering manual proof effort and increasing coverage for projects that use Coq, which can reduce verification costs and time-to-audit.

Key finding

Rango proves a larger share of benchmark theorems than prior tools.

Numbers: 3,325 / 10,396 = 32.0% theorems proven on CoqStoq benchmark