Agentic RAG Papers — Parsed & Scored for Practitioners

PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

0.70

0.55

0.80

51

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Key finding

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

Survey: five practical ways LLMs are used to plan agent behavior

0.40

0.60

29

LLM-driven planning can automate complex multi-step tasks, but higher success usually requires more model calls and tokens, so balance accuracy needs with token cost and latency.

Key finding

Spending more tokens (more generated ‘thinking’) tends to raise success.

Numbers: ALFWorld SR: ReAct 0.57 → Reflexion 0.71; EX($): 152.18 → 220.17 (Table 2).
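The trade-off in these numbers can be made concrete with a little arithmetic. The sketch below takes the two rows quoted above (ReAct and Reflexion on ALFWorld), reads EX($) as expense in dollars, and computes the gain in success rate, the relative increase in expense, and the extra expense per percentage point of success. The helper name `marginal_cost` is hypothetical, and the inputs are only the figures listed here, not a re-derivation from the survey.

```python
def marginal_cost(sr_base, sr_new, cost_base, cost_new):
    """Return (success gain in points, relative cost increase, extra cost per point)."""
    gain_pp = (sr_new - sr_base) * 100      # success-rate gain in percentage points
    extra_cost = cost_new - cost_base       # additional expense in dollars
    rel_cost = extra_cost / cost_base       # relative cost increase
    return gain_pp, rel_cost, extra_cost / gain_pp


# ReAct -> Reflexion, figures as quoted in the Numbers line above.
gain_pp, rel_cost, cost_per_pp = marginal_cost(0.57, 0.71, 152.18, 220.17)
print(f"+{gain_pp:.0f} pp success, +{rel_cost:.1%} cost, ${cost_per_pp:.2f} per point")
```

On these figures, a 14-point gain in success rate comes with roughly 45% higher expense, which is the kind of ratio to weigh against a product's accuracy requirements.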

Hierarchical ReAct agents ground LLMs to Materials Project data and run language-driven simulations with near-zero hallucination

0.70

0.55

0.60

21

Grounding LLMs to authoritative databases and tools reduces dangerous hallucinations and lets teams automate reproducible workflows (data fetch → simulation → analysis) without model fine-tuning, cutting verification time and accelerating materials R&D.

Key finding

LLaMP reduces bulk-modulus prediction error compared to web-augmented GPT-4 and other baselines.

Numbers: Bulk modulus MAE = 14.57 GPa (LLaMP) vs ~41 GPa (GPT-4/GPT-4+Serp) on evaluated set

AgentPoison: a stealthy backdoor that poisons agent memories or RAG to hijack LLM agents

0.70

9

If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records to trigger dangerous actions while leaving overall accuracy unchanged, creating a low-noise safety and legal risk.

Key finding

AgentPoison forces retrieval of poisoned demonstrations with high probability.

Numbers: Average ASR-r ≈ 81.2% (retrieval success)

A-MEM: LLM agents that build and evolve a Zettelkasten-style linked memory

0.70

0.60

0.80

6

A-MEM cuts token usage and inference cost by ~85–93% per memory operation while improving multi-session reasoning, making long-term conversational agents cheaper to run at scale and more capable.

Key finding

A-MEM improves DialSim QA F1 over baselines.

Numbers: DialSim F1: A-MEM 3.45 vs LoCoMo 2.55 (+35%) vs MemGPT 1.18 (+192%)
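The percentages in parentheses are relative gains over each baseline, which is easy to confuse with an absolute difference in F1. A minimal check, using only the three scores listed above (the function name `relative_gain` is hypothetical):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100


# DialSim F1 scores as quoted in the Numbers line above.
a_mem, locomo, memgpt = 3.45, 2.55, 1.18
print(f"vs LoCoMo: +{relative_gain(a_mem, locomo):.0f}%")
print(f"vs MemGPT: +{relative_gain(a_mem, memgpt):.0f}%")
```

The absolute gaps are 0.90 and 2.27 F1 points, so the large percentages partly reflect how low the baseline scores are.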

Use past successful episodes as memory to boost LLM agent planning in text and vision tasks

0.60

0.70

0.50

5

RAP turns past successful runs into reusable context that raises accuracy for multi-step text and embodied agents, reducing trial-and-error and speeding up deployment in web automation and robotic workflows.

Key finding

With GPT-3.5, RAP raises ALFWorld success from 52.2% (ReAct) to 85.8%.

Numbers: 52.2% → 85.8% (ALFWorld, Table 1)
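The core idea, retrieving past successful episodes that resemble the current task and placing them in the prompt, can be sketched in a few lines. This is an illustrative toy, not RAP's implementation: the `Episode` class, the `retrieve` function, the sample tasks, and the word-overlap (Jaccard) similarity are all assumptions made for the example. A real system would more likely rank episodes by embedding similarity.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str        # natural-language task description
    actions: list    # action sequence taken in the episode
    success: bool    # whether the episode solved the task


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two task strings (stand-in for embeddings)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def retrieve(memory, task, k=1):
    """Return the k most similar successful episodes for use as in-context examples."""
    successes = [e for e in memory if e.success]
    return sorted(successes, key=lambda e: jaccard(e.task, task), reverse=True)[:k]


memory = [
    Episode("put a clean mug on the shelf", ["find mug", "clean mug", "place mug"], True),
    Episode("heat an egg in the microwave", ["find egg", "heat egg"], True),
    Episode("put a clean plate on the shelf", ["find plate"], False),  # failed run
]
best = retrieve(memory, "put a clean plate on the shelf")
print(best[0].task)
```

Filtering on `success` before ranking is the design choice that matters here: the failed episode matches the query word for word, but only runs that worked are offered to the agent as examples.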

Hierarchical Agentic RAG: small LMs + prompt pools to boost forecasting, anomaly detection, and imputation

0.50

0.60

0.50

4

A modular Agentic-RAG can reduce forecasting errors and improve anomaly detection on operational time-series (traffic, industrial telemetry), enabling better planning and faster incident detection while allowing independent updates to sub-modules.

Key finding

Agentic-RAG reduces forecasting error on traffic benchmarks.

Numbers: PEMS-BAY Horizon@3 RMSE 1.62 vs DGCRN 2.69 (Table 4)

Plug small, specialized LMs ('knowledge cards') into black‑box LLMs to add updatable, domain knowledge

0.60

0.70

4

You can update or patch a deployed black-box LLM by adding small domain models instead of retraining a giant model, cutting the cost and latency of knowledge updates.

Key finding

Knowledge Card improves a black-box LLM (Codex) on general knowledge QA (MMLU).

Numbers: MMLU overall accuracy: Codex → Knowledge Card (top-down exp) +6.6%

Plan LLM 'thoughts' with MCTS to answer private medical records safely

0.60

0.75

0.60

3

RATP lets organizations use private patient records at inference time without training LLMs on the data, improving accuracy and traceability while avoiding training-time privacy leakage and large retraining costs.

Key finding

RATP (MCTS + model estimator) greatly improves QA accuracy on private EMRs compared to standard RAG.

Numbers: emrQA exact-match: RAG 24% → RATP 71% (+47 pp)

Chemistry foundation models power structure-focused multimodal RAG inside hierarchical multi-agent workflows

0.60

3

Structure-aware embeddings let search and agents find chemical analogs and spectra faster, cutting researcher time for design and analysis and enabling automated, multimodal retrieval inside lab-facing agent workflows.

Key finding

MoLFormer embeddings retrieve structurally close small-molecule analogs even when fingerprint metrics disagree.

Numbers: 2.5M small-molecule collection; cosine similarity up to 1.00 for identical hits

Make long-step reasoning models ask the web when they’re unsure and inject concise, refined facts back into the chain

0.70

0.60

3

Search-o1 lets deployed reasoning models fetch and condense live web facts as they reason, raising accuracy on complex, multi-step queries while cutting noise from raw documents.

Key finding

Search-o1 improves average accuracy across five complex reasoning datasets compared to agentic-RAG and direct reasoning.

Numbers: avg +4.7% vs RAgent-QwQ-32B; +3.1% vs QwQ-32B

Use retrieved introspective examples + conformal calibration so robots ask for clarification only when tasks are truly ambiguous

0.60

0.70

0.50

3

IntroPlan reduces unnecessary user queries and unsafe actions by aligning model uncertainty to task ambiguity; it improves action precision and safety while using modest extra compute and a small curated knowledge base.

Key finding

Direct IntroPlan (no conformal) yields much more precise prediction sets on Safe Mobile Manipulation with GPT-4.

Numbers: Success Rate 96.5%, Exact Set Rate 93.0% (Table 1)
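For readers unfamiliar with the conformal step mentioned in this entry, the sketch below shows generic split conformal prediction over multiple-choice options: calibrate a threshold on held-out examples, then act only when exactly one option survives it. It is not IntroPlan's procedure; the function names, the calibration confidences, and the example options are hypothetical.

```python
import math


def conformal_threshold(cal_true_probs, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (1 - prob of the true option)."""
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # conformal rank, 1-indexed
    return scores[min(rank, n) - 1]           # clamp when the rank exceeds n


def prediction_set(option_probs, q_hat):
    """Keep every option whose nonconformity score is within the threshold."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= q_hat]


# Hypothetical calibration data: model confidence assigned to the correct option.
cal = [0.9, 0.8, 0.55, 0.6, 0.95, 0.7, 0.4, 0.9, 0.35, 0.8]
q_hat = conformal_threshold(cal, alpha=0.2)

clear = prediction_set({"pick up the cup": 0.9, "pick up the knife": 0.1}, q_hat)
ambiguous = prediction_set({"pick up the red cup": 0.5, "pick up the blue cup": 0.5}, q_hat)
for name, s in [("clear", clear), ("ambiguous", ambiguous)]:
    # A singleton set means the model is confident enough to act without asking.
    print(name, "-> act" if len(s) == 1 else "-> ask for clarification", s)
```

A smaller `alpha` raises the threshold and enlarges the sets, so the robot asks more often; the "Exact Set Rate" quoted above rewards sets that contain exactly the valid options and nothing more.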

MEDCO: a multi-agent copilot that trains medical students via patient, radiologist, and expert role-play

0.30

0.60

0.40

2

MEDCO shows multi-role simulation plus a small retrieval memory can lift weaker models to near-strong performance in simulated medical training, which can differentiate education products and reduce the instructor time required.

Key finding

MEDCO raises average expert-rated diagnostic score (HDE) of a weak student (GPT-3.5).

Numbers: HDE avg: 1.965 → 2.169 (knowledge) → 2.299 (peer discussion)

Survey: how to run reasoning-capable LLMs and autonomous agents on memory- and power-limited edge devices

0.70

0.60

0.80

2

Edge deployment cuts latency, reduces data exposure, and lowers bandwidth costs; cross-layer co-design (model+runtime+hardware) is required to preserve multi-step reasoning while meeting product SLAs.

Key finding

Low-bit quantization yields large memory compression but needs careful validation for reasoning tasks.

Numbers: 4–8× compression from 8-/4-bit quantization (surveyed reports)
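The 4–8× range is consistent with measuring against a 32-bit floating-point baseline. That baseline is an assumption here, since the entry does not state it; against FP16 the same bit widths give 2–4×. The arithmetic, ignoring overhead such as quantization scales and zero points, with a hypothetical 7B-parameter model for scale:

```python
def compression_ratio(baseline_bits: int, quant_bits: int) -> float:
    """Ideal weight-memory compression, ignoring scale/zero-point overhead."""
    return baseline_bits / quant_bits


def weight_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits / 8 / 1e9


for bits in (8, 4):
    print(f"{bits}-bit: {compression_ratio(32, bits):.0f}x vs FP32, "
          f"{compression_ratio(16, bits):.0f}x vs FP16, "
          f"7B params = {weight_memory_gb(7e9, bits):.1f} GB")
```

These figures cover weights only; activations and the KV cache add to the real footprint on an edge device.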

Synthesize agent–environment trajectories and rewrite tasks (backward construction) to adapt LLM agents without human labels

0.70

0.60

0.70

2

You can adapt LLM agents to specific apps without costly human labels; synthesizing and indexing environment-specific interactions boosts accuracy and reduces run-time planning costs.

Key finding

In-context learning (ICL) with synthesized data improves Claude-3.5-sonnet on OSWorld from 12.4 to 22.5.

Numbers: 12.4 → 22.5 (OSWorld, Claude ICL)

Interpret chest X‑ray reports by combining concept bottlenecks with a multi‑agent retrieval system

0.40

0.60

0.50

2

The paper shows a practical path to explainable automated chest X-ray (CXR) outputs: you can keep high accuracy while surfacing concept evidence and improve report quality with a multi-agent retrieval pipeline.

Key finding

The concept bottleneck model (CBM) reaches 81% classification accuracy on COVID-QU.

Numbers: 81% accuracy on COVID-QU (Table 1)

Practical survey: a five‑phase Query Optimization Lifecycle and taxonomy for LLM-based RAG systems

0.60

0.50

0.70

2

Better queries reduce hallucination and improve downstream answer quality; matching optimization to query types saves API cost and improves customer trust.

Key finding

Query optimization is essential: retrieval quality strongly determines final answer quality in RAG.

Agentic flows create 25M synthetic instruction pairs to teach skills and boost a 7B model across many benchmarks

0.60

0.70

0.60

2

Agentic flows automate creation of large, diverse instruction data from raw web/code sources, enabling faster model skill updates without manual prompt engineering or heavy labeling.

Key finding

AgentInstruct produced roughly 25.8 million instruction–response pairs used for post-training.

Numbers: ≈25.8M paired instructions (22M agentic + 3.8M external)

Use an LLM-based agent with memory, world knowledge, and a graph tool to improve zero-shot next-location prediction

0.50

0.70

0.30

1

AgentMove improves zero-shot location ranking across cities without retraining a local model, so companies can prototype personalized recommendations or prefetching where labeled local mobility data is scarce.

Key finding

AgentMove wins most metrics versus baselines in zero-shot tests.

Numbers: Best results in 8 of 12 metrics; improvements range 3.33%–8.57%

Weighted RAG plus LLaMA self-evaluation to speed and improve enterprise troubleshooting

0.60

0.50

0.60

1

A weighted RAG plus self-evaluation can cut misdiagnoses and speed resolution on large enterprise knowledge bases, improving service SLAs and reducing human time-to-fix.

Key finding

Weighted RAG plus self-evaluation achieves higher troubleshooting accuracy than baselines.

Numbers: Accuracy: 90.8% (proposed) vs 85.2% (standard RAG) vs 76.1% (BM25)

DepsRAG: an agentic RAG system that builds dependency knowledge graphs and uses a critic loop to improve dependency reasoning

0.40

0.50

1

DepsRAG automates dependency analysis and vulnerability lookup, cutting manual checks that delay library approvals and enabling faster, evidence-backed decisions.

Key finding

Adding a Critic-Agent raised answer precision from 13.3% to 40% on evaluated tasks.

Numbers: 13.3% → 40% precision (ten iterations, three tasks)

How large AI models and agentic systems can power intelligent 6G networks

0.50

0.70

1

Agentic large AI models (LAMs) can automate network planning, resource scheduling, and incident response in 6G, reducing manual operations work and speeding time-to-service while requiring investment in data, compute, and evaluation.

Key finding

Large AI models have already reached very large parameter counts and broad capabilities.

Numbers: GPT-3 ~175B parameters (cited)

Treat RAG modules as cooperative RL agents (MAPPO) to raise final-answer F1

0.60

0.70

0.50

1

Jointly training retrieval pipeline modules toward the final answer quality raises factuality and gives consistent F1 gains, with no extra inference cost, making it attractive when accuracy matters more than extra training compute.

Key finding

Joint multi-agent optimization improves final-answer F1 over strong baselines on three QA benchmarks.

Numbers: HotpotQA F1 +1.80; 2Wiki F1 +1.89; AmbigQA F1 +2.67 (Table 1).

VERDICT: unify diversification and verification to produce grounded clarifications in RAG

0.70

0.60

0.70

1

VERDICT cuts repeated search calls and produces clarifications you can cite, improving enterprise search accuracy and user trust while lowering retrieval cost.

Key finding

VERDICT yields large average grounded-F1 gains vs strong baselines.

Numbers: avg +23% G-F1 across backbone LLMs