PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

Overview

Decision Snapshot: Ready For Pilot

PaperQA uses explicit retrieval, LLM relevance scoring, and iterative tool calls to find and synthesize evidence, improving factual grounding and reducing citation hallucinations compared to plain LLM outputs.

Citations: 51

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 55%

Authors

Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White

Links

Abstract / PDF

Why It Matters For Business

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

PaperQA is an agent-style Retrieval-Augmented Generation (RAG) system that finds full-text papers, summarizes ranked passages, and composes answers with citations. The authors introduce LitQA, a 50-question biomedical benchmark crafted from post-2021 papers to force retrieval. PaperQA outperforms baseline LLMs and commercial tools on LitQA (69.5% vs human 66.8%), gets 86.3% on a closed-book PubMedQA split (vs GPT-4 57.9%), and produced zero evaluated citation hallucinations (0% on 237 citations). The system is built from modular tools (search, gather evidence, answer), uses LLM relevance scoring, map-reduce summarization, and runs on LangChain with GPT-4/GPT-3.5. Key caveats: LitQA is small,

Problem Statement

Out-of-date or hallucinating LLMs make scientific QA unsafe or useless. We need systems that 1) retrieve up-to-date full-text papers, 2) select and summarize relevant passages, and 3) give answers with reliable citations so researchers can trust and verify results.

Main Contribution

PaperQA: an agentic RAG pipeline that decomposes retrieval into search, gather-evidence (map), and answer (reduce) tools and lets an LLM orchestrate iterative calls.

LitQA: a 50-question biomedical benchmark drawn from post-2021 full-text papers that requires retrieval and synthesis.
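The search, gather-evidence (map), and answer (reduce) decomposition can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy `CORPUS`, the `AgentState` class, and the fixed policy in `run_agent` are all assumptions, and the fixed policy stands in for the LLM that orchestrates tool calls in PaperQA.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    question: str
    papers: dict = field(default_factory=dict)    # citation key -> full text
    evidence: list = field(default_factory=list)  # (citation key, summary) pairs


# Toy stand-in for a literature search backend (hypothetical data).
CORPUS = {
    "doe2023": "Gene X knockout reduces tumor growth in mice.",
    "roe2022": "Protein Y binds receptor Z with high affinity.",
}


def search(state, keyword):
    """Search tool: add every paper whose text mentions the keyword."""
    for key, text in CORPUS.items():
        if keyword.lower() in text.lower():
            state.papers[key] = text


def gather_evidence(state):
    """Map step: reduce each retrieved paper to a short, citable summary.

    A real system would ask an LLM to summarize and score each chunk;
    here the 'summary' is simply the first sentence (an assumption).
    """
    seen = {key for key, _ in state.evidence}
    for key, text in state.papers.items():
        if key not in seen:
            state.evidence.append((key, text.split(".")[0] + "."))


def answer_question(state):
    """Reduce step: compose an answer that cites every piece of evidence."""
    if not state.evidence:
        return "I cannot answer."
    return " ".join(f"{summary} ({key})" for key, summary in state.evidence)


def run_agent(question, keywords, max_steps=6):
    """Fixed policy standing in for the LLM orchestrator: search with each
    keyword, gather evidence after each search, then answer."""
    state = AgentState(question)
    for keyword in keywords[:max_steps]:
        search(state, keyword)
        gather_evidence(state)
    return answer_question(state)


print(run_agent("Does knocking out gene X affect tumors?", ["gene X", "tumor"]))
```

Keeping the three tools as separate functions over a shared state means the orchestrator can repeat a search with new keywords when the gathered evidence is too thin, which is the iterative behavior the summary describes.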

Key Findings

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

Practical Use: A retrieval-first agent can match or exceed expert-level performance on literature-based multiple-choice questions; consider using an agentic RAG for researcher triage and literature QA.

Evidence Ref: Table 2

On a closed-book PubMedQA test, PaperQA scores 86.3% vs GPT-4's 57.9%.

Numbers: PaperQA 86.3% vs GPT-4 57.9% (PubMedQA blind, Table 5)

Practical Use: Adding full-text retrieval and evidence synthesis dramatically improves domain QA when the LLM lacks updated parametric knowledge; plug a RAG agent into domain workflows for factual questions.

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	69.5% (PaperQA)	66.8% (Human experts)	+2.7 pp	LitQA (50 Qs)	Table 2 reports PaperQA 69.5% and Human 66.8% on LitQA	Table 2
Accuracy	86.3% (PaperQA)	57.9% (GPT-4)	+28.4 pp	PubMedQA blind (100 sampled Qs)	Table 5 shows PaperQA 86.3% vs GPT-4 57.9%	Table 5

What To Try In 7 Days

Prototype a three-tool agent (search, gather, answer) using LangChain and an embedding model for a focused literature subfield.

Create 20–50 retrieval-based test questions from recent papers and measure accuracy and citation validity.

Add an LLM-based chunk relevance scorer and map-reduce summarization; compare citation hallucination vs plain LLM output.
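For the measurement step, a small harness along these lines may be enough to start with. It is a sketch under stated assumptions: `evaluate`, the four placeholder questions, and the canned outputs are hypothetical, and a citation counts as hallucinated here simply when its key is missing from a set of known sources.

```python
def evaluate(system, questions, known_sources):
    """Return (accuracy, citation_hallucination_rate) for a QA system.

    `system` maps a question string to (chosen_option, list_of_citation_keys).
    """
    if not questions:
        raise ValueError("questions must not be empty")
    correct = cited = invalid = 0
    for item in questions:
        choice, citations = system(item["question"])
        correct += int(choice == item["answer"])
        cited += len(citations)
        invalid += sum(1 for key in citations if key not in known_sources)
    accuracy = correct / len(questions)
    hallucination_rate = invalid / cited if cited else 0.0
    return accuracy, hallucination_rate


# Hypothetical test set and canned system outputs, for illustration only.
QUESTIONS = [
    {"question": "Q1", "answer": "A"},
    {"question": "Q2", "answer": "B"},
    {"question": "Q3", "answer": "C"},
    {"question": "Q4", "answer": "D"},
]
KNOWN_SOURCES = {"doe2023", "roe2022"}
CANNED = {
    "Q1": ("A", ["doe2023"]),
    "Q2": ("B", ["roe2022"]),
    "Q3": ("A", ["madeup2024"]),  # wrong answer with a fabricated citation
    "Q4": ("D", ["doe2023"]),
}

accuracy, hallucination_rate = evaluate(CANNED.get, QUESTIONS, KNOWN_SOURCES)
print(f"accuracy={accuracy:.2f} hallucination_rate={hallucination_rate:.2f}")
```

Running the same harness against a plain LLM and against the retrieval agent gives the side-by-side comparison suggested above; swapping `CANNED.get` for a real system call is the only change needed.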

Agent Features

Memory

retrieval memory (vector DB of 4,000-char chunks)
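A minimal sketch of the chunking step that would feed such a vector store follows. The 4,000-character size comes from the summary above; the `overlap` parameter and its default of 200 are assumptions added for illustration, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, size=4000, overlap=200):
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of them.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks


# Synthetic 10,000-character document, for illustration only.
document = "".join(str(i % 10) for i in range(10000))
chunks = chunk_text(document)
print([len(c) for c in chunks])
```

Each chunk would then be embedded and stored with its source citation so that retrieved passages can be traced back to a paper.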

Planning

iterative retrieval/search, map-reduce summarization

Tool Use

search, gather_evidence, answer_question

Frameworks

LangChain, OpenAI LLMs

Is Agentic

Yes

Architectures

LLM agent orchestrator, tool-based modular pipeline

Collaboration

single-agent (multiple LLM instances)

Optimization Features

Token Efficiency

chunking and map-reduce to limit context tokens
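One way to picture how scoring plus map-reduce limits context size: score and summarize each chunk independently (map), then pass only the top-scoring summaries that fit a budget to the final answer step (reduce). In the sketch below the word-overlap scorer is a stand-in for the LLM relevance call, and `top_k`, `budget_chars`, and both function names are assumptions.

```python
def score_and_summarize(question, chunk):
    """Stand-in for the per-chunk LLM call: returns (relevance 0-10, summary)."""
    q_words = {w.strip("?.,").lower() for w in question.split()}
    c_words = {w.strip("?.,").lower() for w in chunk.split()}
    overlap = len(q_words & c_words)
    score = min(10, round(10 * overlap / max(1, len(q_words))))
    return score, chunk[:80]  # a real summary would be LLM-written


def map_reduce_context(question, chunks, top_k=2, budget_chars=200):
    """Map: score every chunk. Reduce: keep the best summaries within budget."""
    scored = [score_and_summarize(question, chunk) for chunk in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    context, used = [], 0
    for score, summary in ranked:
        if score == 0 or used + len(summary) > budget_chars:
            continue  # skip irrelevant or over-budget summaries
        context.append(summary)
        used += len(summary)
    return "\n".join(context)


# Hypothetical chunks, for illustration only.
CHUNKS = [
    "Gene X knockout reduces tumor growth.",
    "Weather was sunny.",
    "Tumor growth depends on gene X dosage.",
]
context = map_reduce_context("Does gene X affect tumor growth?", CHUNKS)
print(context)
```

Because only short summaries reach the answer prompt, the token cost of the final call grows with `top_k` rather than with the number or size of retrieved chunks.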

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Assumes underlying papers are correct; wrong papers lead to wrong answers.

LitQA is small (50 Qs) and focused on biomedical literature, limiting generality.

When Not To Use

If the needed information is in textbooks rather than papers (e.g., exam-style knowledge).

When reliable full-text access to domain papers is unavailable.

Failure Modes

PDF parsing errors yielding garbled chunks, mitigated but not eliminated by summarization.

Citing secondary sources mentioned in a primary source when the secondary is inaccessible.

Core Entities

Models

GPT-4, GPT-3.5-turbo, Claude-2, AutoGPT

Metrics

Accuracy, precision (correct-sure), hallucination rate, retrieval AUC, Cramer's V (categorical correlation)

Datasets

LitQA, PubMedQA, MedQA, BioASQ

Benchmarks

LitQA, PubMedQA blind
