PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

December 8, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

PaperQA uses explicit retrieval, LLM relevance scoring, and iterative tool calls to find and synthesize evidence, improving factual grounding and reducing citation hallucinations compared to plain LLM outputs.

Citations: 51

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 55%

Authors

Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White


Why It Matters For Business

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Summary TLDR

PaperQA is an agent-style Retrieval-Augmented Generation (RAG) system that finds full-text papers, summarizes ranked passages, and composes answers with citations. The authors introduce LitQA, a 50-question biomedical benchmark crafted from post-2021 papers to force retrieval. PaperQA outperforms baseline LLMs and commercial tools on LitQA (69.5% vs 66.8% for human experts), scores 86.3% on a closed-book PubMedQA split (vs 57.9% for GPT-4), and produced no citation hallucinations among the 237 citations evaluated (0%). The system is built from modular tools (search, gather evidence, answer), uses LLM relevance scoring and map-reduce summarization, and runs on LangChain with GPT-4 and GPT-3.5. Key caveats: LitQA is small (50 questions) and limited to biomedical literature.

Problem Statement

Out-of-date or hallucinating LLMs make scientific QA unsafe or useless. We need systems that 1) retrieve up-to-date full-text papers, 2) select and summarize relevant passages, and 3) give answers with reliable citations so researchers can trust and verify results.

Main Contribution

PaperQA: an agentic RAG pipeline that decomposes retrieval into search, gather-evidence (map), and answer (reduce) tools and lets an LLM orchestrate iterative calls.

LitQA: a 50-question biomedical benchmark drawn from post-2021 full-text papers that requires retrieval and synthesis.
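The search, gather-evidence, and answer decomposition described above can be sketched as a small loop over shared state. This is a minimal illustration, not the authors' implementation: the names `AgentState`, `run_agent`, `toy_summarize`, `toy_compose`, and `CORPUS` are hypothetical, keyword matching stands in for real paper search, and a fixed list of keyword rounds stands in for the LLM that would choose the next tool call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Evidence:
    source: str   # citation key of the paper the text came from
    summary: str  # summary of the text with respect to the question
    score: int    # relevance score (higher is more relevant)


@dataclass
class AgentState:
    question: str
    papers: Dict[str, str] = field(default_factory=dict)   # citation key -> text
    evidence: List[Evidence] = field(default_factory=list)


def search(state: AgentState, corpus: Dict[str, str], keywords: List[str]) -> None:
    """Search tool: add every paper whose text mentions any keyword."""
    for key, text in corpus.items():
        if any(k.lower() in text.lower() for k in keywords):
            state.papers[key] = text


def gather_evidence(state: AgentState, summarize: Callable[[str, str], Tuple[str, int]]) -> None:
    """Map step: summarize and score each collected paper against the question."""
    state.evidence = [
        Evidence(key, *summarize(state.question, text))
        for key, text in state.papers.items()
    ]
    state.evidence.sort(key=lambda e: e.score, reverse=True)


def answer_question(state: AgentState, compose, top_k: int = 3, min_score: int = 1) -> Optional[str]:
    """Reduce step: compose an answer from the best evidence, or None if there is none."""
    best = [e for e in state.evidence[:top_k] if e.score >= min_score]
    return compose(state.question, best) if best else None


def run_agent(question, corpus, keyword_rounds, summarize, compose) -> str:
    """Iterate search -> gather -> answer until an answer is produced."""
    state = AgentState(question)
    for keywords in keyword_rounds:  # assumption: an LLM would pick these calls
        search(state, corpus, keywords)
        gather_evidence(state, summarize)
        answer = answer_question(state, compose)
        if answer is not None:
            return answer
    return "I cannot answer."


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def toy_summarize(question: str, text: str) -> Tuple[str, int]:
    words = {w.strip("?.,").lower() for w in question.split()}
    score = sum(1 for w in words if w and w in text.lower())
    return text.split(".")[0], score


def toy_compose(question: str, evidence: List[Evidence]) -> str:
    return " ".join(f"{e.summary} ({e.source})." for e in evidence)


CORPUS = {
    "Smith2022": "CRISPR base editing corrects a point mutation. Further details follow.",
    "Lee2023": "Protein folding prediction with deep learning. Further details follow.",
}

if __name__ == "__main__":
    print(run_agent("Does CRISPR editing work?", CORPUS,
                    [["zebrafish"], ["crispr"]], toy_summarize, toy_compose))
```

In this sketch the first keyword round finds nothing, so the loop searches again before answering; that retry behaviour is the part an LLM orchestrator would control in a real agent.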

Key Findings

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

Practical Use: A retrieval-first agent can match or exceed expert-level performance on literature-based multiple-choice questions; consider using an agentic RAG for researcher triage and literature QA.

Evidence Ref: Table 2

On a closed-book PubMedQA test, PaperQA scores 86.3% vs GPT-4's 57.9%.

Numbers: PaperQA 86.3% vs GPT-4 57.9% (PubMedQA blind, Table 5)

Practical Use: Adding full-text retrieval and evidence synthesis dramatically improves domain QA when the LLM lacks updated parametric knowledge; plug a RAG agent into domain workflows for factual questions.

Evidence Ref: Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 69.5% (PaperQA) | 66.8% (Human experts) | +2.7 pp | LitQA (50 Qs) | Table 2 reports PaperQA 69.5% and Human 66.8% on LitQA | Table 2
Accuracy | 86.3% (PaperQA) | 57.9% (GPT-4) | +28.4 pp | PubMedQA blind (100 sampled Qs) | Table 5 shows PaperQA 86.3% vs GPT-4 57.9% | Table 5
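The deltas in the results above are percentage-point differences (value minus baseline), not relative improvements. A short check of that arithmetic, using only the figures reported here (the helper name `delta_pp` is just illustrative):

```python
# Reported accuracies in percent: (dataset, PaperQA value, baseline).
RESULTS = [
    ("LitQA", 69.5, 66.8),            # baseline: human experts
    ("PubMedQA blind", 86.3, 57.9),   # baseline: GPT-4
]


def delta_pp(value: float, baseline: float) -> float:
    """Percentage-point difference, rounded to one decimal place."""
    return round(value - baseline, 1)


if __name__ == "__main__":
    for name, value, baseline in RESULTS:
        print(f"{name}: {delta_pp(value, baseline):+.1f} pp")
```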

What To Try In 7 Days

Prototype a three-tool agent (search, gather, answer) using LangChain and an embedding model for a focused literature subfield.

Create 20–50 retrieval-based test questions from recent papers and measure accuracy and citation validity.

Add an LLM-based chunk relevance scorer and map-reduce summarization; compare citation hallucination vs plain LLM output.
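For the measurement steps above, a small scoring harness might look like the following. It is a sketch under stated assumptions: the `Prediction` record and function names are hypothetical, precision is taken as correct answers divided by questions the system chose to answer (the "correct/sure" metric listed under Metrics), and a citation counts as hallucinated when its key is absent from a set of known sources. Real citation checks usually need manual or bibliographic verification rather than a set lookup.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Prediction:
    gold: str                  # correct option, e.g. "B"
    predicted: Optional[str]   # chosen option, or None if the system declined
    citations: List[str]       # citation keys referenced in the answer


def accuracy(preds: List[Prediction]) -> float:
    """Correct answers over all questions (declined answers count as wrong)."""
    return sum(p.predicted == p.gold for p in preds) / len(preds)


def precision(preds: List[Prediction]) -> float:
    """Correct answers over questions the system actually answered."""
    sure = [p for p in preds if p.predicted is not None]
    if not sure:
        return 0.0
    return sum(p.predicted == p.gold for p in sure) / len(sure)


def citation_hallucination_rate(preds: List[Prediction], known_sources: Set[str]) -> float:
    """Share of cited keys that do not correspond to any known source."""
    cited = [c for p in preds for c in p.citations]
    if not cited:
        return 0.0
    return sum(c not in known_sources for c in cited) / len(cited)


# Made-up example data, for illustration only.
KNOWN = {"Smith2022", "Lee2023"}
PREDS = [
    Prediction("A", "A", ["Smith2022"]),
    Prediction("B", "C", ["Lee2023", "Fake2024"]),
    Prediction("C", None, []),
    Prediction("D", "D", ["Smith2022"]),
]

if __name__ == "__main__":
    print("accuracy:", accuracy(PREDS))
    print("precision:", precision(PREDS))
    print("citation hallucination rate:", citation_hallucination_rate(PREDS, KNOWN))
```

Running the same harness on plain LLM output and on the RAG agent gives a like-for-like comparison of accuracy and citation validity.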

Agent Features

Memory
retrieval memory (vector DB of 4,000-char chunks)
Planning
iterative retrieval/search, map-reduce summarization
Tool Use
search, gather_evidence, answer_question
Frameworks
LangChain, OpenAI LLMs
Is Agentic
Yes
Architectures
LLM agent orchestrator, tool-based modular pipeline
Collaboration
single-agent (multiple LLM instances)

Optimization Features

Token Efficiency
chunking and map-reduce to limit context tokens
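A minimal sketch of how chunking and map-reduce can bound the context passed to the answering model. The 4,000-character chunk size comes from the Memory entry above; the `overlap` parameter, the `top_k` cutoff, and the function names are assumptions for illustration, and `score_and_summarize` stands in for an LLM call that returns a summary and a relevance score for one chunk.

```python
from typing import Callable, List, Tuple


def chunk_text(text: str, size: int = 4000, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character chunks, optionally overlapping."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def map_reduce(
    question: str,
    chunks: List[str],
    score_and_summarize: Callable[[str, str], Tuple[str, int]],
    top_k: int = 5,
) -> str:
    """Map: summarize and score every chunk. Reduce: keep only the top_k summaries."""
    mapped = [score_and_summarize(question, chunk) for chunk in chunks]
    top = sorted(mapped, key=lambda m: m[1], reverse=True)[:top_k]
    # Only relevant summaries reach the final prompt, not the raw chunks.
    return "\n".join(summary for summary, score in top if score > 0)


# Toy scorer (hypothetical): counts occurrences of one keyword.
def toy_scorer(question: str, chunk: str) -> Tuple[str, int]:
    return chunk[:20], chunk.count("tumor")


if __name__ == "__main__":
    pieces = chunk_text("x" * 10_000)
    print([len(p) for p in pieces])
    print(map_reduce("q", ["aaa tumor", "bbb", "tumor tumor ccc"], toy_scorer, top_k=2))
```

The token saving comes from the reduce step: the final prompt grows with `top_k` short summaries rather than with the number or length of retrieved papers.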

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Assumes underlying papers are correct; wrong papers lead to wrong answers.

LitQA is small (50 Qs) and focused on biomedical literature, limiting generality.

When Not To Use

If the needed information is in textbooks rather than papers (e.g., exam-style knowledge).

When reliable full-text access to domain papers is unavailable.

Failure Modes

PDF parsing errors yielding garbled chunks, mitigated but not eliminated by summarization.

Citing secondary sources mentioned in a primary source when the secondary is inaccessible.

Core Entities

Models

GPT-4, GPT-3.5-turbo, Claude-2, AutoGPT

Metrics

Accuracy, precision (correct/sure), hallucination rate, retrieval AUC, Cramér's V (categorical correlation)
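Of the metrics listed, Cramér's V is the least familiar. It measures association between two categorical variables from a contingency table as V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 (no association) to 1 (perfect association). The sketch below uses the standard uncorrected formula; which variables the authors correlate is not stated in this summary, so the example tables are made up.

```python
import math
from typing import List


def cramers_v(table: List[List[float]]) -> float:
    """Cramér's V for an r x c contingency table of counts (no bias correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if len(row_totals) < 2 or len(col_totals) < 2:
        raise ValueError("table must be at least 2 x 2")
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs at least one observation")

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    print(cramers_v([[10, 0], [0, 10]]))  # perfectly associated categories
    print(cramers_v([[5, 5], [5, 5]]))    # independent categories
```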

Datasets

LitQA, PubMedQA, MedQA, BioASQ

Benchmarks

LitQA, PubMedQA blind