Overview
PaperQA uses explicit retrieval, LLM relevance scoring, and iterative tool calls to find and synthesize evidence, improving factual grounding and reducing citation hallucinations compared to plain LLM outputs.
Citations: 51
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 55%
Why It Matters For Business
PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.
Summary TLDR
PaperQA is an agent-style Retrieval-Augmented Generation (RAG) system that finds full-text papers, summarizes ranked passages, and composes answers with citations. The authors introduce LitQA, a 50-question biomedical benchmark built from post-2021 papers so that answering requires retrieval. PaperQA outperforms baseline LLMs and commercial tools on LitQA (69.5% vs 66.8% for human experts), scores 86.3% on a closed-book PubMedQA split (vs 57.9% for GPT-4), and produces no citation hallucinations among the 237 citations evaluated (0%). The system is built from modular tools (search, gather evidence, answer), uses LLM relevance scoring and map-reduce summarization, and runs on LangChain with GPT-4/GPT-3.5. Key caveats: LitQA is small,
Problem Statement
Out-of-date or hallucinating LLMs make scientific QA unsafe or useless. We need systems that 1) retrieve up-to-date full-text papers, 2) select and summarize relevant passages, and 3) give answers with reliable citations so researchers can trust and verify results.
Main Contribution
PaperQA: an agentic RAG pipeline that decomposes retrieval into search, gather-evidence (map), and answer (reduce) tools and lets an LLM orchestrate iterative calls.
LitQA: a 50-question biomedical benchmark drawn from post-2021 full-text papers that requires retrieval and synthesis.
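The split into search, gather-evidence (map), and answer (reduce) tools can be sketched as follows. This is a minimal illustration, not PaperQA's implementation: the toy corpus, the keyword-overlap `score_relevance` function (a stand-in for the LLM relevance score), and every function name are hypothetical. A real system would call a paper search API and an LLM at the marked points.

```python
from dataclasses import dataclass

# Hypothetical in-memory corpus standing in for full-text papers.
CORPUS = {
    "smith2022": "Protein X binds receptor Y in liver cells. Binding is blocked by drug Z.",
    "lee2023": "Drug Z reduces tumor growth in mice. No effect was seen on receptor Y expression.",
    "chen2021": "Soil bacteria fix nitrogen under low oxygen conditions.",
}


@dataclass
class Evidence:
    citation: str  # key of the source paper
    summary: str   # passage summary (here: the passage itself, no LLM call)
    score: int     # relevance score; an LLM would assign this in a real system


def _words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation removed."""
    return {w.strip(".,?").lower() for w in text.split()}


def search(query: str) -> list[str]:
    """Search tool: keys of papers that share at least one word with the query."""
    q = _words(query)
    return [key for key, text in CORPUS.items() if q & _words(text)]


def score_relevance(question: str, passage: str) -> int:
    """Stand-in for the LLM relevance score: number of shared words."""
    return len(_words(question) & _words(passage))


def gather_evidence(question: str, paper_keys: list[str], top_k: int = 3) -> list[Evidence]:
    """Map step: score every passage and keep the top_k most relevant ones."""
    found = []
    for key in paper_keys:
        for passage in CORPUS[key].split(". "):
            score = score_relevance(question, passage)
            if score > 0:
                found.append(Evidence(key, passage.rstrip("."), score))
    return sorted(found, key=lambda e: e.score, reverse=True)[:top_k]


def answer(question: str, evidence: list[Evidence]) -> str:
    """Reduce step: compose an answer in which every statement carries a citation."""
    if not evidence:
        return "I cannot answer."
    return " ".join(f"{e.summary} ({e.citation})." for e in evidence)


def run_pipeline(question: str) -> str:
    """Single fixed pass; in the agentic setting an LLM decides which tool to call next."""
    keys = search(question)
    evidence = gather_evidence(question, keys)
    return answer(question, evidence)


print(run_pipeline("Does drug Z bind receptor Y?"))
```

The fixed call order in `run_pipeline` is the simplest possible orchestration. Letting an LLM choose between re-searching with new keywords, gathering more evidence, or answering is what makes the pipeline agentic.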
Key Findings
PaperQA achieves 69.5% accuracy on LitQA, slightly above the 66.8% scored by human experts.
On a closed-book PubMedQA test, PaperQA scores 86.3% vs GPT-4's 57.9%.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 69.5% (PaperQA) | 66.8% (Human experts) | +2.7 pp | LitQA (50 Qs) | Table 2 reports PaperQA 69.5% and Human 66.8% on LitQA | Table 2 |
| Accuracy | 86.3% (PaperQA) | 57.9% (GPT-4) | +28.4 pp | PubMedQA blind (100 sampled Qs) | Table 5 shows PaperQA 86.3% vs GPT-4 57.9% | Table 5 |
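
The Delta column is a difference in percentage points, not a relative change. A short check of that arithmetic, using only the values from the table (the variable names are illustrative):

```python
# (system score, baseline score) in percent, copied from the results table.
results = {
    "LitQA": (69.5, 66.8),           # PaperQA vs human experts
    "PubMedQA blind": (86.3, 57.9),  # PaperQA vs GPT-4
}

# Percentage-point delta: system minus baseline, rounded to one decimal place.
deltas = {
    name: round(system - baseline, 1)
    for name, (system, baseline) in results.items()
}

for name, delta in deltas.items():
    print(f"{name}: +{delta} pp")
```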
What To Try In 7 Days
Prototype a three-tool agent (search, gather, answer) using LangChain and an embedding model for a focused literature subfield.
Create 20–50 retrieval-based test questions from recent papers and measure accuracy and citation validity.
Add an LLM-based chunk relevance scorer and map-reduce summarization; compare citation hallucination vs plain LLM output.
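Measuring accuracy and citation validity needs a small scoring harness. Below is a minimal sketch, assuming each system answer is recorded as a chosen option plus a list of cited source keys. The `Record` structure, its field names, and the sample data are hypothetical and not taken from the paper. A citation counts as hallucinated here when its key is not among the known sources; whether a real citation actually supports the claim still needs manual review.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    question_id: str
    predicted: str  # option the system chose
    gold: str       # correct option
    citations: list[str] = field(default_factory=list)


def accuracy(records: list[Record]) -> float:
    """Fraction of questions answered correctly."""
    return sum(r.predicted == r.gold for r in records) / len(records)


def hallucination_rate(records: list[Record], known_sources: set[str]) -> float:
    """Fraction of all citations that point to no known source."""
    cited = [c for r in records for c in r.citations]
    if not cited:
        return 0.0
    return sum(c not in known_sources for c in cited) / len(cited)


# Made-up sample data for illustration only.
known = {"smith2022", "lee2023"}
records = [
    Record("q1", "A", "A", ["smith2022"]),
    Record("q2", "B", "C", ["lee2023", "doe2020"]),  # doe2020 is not a known source
    Record("q3", "D", "D", []),
    Record("q4", "A", "A", ["smith2022"]),
]

print(f"accuracy: {accuracy(records):.2f}")
print(f"citation hallucination rate: {hallucination_rate(records, known):.2f}")
```

Running the same harness on plain LLM output and on the agent's output gives the side-by-side comparison the third step asks for.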
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Assumes underlying papers are correct; wrong papers lead to wrong answers.
LitQA is small (50 Qs) and focused on biomedical literature, limiting generality.
When Not To Use
When the needed information is in textbooks rather than papers (e.g., exam-style knowledge).
When reliable full-text access to domain papers is unavailable.
Failure Modes
PDF parsing errors yielding garbled chunks, mitigated but not eliminated by summarization.
Citing secondary sources mentioned in a primary source when the secondary is inaccessible.
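Garbled chunks can also be screened out before they reach the summarizer. This is a different mitigation from the summarization step noted above and is not described in the source. It is a rough heuristic sketch, and the function name and the 0.7 threshold are assumptions chosen for illustration.

```python
def looks_garbled(chunk: str, min_clean_ratio: float = 0.7) -> bool:
    """Flag a chunk when too few of its characters are letters, digits, or whitespace."""
    if not chunk.strip():
        return True  # empty or whitespace-only chunks carry no evidence
    clean = sum(ch.isalnum() or ch.isspace() for ch in chunk)
    return clean / len(chunk) < min_clean_ratio


# Made-up chunks: one readable sentence and one run of parser debris.
chunks = [
    "Protein X binds receptor Y in liver cells.",
    "#%^ @@ ~~ |||| {}{} ;;",
]
usable = [c for c in chunks if not looks_garbled(c)]
print(usable)
```

A character-ratio check like this misses garbling that still looks like words (for example, merged columns), so it complements rather than replaces downstream relevance scoring.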

