Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.8
Citation Count
51
Why It Matters For Business
PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.
Summary TLDR
PaperQA is an agent-style Retrieval-Augmented Generation (RAG) system that finds full-text papers, summarizes ranked passages, and composes answers with citations. The authors introduce LitQA, a 50-question biomedical benchmark crafted from post-2021 papers to force retrieval. PaperQA outperforms baseline LLMs and commercial tools on LitQA (69.5% vs human 66.8%), gets 86.3% on a closed-book PubMedQA split (vs GPT-4 57.9%), and produced zero evaluated citation hallucinations (0% on 237 citations). The system is built from modular tools (search, gather evidence, answer), uses LLM relevance scoring, map-reduce summarization, and runs on LangChain with GPT-4/GPT-3.5. Key caveats: LitQA is small,
Problem Statement
Out-of-date or hallucinating LLMs make scientific QA unsafe or useless. We need systems that 1) retrieve up-to-date full-text papers, 2) select and summarize relevant passages, and 3) give answers with reliable citations so researchers can trust and verify results.
Main Contribution
PaperQA: an agentic RAG pipeline that decomposes retrieval into search, gather-evidence (map), and answer (reduce) tools and lets an LLM orchestrate iterative calls.
LitQA: a 50-question biomedical benchmark drawn from post-2021 full-text papers that requires retrieval and synthesis.
Empirical evaluation showing PaperQA outperforms pre-trained LLMs and commercial literature tools on LitQA and improves closed-book PubMedQA performance.
Analysis showing near-zero citation hallucinations, ablations that identify key components, and cost/time estimates showing cost-efficiency vs humans.
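The search / gather-evidence / answer decomposition described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `State` class, the keyword-matching `search`, and `rule_policy` (a fixed rule standing in for the LLM that chooses the next tool) are all hypothetical, and `summarize` and `compose` are placeholders for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class State:
    question: str
    papers: List[str] = field(default_factory=list)    # full texts found so far
    evidence: List[str] = field(default_factory=list)  # one summary per paper


def search(state: State, corpus: Dict[str, str]) -> None:
    """Tool 1: toy keyword search standing in for a real paper search API."""
    terms = set(state.question.lower().split())
    for text in corpus.values():
        if terms & set(text.lower().split()) and text not in state.papers:
            state.papers.append(text)


def gather_evidence(state: State, summarize: Callable[[str, str], str]) -> None:
    """Tool 2 (map): summarize each paper with respect to the question."""
    state.evidence = [summarize(state.question, p) for p in state.papers]


def answer_question(state: State, compose: Callable[[str, List[str]], str]) -> str:
    """Tool 3 (reduce): compose one answer from the gathered summaries."""
    return compose(state.question, state.evidence)


def rule_policy(state: State) -> str:
    """Hypothetical stand-in for the LLM that picks the next tool."""
    if not state.papers:
        return "search"
    if not state.evidence:
        return "gather_evidence"
    return "answer_question"


def run_agent(question, corpus, summarize, compose,
              policy=rule_policy, max_steps=6):
    """Call tools until the policy asks for an answer or steps run out."""
    state = State(question)
    for _ in range(max_steps):
        tool = policy(state)
        if tool == "search":
            search(state, corpus)
        elif tool == "gather_evidence":
            gather_evidence(state, summarize)
        else:
            break
    return answer_question(state, compose)
```

In a real system the policy would be an LLM that can repeat the search or evidence step when the evidence looks thin; the fixed rule here only shows the control flow.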
Key Findings
PaperQA achieves 69.5% accuracy on LitQA, slightly above the 66.8% scored by human experts.
On a closed-book PubMedQA test, PaperQA scores 86.3% vs GPT-4's 57.9%.
In citation audits, PaperQA showed 0% hallucinated citations across the 237 citations evaluated.
Costs and speed: PaperQA averaged ~$0.18 per question and completed the LitQA run in ~2.4 hours, comparable to humans given 2.5 hours.
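As a rough check on these figures, the per-question cost and run time can be scaled to the whole benchmark. The sketch below assumes the ~2.4 hour run covers all 50 LitQA questions processed one after another; the summary does not say whether questions ran in parallel.

```python
# Figures taken from the summary above; the sequential-run reading is an assumption.
N_QUESTIONS = 50              # size of LitQA
COST_PER_QUESTION_USD = 0.18  # average API LLM cost per question
RUN_HOURS = 2.4               # wall-clock time for the full LitQA run

total_cost = N_QUESTIONS * COST_PER_QUESTION_USD
minutes_per_question = RUN_HOURS * 60 / N_QUESTIONS

print(f"total cost: about ${total_cost:.2f}")
print(f"time per question: about {minutes_per_question:.1f} min")
```

Under that assumption the full benchmark costs roughly $9 in API calls and takes a little under three minutes per question.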
Results
Accuracy (LitQA): 69.5%
Accuracy (closed-book PubMedQA): 86.3%
Citation hallucination rate: 0% (237 citations evaluated)
Cost per question (API LLM costs): ~$0.18
Who Should Care
What To Try In 7 Days
Prototype a three-tool agent (search, gather, answer) using LangChain and an embedding model for a focused literature subfield.
Create 20–50 retrieval-based test questions from recent papers and measure accuracy and citation validity.
Add an LLM-based chunk relevance scorer and map-reduce summarization; compare citation hallucination vs plain LLM output.
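For the measurement step, a small scoring harness is enough to get started. The sketch below is one possible setup, not the paper's protocol: it assumes answers can be compared by exact match against a gold label, and it treats a citation as hallucinated when its key is not in a set of documents you know exist. Both function names are hypothetical.

```python
from typing import Iterable, List, Set


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of questions whose predicted label equals the gold label."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def citation_hallucination_rate(cited: Iterable[List[str]],
                                known_keys: Set[str]) -> float:
    """Fraction of cited keys that do not match any known document."""
    all_cited = [key for answer_keys in cited for key in answer_keys]
    if not all_cited:
        return 0.0
    missing = sum(key not in known_keys for key in all_cited)
    return missing / len(all_cited)
```

A manual spot check of a sample of citations is still worth doing, since key matching cannot tell whether a real paper actually supports the claim it is cited for.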
Agent Features
Memory
- retrieval memory (vector DB of 4,000-char chunks)
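A minimal sketch of the chunking step, assuming plain fixed-width splitting with no overlap (the summary gives only the 4,000-character size). The `chunk_document` name and the record layout are hypothetical; each record keeps the source key so a later answer can cite where a passage came from.

```python
from typing import Dict, List


def chunk_document(doc_key: str, text: str, size: int = 4000) -> List[Dict[str, object]]:
    """Split one document into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        {"doc": doc_key, "index": n, "text": text[start:start + size]}
        for n, start in enumerate(range(0, len(text), size))
    ]
```

Each chunk's text would then be embedded and stored in the vector database alongside its `doc` and `index` fields.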
Planning
- iterative retrieval/search
- map-reduce summarization
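Map-reduce summarization with relevance scoring can be sketched as two small functions. This is an illustration under stated assumptions: `score_and_summarize` stands for an LLM call that returns a relevance score and a summary for one chunk, `compose` stands for the final answer call, and `toy_scorer` is a word-overlap placeholder so the example runs without an API. The `top_k` cutoff is also an assumption.

```python
from typing import Callable, List, Tuple

Scored = Tuple[int, str]  # (relevance score, summary)


def map_step(question: str, chunks: List[str],
             score_and_summarize: Callable[[str, str], Scored]) -> List[Scored]:
    """Map: score and summarize every chunk independently."""
    return [score_and_summarize(question, chunk) for chunk in chunks]


def reduce_step(question: str, scored: List[Scored],
                compose: Callable[[str, str], str], top_k: int = 2) -> str:
    """Reduce: keep the top_k highest-scoring summaries and compose an answer."""
    best = sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]
    context = "\n".join(summary for _, summary in best)
    return compose(question, context)


def toy_scorer(question: str, chunk: str) -> Scored:
    """Placeholder for the LLM: score is the shared-word count, summary is a prefix."""
    shared = set(question.lower().split()) & set(chunk.lower().split())
    return len(shared), chunk[:40]
```

Because only the short summaries of the best chunks reach the final prompt, the answer step sees far fewer tokens than the raw chunks would require.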
Tool Use
- search
- gather_evidence
- answer_question
Frameworks
- LangChain
- OpenAI LLMs
Is Agentic
true
Architectures
- LLM agent orchestrator
- tool-based modular pipeline
Collaboration
- single-agent (multiple LLM instances)
Optimization Features
Token Efficiency
- chunking and map-reduce to limit context tokens
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Assumes underlying papers are correct; wrong papers lead to wrong answers.
- LitQA is small (50 Qs) and focused on biomedical literature, limiting generality.
- Search, paper access, and PDF parsing can fail and reduce performance.
- Multiple LLM prompts and instances add system complexity and tuning burden.
When Not To Use
- If the needed information is in textbooks rather than papers (e.g., exam-style knowledge).
- When reliable full-text access to domain papers is unavailable.
- For real-time, low-latency needs: PaperQA pipelines can take hours for large batches.
Failure Modes
- PDF parsing errors yielding garbled chunks, mitigated but not eliminated by summarization.
- Citing secondary sources mentioned in a primary source when the secondary is inaccessible.
- Search engine limitations causing missed papers (Semantic Scholar vs Google differences).
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
- Claude-2
- AutoGPT
Metrics
- Accuracy
- precision (correct-sure)
- hallucination rate
- retrieval AUC
- Cramér's V (categorical correlation)
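Of the metrics listed, Cramér's V is the least familiar, so a small worked version may help. It is computed from a contingency table of two categorical variables as the square root of chi-squared divided by n times (min(rows, columns) - 1). The summary does not say which variables the authors compared, so the tables in the usage note are made up for illustration.

```python
import math
from typing import List


def cramers_v(table: List[List[int]]) -> float:
    """Cramér's V for a contingency table of observed counts (no bias correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k < 1:
        raise ValueError("need a non-empty table with at least 2 rows and 2 columns")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in all-zero rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))
```

A perfectly associated table such as `[[10, 0], [0, 10]]` gives 1.0, and a table with no association such as `[[5, 5], [5, 5]]` gives 0.0.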
Datasets
- LitQA
- PubMedQA
- MedQA
- BioASQ
Benchmarks
- LitQA
- PubMedQA blind

