PaperQA: an agentic RAG that retrieves full-text papers, cites sources, and matches experts on a new LitQA benchmark

December 8, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.8

Citation Count

51

Authors

Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White

Links

Abstract / PDF

Why It Matters For Business

PaperQA shows agentic retrieval plus LLMs can deliver near-expert literature answers with reliable citations and low cost per query, making it practical for automated literature triage, fast reviews, and decision support.

Summary TLDR

PaperQA is an agent-style Retrieval-Augmented Generation (RAG) system that finds full-text papers, summarizes ranked passages, and composes answers with citations. The authors introduce LitQA, a 50-question biomedical benchmark built from post-2021 papers so that answering requires retrieval rather than recall. PaperQA outperforms baseline LLMs and commercial tools on LitQA (69.5% vs 66.8% for human experts), scores 86.3% on a closed-book PubMedQA split (vs 57.9% for GPT-4), and produced no hallucinated citations among the 237 evaluated. The system is built from modular tools (search, gather evidence, answer), uses LLM relevance scoring and map-reduce summarization, and runs on LangChain with GPT-4/GPT-3.5. Key caveats: LitQA is small (50 questions) and focused on biomedical literature.

Problem Statement

Out-of-date or hallucinating LLMs make scientific QA unsafe or useless. We need systems that 1) retrieve up-to-date full-text papers, 2) select and summarize relevant passages, and 3) give answers with reliable citations so researchers can trust and verify results.

Main Contribution

PaperQA: an agentic RAG pipeline that decomposes retrieval into search, gather-evidence (map), and answer (reduce) tools and lets an LLM orchestrate iterative calls.

LitQA: a 50-question biomedical benchmark drawn from post-2021 full-text papers that requires retrieval and synthesis.

Empirical evaluation showing PaperQA outperforms pre-trained LLMs and commercial literature tools on LitQA and improves closed-book PubMedQA performance.

Analysis showing near-zero citation hallucinations, ablations that identify key components, and cost/time estimates showing cost-efficiency vs humans.
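The decomposition into search, gather-evidence, and answer tools can be illustrated with a minimal sketch. This is not the authors' implementation: the `AgentState` class, the keyword-matching `search`, and the rule-based `choose_tool` (standing in for the LLM orchestrator) are all hypothetical placeholders, and no real LLM or search engine is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    question: str
    papers: List[str] = field(default_factory=list)    # titles found so far
    evidence: List[str] = field(default_factory=list)  # summarized passages
    answer: str = ""


def search(state: AgentState, corpus: Dict[str, str]) -> None:
    """Hypothetical search tool: keyword overlap stands in for a paper search engine."""
    words = set(state.question.lower().split())
    for title, text in corpus.items():
        if title not in state.papers and words & set(text.lower().split()):
            state.papers.append(title)


def gather_evidence(state: AgentState, corpus: Dict[str, str]) -> None:
    """Map step: one short 'summary' per paper (truncation stands in for an LLM)."""
    for title in state.papers:
        snippet = f"{corpus[title][:80]} ({title})"
        if snippet not in state.evidence:
            state.evidence.append(snippet)


def answer_question(state: AgentState, corpus: Dict[str, str]) -> None:
    """Reduce step: combine the gathered evidence into a cited answer."""
    state.answer = " ".join(state.evidence) if state.evidence else "I cannot answer."


def choose_tool(state: AgentState) -> str:
    """Rule-based stand-in for the LLM that decides which tool to call next."""
    if not state.papers:
        return "search"
    if not state.evidence:
        return "gather_evidence"
    return "answer_question"


TOOLS: Dict[str, Callable[[AgentState, Dict[str, str]], None]] = {
    "search": search,
    "gather_evidence": gather_evidence,
    "answer_question": answer_question,
}


def run_agent(question: str, corpus: Dict[str, str], max_steps: int = 5) -> AgentState:
    state = AgentState(question)
    for _ in range(max_steps):
        tool = choose_tool(state)
        TOOLS[tool](state, corpus)
        if tool == "answer_question":
            break
    if not state.answer:  # step budget exhausted without answering
        answer_question(state, corpus)
    return state


if __name__ == "__main__":
    toy_corpus = {"Smith 2022": "CRISPR editing improves yield in rice"}
    print(run_agent("Does CRISPR editing improve rice yield", toy_corpus).answer)
```

In a real system the orchestrating LLM could call `search` or `gather_evidence` several times before answering; the fixed step budget here is one simple way to bound that loop.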

Key Findings

PaperQA achieves 69.5% accuracy on LitQA, slightly above human experts.

Numbers: PaperQA 69.5% vs Human 66.8% (LitQA, Table 2)

On a closed-book PubMedQA test, PaperQA scores 86.3% vs GPT-4's 57.9%.

Numbers: PaperQA 86.3% vs GPT-4 57.9% (PubMedQA blind, Table 5)

In citation audits PaperQA showed 0% hallucinated citations on 237 citations evaluated.

Numbers: PaperQA 0% hallucinations (N=237); GPT-3.5 full hallucination 33.75% (N=80), GPT-4 full hallucination 29.41% (N=51) (Table 4)

Cost and speed: PaperQA averaged ~$0.18 per question and completed the LitQA run in ~2.4 hours, comparable to the 2.5 hours given to human experts.

Numbers: $0.18 per question; ~2.4 hours total vs humans 2.5 hours (Section 5.2)
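The cost and time figures can be turned into batch-level estimates with simple arithmetic. The sketch below assumes that the ~2.4 hours covers all 50 LitQA questions and that cost scales linearly with question count; the function name `batch_estimate` is illustrative only.

```python
from typing import Tuple


def batch_estimate(n_questions: int, cost_per_question_usd: float,
                   total_hours: float) -> Tuple[float, float]:
    """Return (total API cost in USD, average minutes per question)."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    total_cost = round(n_questions * cost_per_question_usd, 2)
    minutes_per_question = round(total_hours * 60 / n_questions, 2)
    return total_cost, minutes_per_question


if __name__ == "__main__":
    # Figures reported above: 50 questions, ~$0.18 each, ~2.4 hours in total.
    cost, minutes = batch_estimate(50, 0.18, 2.4)
    print(f"Total cost: ${cost:.2f}, about {minutes} minutes per question")
```

Under these assumptions the full LitQA run costs roughly $9 in API calls and averages a little under three minutes per question.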

Results

Accuracy (LitQA)

Value: 69.5% (PaperQA)

Baseline: 66.8% (Human experts)

Accuracy (PubMedQA blind)

Value: 86.3% (PaperQA)

Baseline: 57.9% (GPT-4)

Citation hallucination rate

Value: 0% (PaperQA, N=237 citations)

Baseline: 29.41% (GPT-4 full hallucination, N=51)

Cost per question (API LLM costs)

Value: $0.18 per question (estimated Sept 2023 pricing)

Baseline: Human time cost (not directly comparable)

What To Try In 7 Days

Prototype a three-tool agent (search, gather, answer) using LangChain and an embedding model for a focused literature subfield.

Create 20–50 retrieval-based test questions from recent papers and measure accuracy and citation validity.

Add an LLM-based chunk relevance scorer and map-reduce summarization; compare citation hallucination vs plain LLM output.
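For the measurement step, a small harness along the following lines may be enough to get started. It is a sketch under stated assumptions: the `Result` record is hypothetical, answers are compared by exact match, and a citation counts as "hallucinated" when its key is not among the sources the system actually retrieved. That is a cheap automated proxy, not the citation audit reported in the paper.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Result:
    predicted: str        # answer chosen by the system
    expected: str         # ground-truth answer
    citations: List[str]  # citation keys that appear in the answer
    retrieved: Set[str]   # keys of sources the system actually retrieved


def accuracy(results: List[Result]) -> float:
    """Fraction of questions answered correctly (exact match)."""
    if not results:
        return 0.0
    return sum(r.predicted == r.expected for r in results) / len(results)


def citation_hallucination_rate(results: List[Result]) -> float:
    """Fraction of citations that do not point to a retrieved source."""
    total = sum(len(r.citations) for r in results)
    if total == 0:
        return 0.0
    bad = sum(c not in r.retrieved for r in results for c in r.citations)
    return bad / total


if __name__ == "__main__":
    demo = [
        Result("A", "A", ["p1", "p2"], {"p1", "p2"}),
        Result("B", "C", ["p9"], {"p3"}),  # wrong answer, unsupported citation
    ]
    print(accuracy(demo), citation_hallucination_rate(demo))
```

Running the same question set through a plain LLM and through the agent, then comparing the two rates, gives the comparison suggested above.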

Agent Features

Memory

  • retrieval memory (vector DB of 4,000-char chunks)
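Splitting full text into 4,000-character chunks before embedding might look like the sketch below. The 200-character overlap and the function name `chunk_text` are assumptions made for illustration; the embedding and vector-store steps are omitted.

```python
from typing import List


def chunk_text(text: str, chunk_chars: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_chars <= 0 or not 0 <= overlap < chunk_chars:
        raise ValueError("need chunk_chars > 0 and 0 <= overlap < chunk_chars")
    step = chunk_chars - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    paper_text = "x" * 10_000  # placeholder for parsed PDF text
    print([len(c) for c in chunk_text(paper_text)])
```

Each chunk would then be embedded and stored alongside its source metadata so that answers can cite the paper a passage came from.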

Planning

  • iterative retrieval/search
  • map-reduce summarization
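Map-reduce summarization with relevance scoring can be sketched as follows. The map step produces a scored summary per chunk and the reduce step keeps only the highest-scoring summaries as context for the final answer. The 0 to 10 score scale, the `top_k` and `min_score` defaults, and the `keyword_summarizer` stand-in (used instead of an LLM call) are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Scored = Tuple[int, str]  # (relevance score, summary)


def keyword_summarizer(question: str, chunk: str) -> Scored:
    """Stand-in for an LLM: score by shared words, 'summarize' by truncation."""
    shared = set(question.lower().split()) & set(chunk.lower().split())
    return min(len(shared), 10), chunk[:60]


def map_chunks(question: str, chunks: List[str],
               summarize: Callable[[str, str], Scored]) -> List[Scored]:
    """Map step: summarize and score every chunk independently."""
    return [summarize(question, chunk) for chunk in chunks]


def reduce_summaries(scored: List[Scored], top_k: int = 3, min_score: int = 1) -> str:
    """Reduce step: keep the top_k relevant summaries as answer context."""
    relevant = [s for s in scored if s[0] >= min_score]
    kept = sorted(relevant, key=lambda s: s[0], reverse=True)[:top_k]
    return "\n".join(f"[{score}/10] {summary}" for score, summary in kept)


if __name__ == "__main__":
    demo_chunks = [
        "protein folding rate depends on temperature",
        "unrelated text about weather",
        "folding of protein chains",
    ]
    scored = map_chunks("protein folding rate", demo_chunks, keyword_summarizer)
    print(reduce_summaries(scored))
```

Because each chunk is summarized independently, the map step parallelizes naturally, and irrelevant chunks are dropped before they consume context in the final prompt.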

Tool Use

  • search
  • gather_evidence
  • answer_question

Frameworks

  • LangChain
  • OpenAI LLMs

Is Agentic

true

Architectures

  • LLM agent orchestrator
  • tool-based modular pipeline

Collaboration

  • single-agent (multiple LLM instances)

Optimization Features

Token Efficiency

  • chunking and map-reduce to limit context tokens

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Assumes underlying papers are correct; wrong papers lead to wrong answers.
  • LitQA is small (50 Qs) and focused on biomedical literature, limiting generality.
  • Search, paper access, and PDF parsing can fail and reduce performance.
  • Multiple LLM prompts and instances add system complexity and tuning burden.

When Not To Use

  • If the needed information is in textbooks rather than papers (e.g., exam-style knowledge).
  • When reliable full-text access to domain papers is unavailable.
  • For real-time, low-latency needs, since PaperQA pipelines can take hours for large batches.

Failure Modes

  • PDF parsing errors yielding garbled chunks, mitigated but not eliminated by summarization.
  • Citing secondary sources mentioned in a primary source when the secondary is inaccessible.
  • Search engine limitations causing missed papers (Semantic Scholar vs Google differences).

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • Claude-2
  • AutoGPT

Metrics

  • Accuracy
  • precision (correct-sure)
  • hallucination rate
  • retrieval AUC
  • Cramer's V (categorical correlation)
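The difference between accuracy and precision can be made concrete with a short sketch. It assumes that "correct-sure" means correct answers divided by the questions where the system committed to an answer (that is, did not respond "unsure"); the counts in the usage example are made up for illustration and are not results from the paper.

```python
from typing import Tuple


def accuracy_and_precision(correct: int, incorrect: int, unsure: int) -> Tuple[float, float]:
    """Accuracy = correct / all questions; precision = correct / answered questions."""
    total = correct + incorrect + unsure
    sure = correct + incorrect
    accuracy = correct / total if total else 0.0
    precision = correct / sure if sure else 0.0
    return accuracy, precision


if __name__ == "__main__":
    # Hypothetical counts: 30 correct, 10 incorrect, 10 unsure.
    print(accuracy_and_precision(30, 10, 10))
```

A system that declines to answer when evidence is thin will show precision above accuracy, which is why reporting both is informative.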

Datasets

  • LitQA
  • PubMedQA
  • MedQA
  • BioASQ

Benchmarks

  • LitQA
  • PubMedQA blind