Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.65
Citation Count
39
Why It Matters For Business
LLMs used in healthcare often cite sources that do not back their claims. That creates legal, safety, and trust risks for any product that displays model-cited medical advice.
Summary TLDR
The authors build SourceCheckup, an automated pipeline that (1) generates medical questions from web pages, (2) asks LLMs for answers with sources, and (3) uses GPT-4 as a source verifier. The GPT-4 verifier agrees with a panel of doctors 88% of the time. Using a 1,200-question corpus (≈40K statement–source pairs), they evaluate five LLMs. Top results: GPT-4 (RAG) almost always returns valid URLs (99.2%), but only 69.4% of its statements and 54.3% of its full responses are fully supported by the cited sources. Non‑RAG API models often cite invalid or non-supporting URLs and have much lower support rates (e.g., GPT-4 API response-level support 22.7%; Gemini Pro 7.6%). The paper checks support (whether the cited source is evidence for the statement), not whether the statement is objectively true.
Problem Statement
LLMs are starting to cite web sources for medical answers. But do the cited sources actually back up the model's claims? Manual expert checks are slow and costly, so we need a scalable, validated way to measure whether LLMs provide verifiable, supporting references.
Main Contribution
SourceCheckup: an end-to-end automated pipeline for generating medical questions and checking if model-cited sources support each statement.
Validation that GPT-4 can act as an automated source verifier: 88% agreement with three US-licensed doctors on 284 statement–source pairs.
A public dataset: 1,200 questions drawn from Mayo Clinic, UpToDate, and Reddit plus a clinician-annotated subset; large-scale evaluation of five top LLMs (≈40K statement–source pairs).
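The three stages named in the summary (question generation, answering with sources, source verification) can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `Verdict`, and the three callables are hypothetical names, and the `demo_*` functions are toy stand-ins so the sketch runs without any model or network access.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    question: str
    statement: str
    url: str
    supported: bool


def run_pipeline(
    pages: List[str],
    make_question: Callable[[str], str],
    answer_with_sources: Callable[[str], List[Tuple[str, str]]],
    verify: Callable[[str, str], bool],
) -> List[Verdict]:
    """Hypothetical end-to-end loop: page -> question -> (statement, url) pairs -> verdicts."""
    verdicts = []
    for page in pages:
        question = make_question(page)
        for statement, url in answer_with_sources(question):
            verdicts.append(Verdict(question, statement, url, verify(statement, url)))
    return verdicts


# Toy stand-ins (assumptions for illustration only; real stages would call LLMs).
def demo_make_question(page: str) -> str:
    return f"What does this page say about {page.split()[0].lower()}?"


def demo_answer(question: str) -> List[Tuple[str, str]]:
    return [
        ("Aspirin can thin the blood.", "https://example.org/aspirin"),
        ("Aspirin cures colds.", "https://example.org/colds"),
    ]


def demo_verify(statement: str, url: str) -> bool:
    return "aspirin" in url


verdicts = run_pipeline(
    ["Aspirin overview page text"], demo_make_question, demo_answer, demo_verify
)
```

Passing the stages in as callables keeps the loop independent of any particular model provider, so the verifier can be swapped or spot-checked by humans without touching the rest.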
Key Findings
GPT-4 as a verifier closely matches doctors when checking if a source supports a statement.
Many LLM statements lack supporting evidence in their cited sources.
Non‑RAG API models frequently emit invalid or non-supporting URLs and have much lower support rates.
Retrieval augmentation reduces URL hallucination but does not eliminate unsupported statements.
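To make the verifier-versus-doctor comparison concrete, here is a minimal sketch of a percent-agreement calculation. It assumes the verifier is compared against the majority vote of an odd-sized panel; the summary does not say how the panel labels were aggregated, so that choice and the function names are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def majority(labels: Sequence[bool]) -> bool:
    """Majority vote; assumes an odd number of annotators so ties cannot occur."""
    return Counter(labels).most_common(1)[0][0]


def percent_agreement(verifier: List[bool], doctors: List[Sequence[bool]]) -> float:
    """Share of statement-source pairs where the verifier matches the panel majority."""
    if len(verifier) != len(doctors) or not verifier:
        raise ValueError("need equal-length, non-empty label lists")
    matches = sum(v == majority(d) for v, d in zip(verifier, doctors))
    return 100.0 * matches / len(verifier)


# Made-up labels for four pairs, three annotators each (not data from the paper).
verifier_labels = [True, False, True, True]
doctor_labels = [
    (True, True, False),
    (False, False, False),
    (False, False, True),
    (True, True, True),
]
agreement = percent_agreement(verifier_labels, doctor_labels)
```

Raw agreement does not correct for chance, so a chance-corrected statistic may be worth reporting alongside it when labels are imbalanced.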
Results
Source URL Validity (GPT-4 RAG): 99.2%
Statement-level support (GPT-4 RAG): 69.4%
Response-level support (GPT-4 RAG): 54.3%
Statement-level support (GPT-4 API)
Response-level support (GPT-4 API): 22.7%
Who Should Care
What To Try In 7 Days
Run SourceCheckup (or similar) on your LLM output to measure URL validity and support rates.
Add an automated source verification step (use a vetted LLM like GPT-4) and flag unsupported statements for human review.
Prefer RAG plus source verification over plain API models when showing citations to users.
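The second step above (flag unsupported statements for human review) can be prototyped as a simple triage function. This is a sketch under stated assumptions: the `url_valid` and `supported` fields are hypothetical and would be filled in by a URL check and a verifier model upstream.

```python
from typing import Dict, List


def triage(statements: List[Dict]) -> Dict[str, List[str]]:
    """Split statements into those safe to show and those needing human review."""
    queue: Dict[str, List[str]] = {"show": [], "human_review": []}
    for s in statements:
        # A statement passes only if its URL is valid AND the source supports it.
        ok = s["url_valid"] and s["supported"]
        queue["show" if ok else "human_review"].append(s["text"])
    return queue


# Made-up example records (field names are assumptions).
queue = triage(
    [
        {"text": "A", "url_valid": True, "supported": True},
        {"text": "B", "url_valid": True, "supported": False},
        {"text": "C", "url_valid": False, "supported": False},
    ]
)
```

Routing everything that fails either check to a reviewer is deliberately conservative, which matches the advice not to rely on the automated verifier alone.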
Reproducibility
Data Urls
- Authors state they open-source the curated dataset (see paper for link)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- SourceCheckup measures whether a cited source supports a statement, not whether the statement is objectively true.
- Reference pages come from only three web sources (Mayo Clinic, UpToDate, Reddit), which limits coverage.
- GPT-4 used for verification could exhibit bias or make errors despite high agreement, so human spot checks are still required.
When Not To Use
- Do not treat 'supported by source' as guaranteed clinical correctness without clinician review.
- Do not apply the pipeline as-is to private EHR data or paywalled sources without adapting access rules.
- Do not rely solely on the automated verifier for high-stakes medical decisions.
Failure Modes
- LLM returns hallucinated or malformed URLs (invalid or non-existent pages).
- RAG returns a valid URL that does not actually contain the claimed evidence.
- Automated verifier mislabels borderline or implicit evidence; disagreement with experts can occur.
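The first failure mode (malformed URLs) can be caught cheaply before any page is fetched. The sketch below is a syntactic pre-check only, using the standard library's `urllib.parse.urlparse`; it cannot tell whether a well-formed URL points to a page that exists, which would need an actual HTTP request. The function name and the "host must contain a dot" rule are assumptions.

```python
from urllib.parse import urlparse


def looks_like_http_url(url: str) -> bool:
    """Syntactic check only: http(s) scheme, dotted host, no whitespace."""
    candidate = url.strip()
    if not candidate or any(ch.isspace() for ch in candidate):
        return False
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse rejects some malformed hosts, e.g. unbalanced brackets.
        return False
    return parts.scheme in ("http", "https") and "." in parts.netloc


checks = {
    "https://www.mayoclinic.org/diseases-conditions": looks_like_http_url(
        "https://www.mayoclinic.org/diseases-conditions"
    ),
    "mayoclinic.org/page": looks_like_http_url("mayoclinic.org/page"),
    "https://": looks_like_http_url("https://"),
}
```

URLs that pass this check still need a reachability test and a support check, since a valid URL may not contain the claimed evidence.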
Core Entities
Models
- GPT-4 (RAG)
- GPT-4 (API)
- Claude v2.1
- Mistral Medium
- Gemini Pro
- GPT-4 (as Source Verification model)
Metrics
- Source URL Validity
- Statement-level support
- Response-level support
- Percent of URLs not supporting any statement
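A minimal sketch of how the first three metrics could be computed from verifier output. The definitions here are assumptions inferred from the summary: URL validity is the share of cited URLs that are valid, statement-level support is the share of statements supported by their sources, and response-level support is the share of responses in which every statement is supported. The record layout is hypothetical.

```python
from typing import Dict, List


def support_metrics(responses: List[Dict[str, List[bool]]]) -> Dict[str, float]:
    """Aggregate per-response flags into the three headline rates (as fractions)."""
    urls = [v for r in responses for v in r["url_valid"]]
    statements = [s for r in responses for s in r["statement_supported"]]
    if not responses or not urls or not statements:
        raise ValueError("need at least one response, URL, and statement")
    return {
        "url_validity": sum(urls) / len(urls),
        "statement_support": sum(statements) / len(statements),
        # A response counts only if all of its statements are supported.
        "response_support": sum(all(r["statement_supported"]) for r in responses)
        / len(responses),
    }


# Made-up flags for two responses (not data from the paper).
metrics = support_metrics(
    [
        {"url_valid": [True, True], "statement_supported": [True, True]},
        {"url_valid": [True, False], "statement_supported": [True, False, True]},
    ]
)
```

Because one unsupported statement fails the whole response, response-level support can never exceed what the statement-level flags allow, which is consistent with the lower response-level numbers reported in the summary.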
Datasets
- Mayo Clinic web pages
- UpToDate pages
- Reddit r/AskDocs
- SourceCheckup 1,200-question corpus
- Clinician-annotated subset (N=284)
Benchmarks
- SourceCheckup

