Overview
The paper provides a validated, automated pipeline and many quantitative results; the evidence is strengthened by expert doctor agreement and large-scale sampling.
Citations: 39
Evidence Strength: 0.80
Confidence: 0.88
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 60%
Novelty: 55%
Why It Matters For Business
LLMs used in healthcare often cite sources that do not back their claims. That creates legal, safety, and trust risks for any product that displays model-cited medical advice.
Summary TLDR
The authors build SourceCheckup, an automated pipeline that (1) generates medical questions from web pages, (2) asks LLMs for answers and sources, and (3) uses GPT-4 as a source verifier. The GPT-4 verifier agrees with a panel of doctors 88% of the time. Using a 1,200-question corpus (≈40K statement–source pairs), they evaluate five LLMs. Top results: GPT-4 (RAG) almost always returns valid URLs (99.2%), but only 69.4% of its statements and 54.3% of its full responses are fully supported by the sources it cites. Non-RAG API models often cite invalid or non-supporting URLs and have much lower support rates (e.g., GPT-4 API response-level support is 22.7%; Gemini Pro, 7.6%). The paper checks support (is the source evidence)
Problem Statement
LLMs are starting to cite web sources for medical answers. But do the cited sources actually back up the model's claims? Manual expert checks are slow and costly, so we need a scalable, validated way to measure whether LLMs provide verifiable, supporting references.
Main Contribution
SourceCheckup: an end-to-end automated pipeline for generating medical questions and checking if model-cited sources support each statement.
Validation that GPT-4 can act as an automated source verifier: 88% agreement with three US-licensed doctors on 284 statement–source pairs.
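The three stages named above (question generation, answering with sources, source verification) can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `PairVerdict`, and the three callables are hypothetical names, and the toy stand-ins replace the real LLM calls so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PairVerdict:
    """One statement-source pair and the verifier's decision (hypothetical type)."""
    question: str
    statement: str
    source_url: str
    supported: bool


def run_pipeline(
    pages: List[str],
    make_question: Callable[[str], str],
    answer_with_sources: Callable[[str], List[Tuple[str, str]]],
    verify: Callable[[str, str], bool],
) -> List[PairVerdict]:
    """Chain the three stages; every callable is an assumed stand-in."""
    verdicts = []
    for page in pages:
        question = make_question(page)  # stage 1: question from a reference page
        # stage 2: the model under test returns (statement, cited URL) pairs
        for statement, url in answer_with_sources(question):
            # stage 3: the verifier decides whether the source supports the statement
            verdicts.append(PairVerdict(question, statement, url, verify(statement, url)))
    return verdicts


# Toy stand-ins so the sketch runs without network or model access.
def toy_question(page: str) -> str:
    return f"What does this page say about {page}?"


def toy_answer(question: str) -> List[Tuple[str, str]]:
    return [("Claim A", "https://example.org/a"), ("Claim B", "https://example.org/b")]


def toy_verify(statement: str, url: str) -> bool:
    return url.endswith("/a")


verdicts = run_pipeline(["aspirin"], toy_question, toy_answer, toy_verify)
```

Passing the stages in as callables keeps the model under test and the verifier swappable, which matters when comparing several LLMs against one fixed verifier.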
Key Findings
GPT-4 as a verifier closely matches doctors when checking if a source supports a statement.
Many LLM statements lack supporting evidence in their cited sources.
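The verifier-versus-doctor comparison in the first finding comes down to an agreement rate over labelled statement–source pairs. The sketch below assumes the three doctors' labels are reduced by majority vote before comparison; the summary does not say how the panel labels were combined, so treat that step, and the function names, as assumptions. The labels are made-up toy data, not the paper's.

```python
from typing import List, Sequence


def majority(votes: Sequence[bool]) -> bool:
    """True when more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def percent_agreement(verifier: List[bool], doctor_votes: List[Sequence[bool]]) -> float:
    """Fraction of pairs where the verifier matches the doctors' majority label."""
    if not verifier or len(verifier) != len(doctor_votes):
        raise ValueError("need equal-length, non-empty label lists")
    matches = sum(v == majority(d) for v, d in zip(verifier, doctor_votes))
    return matches / len(verifier)


# Toy labels for four statement-source pairs (illustrative only).
verifier_labels = [True, False, True, True]
doctor_labels = [
    (True, True, False),    # majority: supported
    (False, False, True),   # majority: not supported
    (False, False, True),   # majority: not supported
    (True, True, True),     # majority: supported
]
rate = percent_agreement(verifier_labels, doctor_labels)
```

Raw agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa would be a reasonable complement when the labels are imbalanced.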
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Source URL Validity (GPT-4 RAG) | 0.992 (99.2%) | — | — | all sources, Table 4 | Table 4 shows URL validity for GPT-4 (RAG). | Table 4 |
| Statement-level support (GPT-4 RAG) | 0.694 (69.4%) | — | — | all datasets | Table 4 reports statement-level support per model. | Table 4 |
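
The statement-level and response-level rates differ only in how verdicts are aggregated. The following sketch assumes a response counts as supported only when every statement in it is supported, which matches the "fully supported" wording in the summary but is not confirmed against the paper's exact definition. Function names and the toy verdicts are illustrative.

```python
from typing import List


def statement_level_support(responses: List[List[bool]]) -> float:
    """Share of all statements, pooled across responses, that are supported."""
    flat = [supported for response in responses for supported in response]
    if not flat:
        raise ValueError("no statements to score")
    return sum(flat) / len(flat)


def response_level_support(responses: List[List[bool]]) -> float:
    """Share of responses in which every statement is supported (assumed definition)."""
    scored = [r for r in responses if r]  # skip responses with no statements
    if not scored:
        raise ValueError("no responses to score")
    return sum(all(r) for r in scored) / len(scored)


# Toy verdicts: three responses with two, two, and one statement.
toy_responses = [[True, True], [True, False], [False]]
statement_rate = statement_level_support(toy_responses)   # 3 of 5 statements
response_rate = response_level_support(toy_responses)     # 1 of 3 responses
```

Because a single unsupported statement fails the whole response, the response-level rate can never exceed the statement-level rate on the same data, which is consistent with the gap between 69.4% and 54.3% reported for GPT-4 (RAG).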
What To Try In 7 Days
Run SourceCheckup (or similar) on your LLM output to measure URL validity and support rates.
Add an automated source verification step (use a vetted LLM like GPT-4) and flag unsupported statements for human review.
Prefer RAG plus source verification over plain API models when showing citations to users.
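One way to wire the verification step into a product is a triage function that displays only supported statements and queues the rest for human review. This is a sketch under assumptions: `triage` is a hypothetical helper, and `toy_verify` is a substring check standing in for a call to a vetted LLM verifier.

```python
from typing import Callable, Dict, List, Tuple


def triage(
    statements: Dict[str, List[str]],
    verify: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Split statements into (supported, needs_review).

    A statement is supported if at least one of its cited URLs passes
    `verify`; statements with no citations always go to review.
    """
    supported, needs_review = [], []
    for statement, urls in statements.items():
        if any(verify(statement, url) for url in urls):
            supported.append(statement)
        else:
            needs_review.append(statement)
    return supported, needs_review


# Stand-in page contents; a real system would fetch and parse the cited pages.
PAGE_TEXT = {
    "https://example.org/flu": "annual vaccination is recommended for adults",
}


def toy_verify(statement: str, url: str) -> bool:
    # Placeholder for an LLM verifier call: naive substring match only.
    return statement.lower() in PAGE_TEXT.get(url, "")


ok, review = triage(
    {
        "Annual vaccination is recommended": ["https://example.org/flu"],
        "Vitamin C cures the flu": ["https://example.org/flu"],
        "Rest helps recovery": [],
    },
    toy_verify,
)
```

Routing uncited statements to review by default errs on the side of caution, which suits the medical setting described here.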
Risks & Boundaries
Limitations
SourceCheckup measures whether a cited source supports a statement, not whether the statement is objectively true.
Reference pages come from only three web sources (MayoClinic, UpToDate, Reddit), which limits coverage.
When Not To Use
Do not treat 'supported by source' as guaranteed clinical correctness without clinician review.
Do not apply the pipeline as-is to private EHR data or paywalled sources without adapting access rules.
Failure Modes
LLM returns hallucinated or malformed URLs (invalid or non-existent pages).
RAG returns a valid URL that does not actually contain the claimed evidence.
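The first failure mode can be partly caught before any network request with a syntactic pre-filter on the cited URL. The sketch below uses only the standard library and is deliberately weaker than the URL-validity check reported in the paper: it cannot tell whether a well-formed URL resolves to a real page, and `is_well_formed` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_well_formed(url: str) -> bool:
    """Cheap syntactic check: http(s) scheme, a dotted host, no whitespace."""
    candidate = url.strip()
    if not candidate or any(ch.isspace() for ch in candidate):
        return False
    try:
        parts = urlparse(candidate)
    except ValueError:  # raised for some malformed hosts, e.g. broken IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and "." in parts.netloc


checks = {
    "https://www.mayoclinic.org/diseases-conditions": is_well_formed(
        "https://www.mayoclinic.org/diseases-conditions"
    ),
    "mayoclinic.org/page": is_well_formed("mayoclinic.org/page"),
    "https://": is_well_formed("https://"),
    "htp://example.org": is_well_formed("htp://example.org"),
}
```

URLs that pass this filter still need a reachability check, and the second failure mode (a valid page that lacks the claimed evidence) can only be caught by the source-verification step itself.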

