Automated audit finds many medical LLM answers lack supporting sources

Overview

Decision Snapshot: Ready For Pilot

The paper provides a validated, automated pipeline and many quantitative results; the evidence is strengthened by expert doctor agreement and large-scale sampling.

Citations: 39

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 55%

Authors

Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, James Zou

Links

Abstract / PDF / Data

Why It Matters For Business

LLMs used in healthcare often cite sources that do not back their claims. That creates legal, safety, and trust risks for any product that displays model-cited medical advice.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

The authors build SourceCheckup, an automated pipeline that (1) generates medical questions from web pages, (2) asks LLMs for answers and sources, and (3) uses GPT-4 as a source verifier. The GPT-4 verifier agrees with a panel of doctors 88% of the time. Using a 1,200-question corpus (≈40K statement–source pairs), they evaluate five LLMs. Top results: GPT-4 (RAG) almost always returns valid URLs (99.2%), but only 69.4% of its statements and 54.3% of its full responses are fully supported by the sources it cites. Non-RAG API models often cite invalid or non-supporting URLs and have much lower support rates (e.g., GPT-4 API response-level support 22.7%; Gemini Pro 7.6%). The paper checks support (whether the cited source is evidence for the statement), not whether the statement is objectively true.

Problem Statement

LLMs are starting to cite web sources for medical answers. But do the cited sources actually back up the model's claims? Manual expert checks are slow and costly, so we need a scalable, validated way to measure whether LLMs provide verifiable, supporting references.

Main Contribution

SourceCheckup: an end-to-end automated pipeline for generating medical questions and checking if model-cited sources support each statement.

Validation that GPT-4 can act as an automated source verifier: 88% agreement with three US-licensed doctors on 284 statement–source pairs.
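An agreement figure of this kind can be computed as in the minimal sketch below. It assumes the doctors' consensus is a simple majority vote over three labels; the summary does not say how the paper forms consensus, so treat that rule as an assumption. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among an odd number of annotator votes."""
    return Counter(votes).most_common(1)[0][0]


def percent_agreement(verifier_labels, doctor_votes):
    """Share of statement-source pairs where the verifier matches the doctors' majority."""
    if len(verifier_labels) != len(doctor_votes):
        raise ValueError("inputs must have the same length")
    if not verifier_labels:
        raise ValueError("need at least one pair")
    matches = sum(
        label == majority_label(votes)
        for label, votes in zip(verifier_labels, doctor_votes)
    )
    return matches / len(verifier_labels)


# Toy data, made up for illustration (True = "source supports statement").
verifier = [True, False, True, True]
doctors = [
    (True, True, False),
    (False, False, False),
    (False, False, True),
    (True, True, True),
]
agreement = percent_agreement(verifier, doctors)
print(f"{agreement:.1%}")  # 75.0%
```

On real data, the same function would be run over the clinician-annotated pairs, with the verifier's label and the three doctors' labels for each pair.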

Key Findings

GPT-4 as a verifier closely matches doctors when checking if a source supports a statement.

Numbers: 88.0% agreement (N=284) vs doctor consensus

Practical Use: You can use GPT-4 to scale source-support checks, but retain spot expert review for high-risk cases.

Evidence Ref: Section 4.1.2, Figure 4, Table 5

Many LLM statements lack supporting evidence in their cited sources.

Numbers: GPT-4 (RAG) statement support 69.4%; response-level support 54.3%

Practical Use: Even web-enabled LLMs often make statements that their own citations do not back; verify citations before clinical use.

Evidence Ref: Table 4
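The gap between the two support numbers is easier to see with a small worked example. The sketch below assumes that statement-level support is the share of individual statements backed by a cited source, and that response-level support is the share of responses in which every statement is backed. That reading follows the wording of this summary; the paper's exact definitions may differ. The function names and the flags are hypothetical.

```python
def statement_level_support(responses):
    """Fraction of all statements that are supported by a cited source."""
    flags = [flag for response in responses for flag in response]
    return sum(flags) / len(flags) if flags else 0.0


def response_level_support(responses):
    """Fraction of non-empty responses in which every statement is supported."""
    non_empty = [response for response in responses if response]
    if not non_empty:
        return 0.0
    return sum(all(response) for response in non_empty) / len(non_empty)


# Each inner list holds one response's per-statement support flags (toy data).
responses = [
    [True, True, True],
    [True, False],
    [True, True],
    [False, True, True],
]
stmt_rate = statement_level_support(responses)
resp_rate = response_level_support(responses)
print(stmt_rate, resp_rate)  # 0.8 0.5
```

A single unsupported statement fails the whole response, so the response-level rate is never higher than the statement-level rate when every response has at least one statement.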

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Source URL Validity (GPT-4 RAG)	0.992 (99.2%)	—	—	all sources, Table 4	Table 4 shows URL validity for GPT-4 (RAG).	Table 4
Statement-level support (GPT-4 RAG)	0.694 (69.4%)	—	—	all datasets	Table 4 reports statement-level support per model.	Table 4

What To Try In 7 Days

Run SourceCheckup (or similar) on your LLM output to measure URL validity and support rates.

Add an automated source verification step (use a vetted LLM like GPT-4) and flag unsupported statements for human review.

Prefer RAG plus source verification over plain API models when showing citations to users.
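The verification step suggested above could be wired in roughly as follows. This is a sketch, not the SourceCheckup implementation: `Claim`, `flag_unsupported`, and `keyword_stub_verifier` are hypothetical names, and the stub verifier is a crude keyword match that stands in for a call to a vetted LLM. The example text is invented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    statement: str
    source_text: str  # text already fetched from the cited URL


# A verifier takes (statement, source_text) and returns True if supported.
Verifier = Callable[[str, str], bool]


def flag_unsupported(claims: List[Claim], verifier: Verifier) -> List[Claim]:
    """Return the claims the verifier rejects, to be queued for human review."""
    return [c for c in claims if not verifier(c.statement, c.source_text)]


def keyword_stub_verifier(statement: str, source_text: str) -> bool:
    """Placeholder only: passes if every word of the statement occurs in the source.

    In practice this would be replaced by an LLM-based support check.
    """
    words = [w.strip(".,;:").lower() for w in statement.split()]
    source = source_text.lower()
    return all(w in source for w in words if w)


claims = [
    Claim("Ibuprofen can irritate the stomach.",
          "Ibuprofen can irritate the stomach lining in some people."),
    Claim("Ibuprofen cures migraines.",
          "Ibuprofen may relieve mild headache pain."),
]
flagged = flag_unsupported(claims, keyword_stub_verifier)
for claim in flagged:
    print("needs review:", claim.statement)
```

Keeping the verifier as a plain callable makes it straightforward to swap the stub for a model-backed check and to compare the two on a sample that clinicians have already labeled.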

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Authors state they open-source the curated dataset (see paper for link)

Risks & Boundaries

Limitations

SourceCheckup measures whether a cited source supports a statement, not whether the statement is objectively true.

Reference pages come from only three web sources (MayoClinic, UpToDate, Reddit), which limits coverage.

When Not To Use

Do not treat 'supported by source' as guaranteed clinical correctness without clinician review.

Do not apply the pipeline as-is to private EHR data or paywalled sources without adapting access rules.

Failure Modes

LLM returns hallucinated or malformed URLs (invalid or non-existent pages).

RAG returns a valid URL that does not actually contain the claimed evidence.
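The first failure mode can be partly screened before any network request is made. The sketch below uses only the Python standard library to reject malformed URLs. The helper name `is_well_formed_url` and the rule it applies (an http or https scheme plus a dotted hostname) are assumptions for illustration, not the paper's validity criterion. It cannot detect well-formed URLs that point to non-existent pages, which needs an HTTP request, and it cannot detect the second failure mode, which needs a support check on the page content.

```python
from urllib.parse import urlsplit


def is_well_formed_url(url: str) -> bool:
    """Cheap syntactic screen: http(s) scheme and a hostname containing a dot."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs, e.g. bad IPv6 brackets.
        return False
    host = parts.hostname
    return parts.scheme in ("http", "https") and bool(host) and "." in host


candidates = [
    "https://www.mayoclinic.org/diseases-conditions",
    "www.mayoclinic.org/diseases-conditions",  # missing scheme
    "https://",                                # missing host
    "http://[bad",                             # malformed brackets
]
results = [is_well_formed_url(u) for u in candidates]
print(results)  # [True, False, False, False]
```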

Core Entities

Models

GPT-4 (RAG), GPT-4 (API), Claude v2.1, Mistral Medium, Gemini Pro, GPT-4 (as Source Verification model)

Metrics

Source URL Validity, Statement-level support, Response-level support, Percent of URLs not supporting any statement

Datasets

MayoClinic web pages, UpToDate pages, Reddit r/AskDocs, SourceCheckup 1,200-question corpus, Clinician-annotated subset (N=284)

Benchmarks

SourceCheckup
