Automated audit finds many medical LLM answers lack supporting sources

February 3, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.65

Citation Count

39

Authors

Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, James Zou

Links

Abstract / PDF

Why It Matters For Business

LLMs used in healthcare often cite sources that do not back their claims. That creates legal, safety, and trust risks for any product that displays model-cited medical advice.

Summary TLDR

The authors build SourceCheckup, an automated pipeline that (1) generates medical questions from web pages, (2) asks LLMs for answers and sources, and (3) uses GPT-4 as a source verifier. The GPT-4 verifier agrees with a panel of doctors 88% of the time. Using a 1,200-question corpus (≈40K statement–source pairs), they evaluate five LLMs. Top results: GPT-4 (RAG) almost always returns valid URLs (99.2%), but only 69.4% of its statements and 54.3% of its full responses are fully supported by the sources it cites. Non‑RAG API models often cite invalid or non-supporting URLs and have much lower support rates (e.g., response-level support of 22.7% for GPT-4 API and 7.6% for Gemini Pro). The paper checks support (whether the cited source is evidence for the statement), not whether the statement is objectively true.

Problem Statement

LLMs are starting to cite web sources for medical answers. But do the cited sources actually back up the model's claims? Manual expert checks are slow and costly, so we need a scalable, validated way to measure whether LLMs provide verifiable, supporting references.

Main Contribution

SourceCheckup: an end-to-end automated pipeline for generating medical questions and checking if model-cited sources support each statement.

Validation that GPT-4 can act as an automated source verifier: 88% agreement with three US-licensed doctors on 284 statement–source pairs.

A public dataset: 1,200 questions drawn from Mayo Clinic, UpToDate, and Reddit plus a clinician-annotated subset; large-scale evaluation of five top LLMs (≈40K statement–source pairs).
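The three pipeline stages can be sketched as a loop over pluggable functions. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `PairVerdict`, and the three callables are hypothetical names, and the stub functions stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class PairVerdict:
    question: str
    statement: str
    source_url: Optional[str]  # None when no source was cited for the statement
    supported: bool

# One parsed model answer: a list of (statement, cited URLs) tuples.
ParsedAnswer = List[Tuple[str, List[str]]]

def run_pipeline(
    pages: List[str],
    make_questions: Callable[[str], List[str]],
    answer_with_sources: Callable[[str], ParsedAnswer],
    verify: Callable[[str, str], bool],
) -> List[PairVerdict]:
    """Stage 1: questions from pages. Stage 2: answers with sources.
    Stage 3: one verifier call per statement-source pair."""
    verdicts: List[PairVerdict] = []
    for page in pages:
        for question in make_questions(page):
            for statement, urls in answer_with_sources(question):
                if not urls:
                    # Assumption: an uncited statement counts as unsupported.
                    verdicts.append(PairVerdict(question, statement, None, False))
                for url in urls:
                    verdicts.append(
                        PairVerdict(question, statement, url, verify(statement, url))
                    )
    return verdicts

# Hypothetical stubs; a real setup would call an LLM in each of these.
def stub_questions(page: str) -> List[str]:
    return [f"What should a patient know about {page}?"]

def stub_answer(question: str) -> ParsedAnswer:
    return [
        ("Statement one.", ["https://example.org/a"]),
        ("Statement two.", []),
    ]

def stub_verify(statement: str, url: str) -> bool:
    return statement == "Statement one."

demo_verdicts = run_pipeline(["migraine"], stub_questions, stub_answer, stub_verify)
```

Keeping the stages as injected callables makes it easy to swap the model under test without touching the verifier, which matches the summary's framing of one verifier evaluating several LLMs.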

Key Findings

GPT-4 as a verifier closely matches doctors when checking if a source supports a statement.

Numbers: 88.0% agreement (N=284) vs. doctor consensus

Many LLM statements lack supporting evidence in their cited sources.

Numbers: GPT-4 (RAG) statement support 69.4%; response-level support 54.3%

Non‑RAG API models frequently emit invalid or non-supporting URLs and have much lower support rates.

Numbers: GPT-4 (API) response-level support 22.7%; Gemini Pro 7.6%; URL validity 41.3%–69.2% across APIs

Retrieval augmentation reduces URL hallucination but does not eliminate unsupported statements.

Numbers: GPT-4 (RAG) URL validity 99.2% but ~30% of statements unsupported
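The verifier agreement figure above is a match rate between the verifier's labels and the doctors' consensus. A minimal sketch of that calculation follows, assuming consensus means a majority vote among an odd number of doctors (the summary does not say how consensus was formed). The function names and toy labels are hypothetical and are not the paper's data.

```python
from collections import Counter
from typing import List

def majority(labels: List[bool]) -> bool:
    """Majority vote; requires an odd number of labels so ties cannot occur."""
    if len(labels) % 2 == 0:
        raise ValueError("need an odd number of labels to avoid ties")
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(verifier: List[bool], doctors: List[List[bool]]) -> float:
    """Fraction of pairs where the verifier label equals the doctor majority."""
    if not verifier or len(verifier) != len(doctors):
        raise ValueError("need one non-empty doctor panel per verifier label")
    matches = sum(v == majority(panel) for v, panel in zip(verifier, doctors))
    return matches / len(verifier)

# Toy labels for four statement-source pairs (True = "source supports statement").
toy_verifier = [True, False, True, True]
toy_doctors = [
    [True, True, False],    # majority: True  (verifier matches)
    [False, False, True],   # majority: False (verifier matches)
    [False, False, False],  # majority: False (verifier disagrees)
    [True, True, True],     # majority: True  (verifier matches)
]
toy_agreement = agreement_rate(toy_verifier, toy_doctors)  # 3 of 4 pairs match
```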

Results

Source URL Validity (GPT-4 RAG)

Value: 0.992 (99.2%)

Statement-level support (GPT-4 RAG)

Value: 0.694 (69.4%)

Response-level support (GPT-4 RAG)

Value: 0.543 (54.3%)

Statement-level support (GPT-4 API)

Value: 0.422 (42.2%)

Response-level support (GPT-4 API)

Value: 0.227 (22.7%)

Who Should Care

What To Try In 7 Days

Run SourceCheckup (or similar) on your LLM output to measure URL validity and support rates.

Add an automated source verification step (use a vetted LLM like GPT-4) and flag unsupported statements for human review.

Prefer RAG plus source verification over plain API models when showing citations to users.
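One way to act on these steps is to route every statement through a simple gate before display. The sketch below assumes each statement already carries two flags produced upstream (URL validity and verifier support); `CheckedStatement` and `triage` are hypothetical names, and the rule "anything not fully verified goes to review" is an illustrative policy, not one prescribed by the paper.

```python
from typing import Dict, List, NamedTuple

class CheckedStatement(NamedTuple):
    text: str
    url_valid: bool   # the cited URL resolved to a real page
    supported: bool   # the verifier judged that the source backs the statement

def triage(statements: List[CheckedStatement]) -> Dict[str, List[str]]:
    """Split statements into those safe to show and those needing human review."""
    queues: Dict[str, List[str]] = {"show": [], "review": []}
    for s in statements:
        bucket = "show" if (s.url_valid and s.supported) else "review"
        queues[bucket].append(s.text)
    return queues

toy_queues = triage([
    CheckedStatement("Claim with a valid, supporting source.", True, True),
    CheckedStatement("Claim with a valid but non-supporting source.", True, False),
    CheckedStatement("Claim with a dead link.", False, False),
])
```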

Reproducibility

Data URLs

  • Authors state they open-source the curated dataset (see paper for link)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SourceCheckup measures whether a cited source supports a statement, not whether the statement is objectively true.
  • Reference pages come from only three web sources (Mayo Clinic, UpToDate, Reddit), which limits coverage.
  • The GPT-4 verifier can still be biased or wrong despite its high agreement with doctors, so human spot checks remain necessary.

When Not To Use

  • Do not treat 'supported by source' as guaranteed clinical correctness without clinician review.
  • Do not apply the pipeline as-is to private EHR data or paywalled sources without adapting access rules.
  • Do not rely solely on the automated verifier for high-stakes medical decisions.

Failure Modes

  • LLM returns hallucinated or malformed URLs (invalid or non-existent pages).
  • RAG returns a valid URL that does not actually contain the claimed evidence.
  • Automated verifier mislabels borderline or implicit evidence; disagreement with experts can occur.
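The first failure mode, malformed URLs, can be partly caught before any network request. The sketch below is a syntactic pre-filter only, using Python's standard `urllib.parse`; it cannot tell whether a page exists or contains the claimed evidence, so it does not reproduce the paper's URL validity metric. `looks_well_formed` is a hypothetical helper, and requiring a dot in the hostname is a heuristic assumption.

```python
from urllib.parse import urlsplit

def looks_well_formed(url: str) -> bool:
    """Cheap syntax check: http(s) scheme plus a dotted hostname."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some inputs outright, e.g. unbalanced IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(host) and "." in host

candidates = [
    "https://www.mayoclinic.org/diseases-conditions",
    "www.mayoclinic.org/page",   # missing scheme
    "https:///missing-host",     # empty hostname
    "http://[broken",            # unbalanced bracket
]
well_formed = [u for u in candidates if looks_well_formed(u)]
```

URLs that pass this filter still need a fetch to confirm the page exists and a verifier pass to confirm it supports the statement.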

Core Entities

Models

  • GPT-4 (RAG)
  • GPT-4 (API)
  • Claude v2.1
  • Mistral Medium
  • Gemini Pro
  • GPT-4 (as Source Verification model)

Metrics

  • Source URL Validity
  • Statement-level support
  • Response-level support
  • Percent of URLs not supporting any statement
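The four metrics can be illustrated with a small calculation. The definitions below are assumptions inferred from the metric names and the summary text: a statement counts as supported if at least one cited URL backs it, and a response counts as supported only if every statement in it is supported. The `Response` type, `summarize`, and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Response:
    cited_urls: Dict[str, bool]   # URL -> True if it resolved to a real page
    supporting: List[Set[str]]    # per statement: URLs judged to support it

def summarize(responses: List[Response]) -> Dict[str, float]:
    n_urls = n_valid = n_unused = n_statements = n_supported = n_full = 0
    for r in responses:
        used = set().union(*r.supporting)  # URLs that support at least one statement
        n_urls += len(r.cited_urls)
        n_valid += sum(r.cited_urls.values())
        n_unused += sum(1 for u in r.cited_urls if u not in used)
        n_statements += len(r.supporting)
        n_supported += sum(1 for s in r.supporting if s)
        n_full += all(bool(s) for s in r.supporting)
    if not (responses and n_urls and n_statements):
        raise ValueError("need at least one response, URL, and statement")
    return {
        "url_validity": n_valid / n_urls,
        "statement_support": n_supported / n_statements,
        "response_support": n_full / len(responses),
        "urls_supporting_nothing": n_unused / n_urls,
    }

toy_metrics = summarize([
    # Both statements supported by "a"; "b" is cited but supports nothing.
    Response({"a": True, "b": True}, [{"a"}, {"a"}]),
    # Second statement has no supporting URL; "d" is an invalid link.
    Response({"c": True, "d": False}, [{"c"}, set()]),
])
```

In the toy data, three of four statements are supported but only one of two responses is fully supported, which shows why response-level support is never higher than statement-level support under these definitions.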

Datasets

  • Mayo Clinic web pages
  • UpToDate pages
  • Reddit r/AskDocs
  • SourceCheckup 1,200-question corpus
  • Clinician-annotated subset (N=284)

Benchmarks

  • SourceCheckup