Practical checklist to measure, detect, and reduce LLM hallucinations in healthcare

September 26, 2023 | 6 min

Overview

Decision Snapshot: Needs Validation

This is a focused survey with literature-backed recommendations, not a new method; advice is actionable but evidence is mixed and largely qualitative.

Citations: 14

Evidence Strength: 0.40

Confidence: 0.60

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 1/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 20%

Authors

Muhammad Aurangzeb Ahmad, Ilker Yaramis, Taposh Dutta Roy

Why It Matters For Business

In healthcare, LLM mistakes can harm patients and create liability. Measuring and mitigating hallucinations is necessary before deploying models in clinical workflows.

Summary TLDR

This short survey focuses on AI 'hallucinations' (cases where language models invent or misstate facts) and how they block safe use in healthcare. It reviews causes (bad sources, probabilistic text sampling, biased training, missing context), ways to measure hallucinations (human annotation, automated checks, self-check sampling), and mitigation steps (human-in-the-loop review, fine-tuning, better prompts, input validation, adversarial training, memory/knowledge augmentation, and benchmark audits). The paper argues that human oversight is likely required for high-risk clinical tasks and flags blind spots in benchmarks and evaluation that can amplify errors.

Problem Statement

Large language models produce plausible but incorrect statements. In healthcare, these hallucinations can mislead diagnosis, treatment, or advice. The paper asks: how do we measure, validate, and reduce hallucinations so LLMs become trustworthy enough for clinical use?

Main Contribution

Survey of causes, measurement methods, and mitigation strategies for hallucinations in healthcare LLMs.

Practical taxonomy of evaluation options: model-access checks, multiple-output self-checking, human annotation, and automatic scorers.

Key Findings

A summarization study by Falke et al. [17] found that roughly 25% of generated summaries contained hallucinated content.

Numbers: 25% hallucinated summaries

Practical Use: Treat LLM summaries as partially unreliable; add fact-checking or human review before clinical use.

Evidence Ref: [17] Falke et al. (summarization study)
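When a team repeats this kind of measurement on its own outputs, a small audit sample gives only a rough rate, so it helps to report an interval alongside the point estimate. The sketch below computes a Wilson score interval for the share of flagged summaries. The function name and the audit counts (25 of 100) are illustrative assumptions, not figures or code from the survey.

```python
from __future__ import annotations

import math


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= flagged <= total:
        raise ValueError("flagged must be between 0 and total")
    p = flagged / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical audit: reviewers flagged 25 of 100 sampled summaries.
low, high = wilson_interval(25, 100)
print(f"observed rate {25 / 100:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

With only 100 samples the interval is wide, which is one reason a single audit figure should not be read as a precise hallucination rate.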

Hallucinations are a core, widely acknowledged limitation of LLMs and arise from how token generation works.

Practical Use: Expect some fabricated outputs by default; design systems to flag uncertainty and verify facts.

Evidence Ref: [37] OpenAI statement; survey discussion

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
share of hallucinated content in summarization | 25% of generated summaries | - | - | summarization study in [17] | Falke et al. reported ~25% hallucinated content in summaries | [17]
model robustness to self-contradiction | GPT-4/ChatGPT outperform Vicuna-13B | Vicuna-13B | - | self-contradiction detection study | Survey cites [31] comparing models on self-contradictions | [31]

What To Try In 7 Days

Run a small audit: sample model outputs on real tasks and spot-check facts with clinicians.

Enable model self-checking: sample multiple answers and flag divergent responses for review.

Add a simple human-in-the-loop gate for high-risk outputs (triage a subset for expert review).
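The self-checking and gating steps above can be sketched together. The example assumes the answers have already been sampled from the model (for instance by calling it several times with a nonzero temperature; that call is not shown). It uses mean pairwise token overlap as a crude stand-in for agreement. The function names, the Jaccard measure, and the 0.6 threshold are all illustrative assumptions; the survey does not prescribe a specific similarity measure, and a real deployment would likely need a semantic comparison and a threshold tuned with clinicians.

```python
from __future__ import annotations

import re
from itertools import combinations


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a response."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (1.0 = identical token sets)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def agreement_score(samples: list[str]) -> float:
    """Mean pairwise Jaccard similarity across sampled answers."""
    if len(samples) < 2:
        raise ValueError("need at least two sampled answers")
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def needs_review(samples: list[str], high_risk: bool, threshold: float = 0.6) -> bool:
    """Route to a human when the task is high-risk or the samples diverge."""
    return high_risk or agreement_score(samples) < threshold


# Made-up strings for illustration only; not clinical content.
divergent = [
    "the value reported in the note is 500 units",
    "no value was recorded for this patient",
    "the chart lists 50 units at admission",
]
print(round(agreement_score(divergent), 2), needs_review(divergent, high_risk=False))
```

Lexical overlap will miss paraphrases and can pass answers that agree on wording but are all wrong, so this gate complements clinician spot-checks rather than replacing them.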

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Survey format: no new experiments or quantitative benchmarks provided.

Recommendations are broad; effectiveness depends on specific model and clinical task.

When Not To Use

Do not rely on LLM outputs without human review for high-stakes clinical decisions.

Avoid treating MCQ benchmark success as proof of real-world clinical safety.

Failure Modes

Model fabricates plausible but false medical facts.

Benchmarks or training data contain errors that amplify hallucinations when reused.

Core Entities

Models

GPT-3, GPT-4, ChatGPT, GatorTron, Med-PaLM 2, Flan-PaLM, Vicuna-13B

Metrics

precision, recall, F1, perplexity, cross-entropy, ROUGE, BLEU, METEOR
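Of the metrics listed above, precision, recall, and F1 apply directly when an automatic hallucination detector is compared against human annotation. The sketch below shows one way to compute them, assuming each output has a human label (True = hallucinated) and a detector flag. The function name and the sample labels are hypothetical and do not come from the survey.

```python
from __future__ import annotations


def detector_scores(human_labels: list[bool], detector_flags: list[bool]) -> dict[str, float]:
    """Precision, recall, and F1 of detector flags against human labels.

    True means "hallucinated" in both lists.
    """
    if len(human_labels) != len(detector_flags):
        raise ValueError("label and flag lists must have the same length")
    pairs = list(zip(human_labels, detector_flags))
    tp = sum(1 for h, d in pairs if h and d)        # correctly flagged
    fp = sum(1 for h, d in pairs if not h and d)    # flagged but fine
    fn = sum(1 for h, d in pairs if h and not d)    # missed hallucination
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations for five outputs.
human = [True, True, False, False, True]
flags = [True, False, True, False, True]
print(detector_scores(human, flags))
```

In a clinical setting recall usually deserves more weight than precision, since a missed hallucination is costlier than an unnecessary review.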

Datasets

USMLE, CMExam, Japanese medical licensing exam datasets

Benchmarks

TruthfulQA, FActScore, knowledge-grounded conversational benchmarks