Practical checklist to measure, detect, and reduce LLM hallucinations in healthcare

September 26, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.2

Cost Impact Score

0.3

Citation Count

14

Authors

Muhammad Aurangzeb Ahmad, Ilker Yaramis, Taposh Dutta Roy


Why It Matters For Business

In healthcare, LLM mistakes can harm patients and create liability. Measuring and mitigating hallucinations is necessary before deploying models in clinical workflows.

Summary TLDR

This short survey focuses on AI 'hallucinations'—when language models invent or misstate facts—and how they block safe use in healthcare. It reviews causes (bad sources, probabilistic text sampling, biased training, missing context), ways to measure hallucinations (human annotation, automated checks, self-check sampling), and mitigation steps (human-in-the-loop, fine-tuning, better prompts, input validation, adversarial training, memory/knowledge augmentation, and benchmark audits). The paper argues human oversight is likely required for high-risk clinical tasks and flags benchmark and evaluation blind spots that can amplify errors.

Problem Statement

Large language models produce plausible but incorrect statements. In healthcare, these hallucinations can mislead diagnosis, treatment, or advice. The paper asks: how do we measure, validate, and reduce hallucinations so LLMs become trustworthy enough for clinical use?

Main Contribution

Survey of causes, measurement methods, and mitigation strategies for hallucinations in healthcare LLMs.

Practical taxonomy of evaluation options: model-access checks, multiple-output self-checking, human annotation, and automatic scorers.

Checklist of mitigation approaches: human-in-the-loop, fine-tuning, prompt design, algorithmic fixes, input validation, adversarial training, memory augmentation, and benchmark auditing.

Discussion of evaluation blind spots: MCQ benchmarks misrepresent clinical uncertainty and contaminated benchmarks can amplify hallucinations.
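
One simple way to audit the benchmark-reuse blind spot is to check whether benchmark items already appear in a training corpus. The sketch below uses word n-gram overlap, which is an assumption of this example and not a method taken from the survey; the function names, the n-gram size of 5, and the sample texts are all hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in the text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item: str, corpus_docs: List[str], n: int = 5) -> float:
    """Share of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Made-up corpus and benchmark items, for illustration only.
corpus = [
    "a 45 year old man presents with chest pain radiating to the left arm",
    "unrelated note about clinic opening hours on weekdays",
]
leaked = "A 45 year old man presents with chest pain radiating to the left arm"
fresh = "a 30 year old woman reports two days of mild headache"

print(overlap_fraction(leaked, corpus))  # high overlap suggests reuse
print(overlap_fraction(fresh, corpus))   # low overlap suggests an unseen item
```

Exact n-gram matching misses paraphrased items, so a low score here is weak evidence of a clean benchmark.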

Key Findings

One study cited in the survey found that roughly 25% of generated summaries contained hallucinated content.

Numbers: 25% hallucinated summaries

Hallucinations are a core, widely acknowledged limitation of LLMs and arise from how token generation works.

Human evaluation and fine-grained fact scoring (e.g., FActScore) are common methods to detect factual errors.

Benchmarks and training sets can contain errors that increase hallucination risk if added back to training data.

Model choice matters: GPT-4/ChatGPT catch self-contradictions better than some smaller open models (e.g., Vicuna-13B).

Fine-tuning can reduce hallucination risk but does not guarantee improvement and can be costly.
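
The fine-grained fact scoring mentioned above can be sketched as a supported-fact fraction: split an output into atomic facts, check each one against a trusted source, and report the share that is supported. This is a minimal illustration of the idea behind FActScore, not its reference implementation. The `is_supported` callback, the `REFERENCE` set, and the example facts are hypothetical stand-ins; a real verifier would rely on retrieval plus clinician or model judgement.

```python
from typing import Callable, Iterable


def fact_precision(atomic_facts: Iterable[str],
                   is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts that the verifier marks as supported."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("need at least one atomic fact to score")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Hypothetical reference set standing in for a curated knowledge source.
REFERENCE = {
    "metformin is a first-line treatment for type 2 diabetes",
    "warfarin is an anticoagulant",
}


def in_reference(fact: str) -> bool:
    # Exact match after normalisation; deliberately crude for the sketch.
    return fact.strip().lower() in REFERENCE


facts = [
    "Metformin is a first-line treatment for type 2 diabetes",
    "Warfarin is an anticoagulant",
    "Warfarin is an antibiotic",  # unsupported, counts against the score
]
score = fact_precision(facts, in_reference)
print(f"supported fraction: {score:.2f}")
```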

Results

share of hallucinated content in summarization

Value: 25% of generated summaries

model robustness to self-contradiction

Value: GPT-4/ChatGPT outperform Vicuna-13B

Baseline: Vicuna-13B

What To Try In 7 Days

Run a small audit: sample model outputs on real tasks and spot-check facts with clinicians.

Enable model self-checking: sample multiple answers and flag divergent responses for review.

Add a simple human-in-the-loop gate for high-risk outputs (triage a subset for expert review).
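
The self-checking and review-gate steps above can be sketched together: sample several answers to the same prompt, measure how much they agree, and route the prompt to a human when agreement is low. The sketch assumes the sampled answers are already available as strings. Word-set Jaccard overlap is a crude lexical proxy chosen for brevity, and the 0.6 threshold and example answers are hypothetical; a production check would more likely compare meaning than wording.

```python
from itertools import combinations
from typing import List


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; 1.0 means identical vocabularies."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_agreement(samples: List[str]) -> float:
    """Average pairwise similarity across all sampled answers."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    pairs = list(combinations(samples, 2))
    return sum(token_jaccard(a, b) for a, b in pairs) / len(pairs)


def needs_review(samples: List[str], threshold: float = 0.6) -> bool:
    """Flag the prompt for human review when the samples diverge."""
    return mean_agreement(samples) < threshold


# Made-up sampled answers, for illustration only.
consistent = [
    "start metformin 500 mg once daily",
    "start metformin 500 mg once daily",
    "start metformin 500 mg daily",
]
divergent = [
    "start metformin 500 mg once daily",
    "prescribe insulin glargine 10 units at bedtime",
    "no medication is needed at this stage",
]

print(needs_review(consistent))  # agreement is high, so no flag
print(needs_review(divergent))   # answers disagree, so flag for review
```

Agreement across samples does not prove correctness: a model can be consistently wrong, so this gate complements the clinician spot-check and does not replace it.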

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Survey format: no new experiments or quantitative benchmarks provided.
  • Recommendations are broad; effectiveness depends on specific model and clinical task.
  • Many mitigation strategies (fine-tuning, memory augmentation) are described but lack cost/scale analysis.

When Not To Use

  • Do not rely on LLM outputs without human review for high-stakes clinical decisions.
  • Avoid treating MCQ benchmark success as proof of real-world clinical safety.

Failure Modes

  • Model fabricates plausible but false medical facts.
  • Benchmarks or training data contain errors that amplify hallucinations when reused.
  • Human-in-the-loop scaling limits: oversight may not scale to high-volume automation.
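
One way to work within the oversight scaling limit above is to spend a fixed review budget on the riskiest outputs and hold, not release, any high-risk output that does not fit. The sketch below is an illustration under stated assumptions, not a procedure from the survey: the `risk` score, the `review_capacity` budget, and the 0.8 hold threshold are all hypothetical, and releasing unreviewed low-risk outputs is only reasonable for low-stakes tasks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Output:
    output_id: str
    risk: float  # hypothetical risk score in [0, 1]; higher means riskier


def triage(outputs: List[Output], review_capacity: int,
           hold_threshold: float = 0.8) -> Tuple[List[str], List[str], List[str]]:
    """Split outputs into (review, held, released) lists of ids.

    The riskiest outputs go to reviewers, up to review_capacity. Anything at
    or above hold_threshold that does not fit is held instead of released.
    """
    if review_capacity < 0:
        raise ValueError("review_capacity must be non-negative")
    ranked = sorted(outputs, key=lambda o: o.risk, reverse=True)
    review = [o.output_id for o in ranked[:review_capacity]]
    rest = ranked[review_capacity:]
    held = [o.output_id for o in rest if o.risk >= hold_threshold]
    released = [o.output_id for o in rest if o.risk < hold_threshold]
    return review, held, released


# Made-up batch with one reviewer slot available.
batch = [Output("a", 0.95), Output("b", 0.85), Output("c", 0.40), Output("d", 0.10)]
review, held, released = triage(batch, review_capacity=1)
print(review, held, released)
```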

Core Entities

Models

  • GPT-3
  • GPT-4
  • ChatGPT
  • GatorTron
  • Med-PaLM 2
  • Flan-PaLM
  • Vicuna-13B

Metrics

  • precision
  • recall
  • F1
  • perplexity
  • cross-entropy
  • ROUGE
  • BLEU
  • METEOR
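
As an illustration of how the first three metrics apply here, the sketch below scores a hallucination detector against clinician labels, treating "hallucinated" as the positive class. The function name and the example labels are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[bool],
                        predicted: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with True meaning 'hallucinated'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: clinician judgement vs. automated detector flags.
clinician = [True, True, False, False, True]
detector = [True, False, True, False, True]

p, r, f1 = precision_recall_f1(clinician, detector)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In a clinical setting recall on the hallucinated class usually matters most, since a missed hallucination reaches the user unchecked.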

Datasets

  • USMLE
  • CMExam
  • Japanese medical licensing exam datasets

Benchmarks

  • TruthfulQA
  • FActScore
  • knowledge-grounded conversational benchmarks