Overview
This is a focused survey with literature-backed recommendations, not a new method; advice is actionable but evidence is mixed and largely qualitative.
Citations: 14
Evidence Strength: 0.40
Confidence: 0.60
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 1/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/2
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 20%
Why It Matters For Business
In healthcare, LLM mistakes can harm patients and create liability. Measuring and mitigating hallucinations is necessary before deploying models in clinical workflows.
Who Should Care
Summary TLDR
This short survey focuses on AI 'hallucinations'—when language models invent or misstate facts—and how they block safe use in healthcare. It reviews causes (bad sources, probabilistic text sampling, biased training, missing context), ways to measure hallucinations (human annotation, automated checks, self-check sampling), and mitigation steps (human-in-the-loop, fine-tuning, better prompts, input validation, adversarial training, memory/knowledge augmentation, and benchmark audits). The paper argues human oversight is likely required for high-risk clinical tasks and flags benchmark and evaluation blind spots that can amplify errors.
Problem Statement
Large language models produce plausible but incorrect statements. In healthcare, these hallucinations can mislead diagnosis, treatment, or advice. The paper asks: how do we measure, validate, and reduce hallucinations so LLMs become trustworthy enough for clinical use?
Main Contribution
Survey of causes, measurement methods, and mitigation strategies for hallucinations in healthcare LLMs.
Practical taxonomy of evaluation options: model-access checks, multiple-output self-checking, human annotation, and automatic scorers.
Key Findings
Falke et al. [17] reported that roughly 25% of generated summaries contained hallucinated content.
Hallucinations are a core, widely acknowledged limitation of LLMs and arise from the probabilistic way models sample each next token.
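A headline rate such as the roughly 25% figure is a proportion measured on a finite annotated sample, so a local audit should report its uncertainty alongside the point estimate. The sketch below computes a Wilson score interval for an audited hallucination rate. The function name `wilson_interval` and the counts in the usage lines are hypothetical illustrations, not figures from the survey or from [17].

```python
import math
from typing import Tuple


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the share of outputs annotators flagged.

    flagged: number of audited outputs marked as containing a hallucination
    total:   number of audited outputs
    z:       normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if total <= 0 or not 0 <= flagged <= total:
        raise ValueError("require total > 0 and 0 <= flagged <= total")
    p = flagged / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical audit: 25 of 100 spot-checked outputs were flagged.
    low, high = wilson_interval(flagged=25, total=100)
    print(f"observed rate 25.0%, interval {low:.1%} to {high:.1%}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves sensibly for the small samples and near-zero counts that a short clinician spot-check is likely to produce.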
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| share of hallucinated content in summarization | 25% of generated summaries | — | — | summarization study in [17] | Falke et al. reported ~25% hallucinated content in summaries | [17] |
| model robustness to self-contradiction | GPT-4/ChatGPT outperform Vicuna-13B | Vicuna-13B | — | self-contradiction detection study | Survey cites [31] comparing models on self-contradictions | [31] |
What To Try In 7 Days
Run a small audit: sample model outputs on real tasks and spot-check facts with clinicians.
Enable model self-checking: sample multiple answers and flag divergent responses for review.
Add a simple human-in-the-loop gate for high-risk outputs (triage a subset for expert review).
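The self-checking and review-gate steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure from the surveyed papers: `sample_fn` is a hypothetical callable that wraps whatever model is in use and returns one sampled answer per call, the 0.6 threshold is an arbitrary placeholder to be tuned on audited data, and lexical similarity from the standard library's `difflib` stands in for a proper semantic or entailment-based consistency check.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List, Tuple


def pairwise_agreement(answers: List[str]) -> float:
    """Mean pairwise lexical similarity in [0, 1] across sampled answers."""
    if len(answers) < 2:
        raise ValueError("need at least two sampled answers to compare")
    scores = [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in combinations(answers, 2)
    ]
    return sum(scores) / len(scores)


def self_check(
    prompt: str,
    sample_fn: Callable[[str], str],  # hypothetical wrapper around the model
    n_samples: int = 5,
    threshold: float = 0.6,  # assumed cut-off; tune on audited examples
) -> Tuple[bool, float, List[str]]:
    """Sample several answers and flag the prompt when they diverge.

    Returns (needs_review, agreement, answers). needs_review=True means the
    samples disagree enough that a human should look before the output is used.
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    agreement = pairwise_agreement(answers)
    return agreement < threshold, agreement, answers


if __name__ == "__main__":
    # Stub model for demonstration only: cycles through canned answers.
    canned = iter([
        "Start with 5 mg once daily.",
        "Start with 50 mg twice daily.",
        "No dose adjustment is needed.",
    ])
    needs_review, agreement, _ = self_check(
        "What is the starting dose?", lambda _prompt: next(canned), n_samples=3
    )
    print(f"agreement={agreement:.2f} needs_review={needs_review}")
```

High agreement does not prove correctness, since a model can repeat the same wrong answer consistently. The flag is best treated as a way to prioritise which outputs reach the expert reviewer first, with high-risk outputs routed to review regardless of the score.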
Risks & Boundaries
Limitations
Survey format: no new experiments or quantitative benchmarks provided.
Recommendations are broad; effectiveness depends on specific model and clinical task.
When Not To Use
Do not rely on LLM outputs without human review for high-stakes clinical decisions.
Avoid treating MCQ benchmark success as proof of real-world clinical safety.
Failure Modes
Model fabricates plausible but false medical facts.
Benchmarks or training data contain errors that amplify hallucinations when reused.

