Overview
Production Readiness
0.3
Novelty Score
0.55
Cost Impact Score
0.25
Citation Count
17
Why It Matters For Business
Adding an iterative generate-score-refine step reduces irrelevant and factually inconsistent medical answers, lowering risk and improving trust for AI assistants used in healthcare workflows.
Summary TLDR
The authors study hallucination in medical generative question answering and propose an interactive self-reflection loop: generate background knowledge, score it, refine it, then generate answers and repeat until thresholds pass. Tested on five LLMs (Vicuna, Alpaca-LoRA, ChatGPT, MedAlpaca, Robin-medical) and five medical QA datasets, the loop increases Med-NLI entailment scores and reduces human-annotated hallucination categories. Ablations show the scoring, explicit aspect prompts, and reporting numeric scores each help. The method is retrieval-free, iterative, and intended as a complementary mitigation step, not a full fix.
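The generate, score, refine cycle summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm`, `score_knowledge`, and `score_answer` are hypothetical callables the caller supplies, the prompt wording and the 0.8 threshold are placeholders, and the cap on rounds is an added safeguard against endless refinement.

```python
from typing import Callable, Tuple

def refine_until(text: str,
                 score_fn: Callable[[str], float],
                 refine_fn: Callable[[str, float], str],
                 threshold: float,
                 max_rounds: int) -> str:
    """Re-prompt until the score passes the threshold or the round budget is spent."""
    for _ in range(max_rounds):
        score = score_fn(text)
        if score >= threshold:
            break
        text = refine_fn(text, score)
    return text

def answer_with_reflection(question: str,
                           llm: Callable[[str], str],
                           score_knowledge: Callable[[str, str], float],
                           score_answer: Callable[[str, str], float],
                           threshold: float = 0.8,   # placeholder value
                           max_rounds: int = 3) -> Tuple[str, str]:
    # Stage 1: generate background knowledge, then score and refine it.
    knowledge = llm(f"Provide background knowledge for: {question}")
    knowledge = refine_until(
        knowledge,
        lambda k: score_knowledge(question, k),
        # The numeric score is reported back to the model in the refine prompt.
        lambda k, s: llm(f"Factuality score: {s:.2f}. Refine this knowledge.\n"
                         f"Question: {question}\nKnowledge: {k}"),
        threshold, max_rounds)

    # Stage 2: answer from the refined knowledge, then score and refine the answer.
    answer = llm(f"Knowledge: {knowledge}\nAnswer the question: {question}")
    answer = refine_until(
        answer,
        lambda a: score_answer(question, a),
        lambda a, s: llm(f"Consistency score: {s:.2f}. Refine this answer.\n"
                         f"Question: {question}\nKnowledge: {knowledge}\nAnswer: {a}"),
        threshold, max_rounds)
    return knowledge, answer
```

Passing the model and scorers in as plain callables keeps the sketch independent of any particular LLM API or entailment model.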
Problem Statement
LLMs often produce plausible but false or irrelevant medical answers (hallucinations). This is risky in healthcare. The paper asks: how often do popular LLMs hallucinate on medical QA, why (e.g., rare topics), and can an iterative self-reflection loop reduce hallucinations without external retrieval?
Main Contribution
A systematic analysis of hallucinated answers across five LLMs and five medical QA datasets.
An interactive self-reflection method that repeatedly generates, scores, and refines background knowledge and answers.
Empirical evidence (automatic metrics and human evaluation) that the loop raises entailment and reduces hallucination; ablations identify key parts of the loop.
Key Findings
Iterative self-reflection raises sample-level Med-NLI entailment scores across models on PubMedQA.
Weaker models show the largest relative gains: Alpaca-LoRA's sample-level Med-NLI score on PubMedQA rose from 0.0940 to 0.4640.
Human evaluation shows lower query-inconsistency and tangential responses after applying the loop.
Ablation shows scoring, explicit aspect prompts, and reporting numeric scores each improve factuality and consistency.
Low-frequency topics correlate with higher incidence of problematic answers.
Results
MedNLI (sample)
Human eval — Query-Inconsistent
Human eval — Tangentiality
Who Should Care
What To Try In 7 Days
Implement a simple loop: generate background facts, score their factuality, and re-prompt the model to refine them until the score passes a threshold.
Add an entailment check (Med-NLI or sentence embedding similarity) between answer and question/context.
Run a small human review on 50 examples to confirm reductions in query-inconsistency and tangential replies.
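For the similarity-based variant of the entailment check in the steps above, a rough sketch might look like this. It assumes the answer and the question (or context) have already been embedded into vectors by some sentence-embedding model of your choice; the function names and the 0.7 threshold are illustrative. Cosine similarity is a cheaper proxy for topical relevance and does not replace an NLI model.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_on_topic(answer_vec: Sequence[float],
                question_vec: Sequence[float],
                threshold: float = 0.7) -> bool:
    """Flag answers whose embedding drifts too far from the question's (hypothetical gate)."""
    return cosine_similarity(answer_vec, question_vec) >= threshold
```

Answers that fail the gate can be sent back through the refinement loop or routed to human review.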
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Does not fully eliminate hallucinations; the model can still generate ungrounded claims.
- Evaluated only on English medical QA; generalization to other languages and domains is untested.
- Iterative loops increase inference time and cost; not optimized for low-latency production.
- Current thresholds and in-context demos are tuned per dataset and model; may need retuning.
When Not To Use
- In high-stakes clinical decisions without additional verification and expert oversight.
- Where strict low-latency responses are required.
- For languages or domains not covered by evaluation.
Failure Modes
- Model refines toward confident but still incorrect knowledge (self-reinforced hallucination).
- Mis-set score thresholds cause endless or insufficient refinement.
- Rare-topic answers remain unreliable due to weak parametric knowledge.
Core Entities
Models
- Vicuna-7B
- Alpaca-LoRA
- ChatGPT
- MedAlpaca-7B
- Robin-medical-7B
- GPT-4 (casual examples)
Metrics
- Med-NLI (sample and sentence)
- CTRLEval (consistency)
- unigram F1
- ROUGE-L
- human annotation (query-inconsistency, tangentiality, fact-consistency)
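Of the metrics listed, unigram F1 is simple enough to compute directly. The sketch below uses the common token-overlap definition (harmonic mean of unigram precision and recall) with lowercasing and whitespace tokenization; the paper's exact tokenization and normalization are not stated here, so treat those details as assumptions.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring "aspirin reduces fever" against "aspirin reduces pain and fever" gives precision 3/3 and recall 3/5, so F1 is 0.75.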
Datasets
- PubMedQA
- MedQuAD
- MEDIQA2019
- LiveMedQA2017
- MASH-QA

