Use an iterative generate-score-refine loop to cut hallucinated answers from medical LLMs

October 10, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent automatic and human-evaluated improvements across five datasets and several models, but it remains retrieval-free and not yet safe for unvetted clinical deployment.

Citations: 17

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 30%

Novelty: 55%

Authors

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung

Links

Abstract / PDF / Code

Why It Matters For Business

Adding an iterative generate-score-refine step reduces irrelevant and factually inconsistent medical answers, lowering risk and improving trust for AI assistants used in healthcare workflows.

Summary TLDR

The authors study hallucination in medical generative question answering and propose an interactive self-reflection loop: generate background knowledge, score it, refine it, then generate an answer, score and refine it the same way, and repeat until both scores pass their thresholds. Tested on five LLMs (Vicuna, Alpaca-LoRA, ChatGPT, MedAlpaca, Robin-medical) and five medical QA datasets, the loop increases Med-NLI entailment scores and reduces the rate of each human-annotated hallucination category. Ablations show that the scoring step, explicit aspect prompts, and reporting numeric scores each help. The method is retrieval-free and iterative, and it is intended as a complementary mitigation step, not a full fix.

Problem Statement

LLMs often produce plausible but false or irrelevant medical answers (hallucinations), which is risky in healthcare. The paper asks three questions: how often do popular LLMs hallucinate on medical QA, what drives those hallucinations (e.g., rare topics), and can an iterative self-reflection loop reduce them without external retrieval?

Main Contribution

A systematic analysis of hallucinated answers across five LLMs and five medical QA datasets.

An interactive self-reflection method that repeatedly generates, scores, and refines background knowledge and answers.
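The generate-score-refine cycle can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `reflect`, `score`, and `refine` are hypothetical names standing in for LLM prompts, and the threshold and `max_rounds` cap are assumed values. The same helper could be applied first to the background knowledge and then to the answer.

```python
from typing import Callable, Tuple


def reflect(
    draft: str,
    score: Callable[[str], float],
    refine: Callable[[str, float], str],
    threshold: float,
    max_rounds: int = 3,
) -> Tuple[str, float, int]:
    """Score a draft and refine it until the score reaches the threshold.

    Returns (final_text, final_score, rounds_used). The callables are
    hypothetical stand-ins for model prompts, not the paper's API.
    """
    current = draft
    current_score = score(current)
    rounds = 0
    while current_score < threshold and rounds < max_rounds:
        # Feed the numeric score back into the refinement prompt.
        current = refine(current, current_score)
        current_score = score(current)
        rounds += 1
    return current, current_score, rounds


# Toy stand-ins so the sketch runs without a model (assumptions, not real scorers).
def toy_score(text: str) -> float:
    return min(1.0, len(text.split()) / 10)


def toy_refine(text: str, last_score: float) -> str:
    return text + " with more supporting detail"


final_text, final_score, rounds_used = reflect(
    "aspirin inhibits COX enzymes", toy_score, toy_refine, threshold=0.8
)
```

The `max_rounds` cap is a design choice of this sketch: it keeps the loop bounded even when the scorer never reaches the threshold.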

Key Findings

Iterative self-reflection raises Med-NLI sample entailment scores across models on PubMedQA.

Numbers: Vicuna: 0.4684 -> 0.6380 (+0.1696); ChatGPT: 0.5850 -> 0.6824 (+0.0974)

Practical Use: Add a generate-score-refine loop to improve answer entailment and reduce hallucinations in medical QA systems.

Evidence Ref: Table 2 (PubMedQA, Med-NLI sample)

Weaker models gain the most in relative terms: Alpaca-LoRA's Med-NLI sample score on PubMedQA rose from 0.0940 to 0.4640, roughly a fivefold increase.

Numbers: Alpaca-LoRA: 0.0940 -> 0.4640 (+0.3700)

Practical Use: Self-reflection is especially valuable for smaller or less medical-finetuned models.

Evidence Ref: Table 2 (PubMedQA)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Med-NLI (sample) | Vicuna: 0.6380 (with loop) | Vicuna: 0.4684 (baseline) | +0.1696 | PubMedQA | Table 2 (PubMedQA) | Table 2
Med-NLI (sample) | Alpaca-LoRA: 0.4640 (with loop) | Alpaca-LoRA: 0.0940 (baseline) | +0.3700 | PubMedQA | Table 2 (PubMedQA) | Table 2
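The reported deltas can be re-derived from the baseline and with-loop scores. The short sketch below does so for the three PubMedQA figures quoted on this page and also prints the ratio to baseline, which shows how a smaller absolute starting point translates into a larger relative gain. The `summarize` helper is a hypothetical name used only for illustration.

```python
# (baseline, with_loop) Med-NLI sample scores on PubMedQA, as quoted on this page.
scores = {
    "Vicuna": (0.4684, 0.6380),
    "ChatGPT": (0.5850, 0.6824),
    "Alpaca-LoRA": (0.0940, 0.4640),
}


def summarize(baseline: float, with_loop: float):
    """Return (absolute delta, ratio to baseline), rounded for display."""
    delta = round(with_loop - baseline, 4)
    ratio = round(with_loop / baseline, 2)
    return delta, ratio


for model, (baseline, with_loop) in scores.items():
    delta, ratio = summarize(baseline, with_loop)
    print(f"{model}: +{delta:.4f} absolute, {ratio:.2f}x baseline")
```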

What To Try In 7 Days

Implement a simple loop: generate background facts, score their factuality, and re-prompt the model to refine them until the score passes a threshold.

Add an entailment check (Med-NLI or sentence embedding similarity) between answer and question/context.

Run a small human review on 50 examples to confirm reductions in query-inconsistency and tangential replies.
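One way to wire the entailment check into a pipeline is a gate with a pluggable scorer. In the sketch below, `passes_gate` and `token_cosine` are hypothetical names. `token_cosine` is a crude bag-of-words stand-in (it is neither Med-NLI nor a sentence-embedding model) so the example runs without dependencies, and the 0.5 threshold is an arbitrary assumption that would need tuning on real data.

```python
import math
import re
from collections import Counter
from typing import Callable


def token_cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (placeholder scorer)."""
    counts_a = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    counts_b = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def passes_gate(
    answer: str,
    context: str,
    scorer: Callable[[str, str], float] = token_cosine,
    threshold: float = 0.5,  # assumed value, not from the paper
) -> bool:
    """Accept the answer only if the scorer rates it at or above the threshold."""
    return scorer(answer, context) >= threshold


question = "Does metformin lower blood glucose?"
on_topic = passes_gate("Metformin lowers blood glucose", question)
off_topic = passes_gate("The weather is sunny", question)
```

Swapping `scorer` for an entailment model or an embedding-based similarity function leaves the gate logic unchanged.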

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Does not fully eliminate hallucinations; the model can still generate ungrounded claims.

Evaluated only on English medical QA; generalization to other languages and domains is untested.

When Not To Use

In high-stakes clinical decisions without additional verification and expert oversight.

Where strict low-latency responses are required.

Failure Modes

Model refines toward confident but still incorrect knowledge (self-reinforced hallucination).

Score thresholds are mis-set, causing endless or insufficient refinement.
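The endless-refinement case can be bounded with an explicit stopping rule. A minimal sketch, assuming a hypothetical `should_stop` helper and illustrative defaults (`max_rounds`, `min_gain`) that are not taken from the paper: stop when the threshold is met, when the round budget is spent, or when the latest refinement no longer improves the score.

```python
from typing import List


def should_stop(
    history: List[float],
    threshold: float,
    max_rounds: int = 5,     # assumed refinement budget
    min_gain: float = 0.01,  # assumed minimum improvement per round
) -> str:
    """Return a stop reason, or "" to keep refining.

    history[0] is the score of the initial draft; each later entry is the
    score after one refinement round.
    """
    if not history:
        return ""
    if history[-1] >= threshold:
        return "passed"
    if len(history) - 1 >= max_rounds:
        return "budget"
    if len(history) >= 2 and history[-1] - history[-2] < min_gain:
        return "stalled"
    return ""
```

Logging the stop reason makes mis-set thresholds visible: a threshold that is too low shows up as nearly every item returning "passed" on the initial draft, while one that is too high shows up as mostly "budget" or "stalled".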

Core Entities

Models

Vicuna-7B, Alpaca-LoRA, ChatGPT, MedAlpaca-7B, Robin-medical-7B, GPT4 (casual examples)

Metrics

Med-NLI (sample and sentence), CTRLEval (consistency), unigram F1, ROUGE-L, human annotation (query-inconsistency, tangentiality, fact-consistency)

Datasets

PubMedQA, MedQuAD, MEDIQA2019, LiveMedQA2017, MASH-QA