Use an iterative generate-score-refine loop to cut hallucinated answers from medical LLMs

Overview

Decision Snapshot: Needs Validation

The method shows consistent automatic and human-evaluated improvements across five datasets and several models, but it remains retrieval-free and not yet safe for unvetted clinical deployment.

Citations: 17

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 25%

Production readiness: 30%

Novelty: 55%

Authors

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung

Links

Abstract / PDF / Code

Why It Matters For Business

Adding an iterative generate-score-refine step reduces irrelevant and factually inconsistent medical answers, lowering risk and improving trust for AI assistants used in healthcare workflows.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

The authors study hallucination in medical generative question answering and propose an interactive self-reflection loop: generate background knowledge, score it, refine it, then generate answers and repeat until thresholds pass. Tested on five LLMs (Vicuna, Alpaca-LoRA, ChatGPT, MedAlpaca, Robin-medical) and five medical QA datasets, the loop increases Med-NLI entailment scores and reduces human-annotated hallucination categories. Ablations show the scoring, explicit aspect prompts, and reporting numeric scores each help. The method is retrieval-free, iterative, and intended as a complementary mitigation step, not a full fix.

Problem Statement

LLMs often produce plausible but false or irrelevant medical answers (hallucinations). This is risky in healthcare. The paper asks: how often do popular LLMs hallucinate on medical QA, why (e.g., rare topics), and can an iterative self-reflection loop reduce hallucinations without external retrieval?

Main Contribution

A systematic analysis of hallucinated answers across five LLMs and five medical QA datasets.

An interactive self-reflection method that repeatedly generates, scores, and refines background knowledge and answers.

Key Findings

Iterative self-reflection raises Med-NLI sample entailment scores across models on PubMedQA.

Numbers: Vicuna: 0.4684 -> 0.6380 (+0.1696); ChatGPT: 0.5850 -> 0.6824 (+0.0974)

Practical Use: Add a generate-score-refine loop to improve answer entailment and reduce hallucinations in medical QA systems.

Evidence Ref: Table 2 (PubMedQA MedNLI sample)

Weaker models gain the most: Alpaca-LoRA's MedNLI sample score on PubMedQA rose from 0.0940 to 0.4640.

Numbers: Alpaca-LoRA: 0.0940 -> 0.4640 (+0.3700)

Practical Use: Self-reflection is especially valuable for smaller models or models with less medical fine-tuning.

Evidence Ref: Table 2 (PubMedQA)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MedNLI (sample)	Vicuna: 0.6380 (with loop)	Vicuna: 0.4684 (baseline)	+0.1696	PubMedQA	Table 2 (PubMedQA)	Table 2
MedNLI (sample)	Alpaca-LoRA: 0.4640 (with loop)	Alpaca-LoRA: 0.0940 (baseline)	+0.3700	PubMedQA	Table 2 (PubMedQA)	Table 2
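The deltas above are absolute differences in the Med-NLI sample score. As a worked check, the short sketch below recomputes them from the reported baseline and with-loop values, and also derives the relative gain, which is what makes the Alpaca-LoRA result stand out. The `gains` helper and the `reported` dictionary are illustrative names, not part of the paper's code.

```python
# (baseline, with_loop) Med-NLI sample scores on PubMedQA, copied from the
# Key Findings and Results entries above.
reported = {
    "Vicuna": (0.4684, 0.6380),
    "ChatGPT": (0.5850, 0.6824),
    "Alpaca-LoRA": (0.0940, 0.4640),
}


def gains(baseline: float, with_loop: float) -> tuple:
    """Return (absolute gain, relative gain) for one model."""
    absolute = with_loop - baseline
    relative = absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for model, (base, loop) in reported.items():
        absolute, relative = gains(base, loop)
        print(f"{model}: {absolute:+.4f} absolute, {relative:+.1%} relative")
```

Relative gain is sensitive to a low baseline, so a large percentage here mostly indicates how weak the starting point was rather than how strong the final score is.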

What To Try In 7 Days

Implement a simple loop: generate background facts, score their factuality, and re-prompt the model to refine them until the score passes a threshold.

Add an entailment check (Med-NLI or sentence embedding similarity) between the answer and the question or context.

Run a small human review on 50 examples to confirm reductions in query-inconsistency and tangential replies.
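The first step in the list above can be sketched as a small, model-agnostic helper. This is a minimal sketch under stated assumptions, not the authors' implementation: `refine_until_threshold`, its parameters, and the `toy_*` stand-ins are hypothetical names, and the `generate`, `score`, and `refine` callables would need to be backed by a real model and a real factuality or entailment scorer.

```python
from typing import Callable, Tuple


def refine_until_threshold(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    refine: Callable[[str, str, float], str],
    threshold: float = 0.8,   # assumed value; tune against human labels
    max_iters: int = 3,       # assumed cap so the loop always terminates
) -> Tuple[str, float, int]:
    """Generate a draft, then score and refine it until the score reaches
    `threshold` or `max_iters` refinements have been used."""
    draft = generate(question)
    current = score(question, draft)
    iters = 0
    while current < threshold and iters < max_iters:
        # The numeric score is handed back to the refiner, mirroring the
        # idea of reporting scores to the model.
        draft = refine(question, draft, current)
        current = score(question, draft)
        iters += 1
    return draft, current, iters


# Toy stand-ins so the sketch runs without any model.
def toy_generate(question: str) -> str:
    return "fact"


def toy_score(question: str, text: str) -> float:
    return min(1.0, 0.3 * text.count("fact"))


def toy_refine(question: str, text: str, last_score: float) -> str:
    return text + " fact"


if __name__ == "__main__":
    print(refine_until_threshold("q", toy_generate, toy_score, toy_refine))
```

Following the summary, the same helper could be run twice: once over the generated background knowledge, and once over the answer conditioned on that knowledge.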

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ziweiji/Self_Reflection_Medical

Risks & Boundaries

Limitations

Does not fully eliminate hallucinations; the model can still generate ungrounded claims.

Evaluated only on English medical QA; generalization to other languages and domains is untested.

When Not To Use

In high-stakes clinical decisions without additional verification and expert oversight.

Where strict low-latency responses are required.

Failure Modes

Model refines toward confident but still incorrect knowledge (self-reinforced hallucination).

Score thresholds are mis-set, causing endless or insufficient refinement.
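One way to guard against the endless-refinement case is an explicit stopping rule that also detects stalled scores. The sketch below is an illustration only: `stop_reason`, `max_iters`, and `min_gain` are hypothetical names and defaults, not values from the paper. It does not address self-reinforced hallucination, which needs an independent check such as human review.

```python
from typing import Optional, Sequence


def stop_reason(
    scores: Sequence[float],
    threshold: float,
    max_iters: int = 5,      # assumed refinement budget
    min_gain: float = 0.01,  # assumed minimum useful improvement per step
) -> Optional[str]:
    """Decide whether to stop refining, given the score history.

    `scores[0]` is the initial draft's score; each later entry is the score
    after one refinement. Returns None when refinement should continue.
    """
    if not scores:
        return None
    if scores[-1] >= threshold:
        return "passed"
    if len(scores) - 1 >= max_iters:
        return "budget_exhausted"
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "stalled"
    return None


if __name__ == "__main__":
    print(stop_reason([0.40, 0.41, 0.412], threshold=0.8))
```

Returning a reason rather than a boolean makes it easier to log how often the threshold is never reached, which is a useful signal that the threshold itself is mis-set.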

Core Entities

Models

Vicuna-7B, Alpaca-LoRA, ChatGPT, MedAlpaca-7B, Robin-medical-7B, GPT4 (casual examples)

Metrics

Med-NLI (sample and sentence); CTRLEval (consistency); unigram F1; ROUGE-L; human annotation (query-inconsistency, tangentiality, fact-consistency)

Datasets

PubMedQA, MedQuAD, MEDIQA2019, LiveMedQA2017, MASH-QA
