Use an iterative generate-score-refine loop to cut hallucinated answers from medical LLMs

October 10, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.55

Cost Impact Score

0.25

Citation Count

17

Authors

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung

Why It Matters For Business

Adding an iterative generate-score-refine step reduces irrelevant and factually inconsistent medical answers, lowering risk and improving trust for AI assistants used in healthcare workflows.

Summary TLDR

The authors study hallucination in medical generative question answering and propose an interactive self-reflection loop: generate background knowledge, score it, refine it, then generate answers and repeat until thresholds pass. Tested on five LLMs (Vicuna, Alpaca-LoRA, ChatGPT, MedAlpaca, Robin-medical) and five medical QA datasets, the loop increases Med-NLI entailment scores and reduces human-annotated hallucination categories. Ablations show the scoring, explicit aspect prompts, and reporting numeric scores each help. The method is retrieval-free, iterative, and intended as a complementary mitigation step, not a full fix.

Problem Statement

LLMs often produce plausible but false or irrelevant medical answers (hallucinations). This is risky in healthcare. The paper asks: how often do popular LLMs hallucinate on medical QA, why (e.g., rare topics), and can an iterative self-reflection loop reduce hallucinations without external retrieval?

Main Contribution

A systematic analysis of hallucinated answers across five LLMs and five medical QA datasets.

An interactive self-reflection method that repeatedly generates, scores, and refines background knowledge and answers.

Empirical evidence (automatic metrics and human evaluation) that the loop raises entailment and reduces hallucination; ablations identify key parts of the loop.

Key Findings

Iterative self-reflection raises Med-NLI sample entailment scores across models on PubMedQA.

Numbers: Vicuna: 0.4684 -> 0.6380 (+0.1696); ChatGPT: 0.5850 -> 0.6824 (+0.0974)

Weaker models gain the most in relative terms: Alpaca-LoRA's Med-NLI sample score on PubMedQA rose from 0.0940 to 0.4640.

Numbers: Alpaca-LoRA: 0.0940 -> 0.4640 (+0.3700)
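To make "relative gain" concrete, the snippet below recomputes absolute and relative improvements from the before/after scores quoted in this section. The helper name `gains` is just for illustration; the numbers are the ones listed above.

```python
# Med-NLI sample scores on PubMedQA quoted above: (baseline, with loop).
scores = {
    "Vicuna": (0.4684, 0.6380),
    "ChatGPT": (0.5850, 0.6824),
    "Alpaca-LoRA": (0.0940, 0.4640),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    absolute = after - before
    return absolute, absolute / before


for model, (before, after) in scores.items():
    absolute, relative = gains(before, after)
    print(f"{model}: +{absolute:.4f} absolute, {relative:+.1%} relative")
```

The relative gains come out at roughly 36% for Vicuna, 17% for ChatGPT, and 394% for Alpaca-LoRA, which is why the weakest baseline shows the largest relative improvement even though its final score is still the lowest of the three.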

Human evaluation shows lower query-inconsistency and tangential responses after applying the loop.

Numbers: Vicuna query-inconsistent: 0.67% -> 0.00%; tangentiality: 6.04% -> 2.00%; ChatGPT fact-inconsistent: 8.06% -> 6.33%

Ablation shows scoring, explicit aspect prompts, and reporting numeric scores each improve factuality and consistency.

Numbers: Med-NLI sample score without refinement: Vicuna_L 0.6380 -> 0.4520; ChatGPT_L 0.6824 -> 0.5180 (the _L suffix marks the model run with the loop)

Low-frequency topics correlate with higher incidence of problematic answers.

Numbers: Problematic answers show lower Google Ngram frequency on average (Figure 3)

Results

Med-NLI (sample)

Value: Vicuna 0.6380 (with loop)

Baseline: Vicuna 0.4684

Med-NLI (sample)

Value: Alpaca-LoRA 0.4640 (with loop)

Baseline: Alpaca-LoRA 0.0940

Med-NLI (sample)

Value: ChatGPT 0.6824 (with loop)

Baseline: ChatGPT 0.5850

Human eval: Query-Inconsistent

Value: Vicuna_L 0.00%

Baseline: Vicuna 0.67%

Human eval: Tangentiality

Value: Vicuna_L 2.00%

Baseline: Vicuna 6.04%

Who Should Care

What To Try In 7 Days

Implement a simple loop: generate background facts, score their factuality, and re-prompt the model to refine them until the score passes a threshold.

Add an entailment check (Med-NLI or sentence embedding similarity) between answer and question/context.

Run a small human review on 50 examples to confirm reductions in query-inconsistency and tangential replies.
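For the human review step, a small tally script is usually enough. The sketch below assumes each reviewed answer gets exactly one label, either `"ok"` or one of the three categories used in the paper's human evaluation; the label strings, function names, and sample data are hypothetical and only exercise the code.

```python
from collections import Counter
from typing import Dict, Sequence

# Category names are assumed spellings of the human-evaluation categories.
CATEGORIES = ("query_inconsistent", "tangential", "fact_inconsistent")


def category_rates(labels: Sequence[str]) -> Dict[str, float]:
    """Share of reviewed answers that fall into each problem category."""
    if not labels:
        raise ValueError("need at least one reviewed answer")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in CATEGORIES}


def rate_changes(before: Sequence[str], after: Sequence[str]) -> Dict[str, float]:
    """Per-category change in rate (negative means fewer problems after the loop)."""
    b, a = category_rates(before), category_rates(after)
    return {c: a[c] - b[c] for c in CATEGORIES}


# Made-up labels for illustration only; these are not results from the paper.
baseline_labels = ["ok"] * 6 + ["tangential"] * 2 + ["query_inconsistent", "fact_inconsistent"]
loop_labels = ["ok"] * 8 + ["tangential", "fact_inconsistent"]

changes = rate_changes(baseline_labels, loop_labels)
for category, delta in changes.items():
    print(f"{category}: {delta:+.0%}")
```

With only 50 examples, a difference of one or two answers moves a rate by several percentage points, so the tally is best read as a sanity check rather than a measurement.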

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not fully eliminate hallucinations; model can still generate ungrounded claims.
  • Evaluated only on English medical QA; generalization to other languages and domains is untested.
  • Iterative loops increase inference time and cost; not optimized for low-latency production.
  • Current thresholds and in-context demos are tuned per dataset and model; may need retuning.

When Not To Use

  • In high-stakes clinical decisions without additional verification and expert oversight.
  • Where strict low-latency responses are required.
  • For languages or domains not covered by evaluation.

Failure Modes

  • Model refines toward confident but still incorrect knowledge (self-reinforced hallucination).
  • Mis-set score thresholds cause endless or insufficient refinement.
  • Rare-topic answers remain unreliable due to weak parametric knowledge.
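The threshold failure mode can be contained with an explicit stopping rule. The following is one possible guard, not something the paper prescribes: `should_stop`, `max_rounds`, and `min_gain` are hypothetical names, and the default values are placeholders to tune.

```python
from typing import List, Tuple


def should_stop(score_history: List[float], threshold: float,
                max_rounds: int, min_gain: float = 0.01) -> Tuple[bool, str]:
    """Decide whether to stop refining, given the scores observed so far.

    score_history[0] is the score of the initial draft; each later entry
    is the score after one more refinement round.
    """
    if not score_history:
        return False, "no score yet"
    if score_history[-1] >= threshold:
        return True, "passed"
    if len(score_history) - 1 >= max_rounds:
        return True, "budget"    # threshold may be set too high
    if len(score_history) >= 2 and score_history[-1] - score_history[-2] < min_gain:
        return True, "stalled"   # refinement is no longer improving the score
    return False, "continue"


print(should_stop([0.40, 0.80], threshold=0.75, max_rounds=3))
print(should_stop([0.40, 0.405], threshold=0.75, max_rounds=3))
print(should_stop([0.10, 0.20, 0.30, 0.40], threshold=0.75, max_rounds=3))
```

Returning the reason alongside the decision makes it easy to log how often the loop ends on "budget" or "stalled", which hints at a mis-set threshold. A guard like this does not address self-reinforced hallucination, since a confidently wrong draft can still score well; that case needs an external check or expert review.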

Core Entities

Models

  • Vicuna-7B
  • Alpaca-LoRA
  • ChatGPT
  • MedAlpaca-7B
  • Robin-medical-7B
  • GPT-4 (casual examples)

Metrics

  • Med-NLI (sample and sentence)
  • CTRLEval (consistency)
  • unigram F1
  • ROUGE-L
  • human annotation (query-inconsistency, tangentiality, fact-consistency)
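Of the automatic metrics listed, unigram F1 is the simplest to reproduce. The sketch below uses lowercase whitespace tokenization, which is an assumption; the tokenization and normalization used in the paper may differ.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match; one empty and one non-empty do not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("aspirin reduces fever", "aspirin reduces pain"))
```

Overlap metrics like this reward shared wording, not factual agreement, which is why they are reported alongside entailment and human annotation.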

Datasets

  • PubMedQA
  • MedQuAD
  • MEDIQA2019
  • LiveMedQA2017
  • MASH-QA