Two-phase Verification: a probability-free check to detect hallucinations in medical QA

July 11, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Jiaxin Wu, Yizhou Yu, Hong-Yu Zhou

Links

Abstract / PDF

Why It Matters For Business

Medical LLM outputs can be confidently wrong; adding a verification chain reduces risk by flagging uncertain answers before they reach users.

Summary TLDR

This paper benchmarks common uncertainty estimation (UE) methods on three medical QA datasets using Llama 2 Chat models (7b, 13b). It finds that existing entropy and lexical methods perform weakly in medical QA and that the larger model generally helps. The authors propose Two-phase Verification: generate an explanation, produce a verification question for each step, answer each question twice (once independently and once with the statement as context), then flag inconsistencies between the two answers. Two-phase Verification has the highest average AUROC among the methods tested (overall 0.5858; 13b average 0.6053) and the lowest variability in these experiments.
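The flow described above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the callables `explain`, `make_question`, `answer`, and `consistent` are hypothetical placeholders for LLM and entailment calls, and scoring by the fraction of inconsistent steps is an assumption made here for simplicity.

```python
from typing import Callable, List


def two_phase_uncertainty(
    question: str,
    explain: Callable[[str], List[str]],      # hypothetical: returns explanation steps
    make_question: Callable[[str], str],      # hypothetical: step -> verification question
    answer: Callable[[str, str], str],        # hypothetical: (question, context) -> answer
    consistent: Callable[[str, str], bool],   # hypothetical: do two answers agree?
) -> float:
    """Return the fraction of explanation steps whose two answers disagree."""
    steps = explain(question)
    if not steps:
        # Nothing to verify; treating this as maximally uncertain is an assumption.
        return 1.0

    inconsistent = 0
    for step in steps:
        verification_question = make_question(step)
        # First pass: answer without seeing the statement.
        independent = answer(verification_question, "")
        # Second pass: answer with the statement supplied as context.
        conditioned = answer(verification_question, step)
        if not consistent(independent, conditioned):
            inconsistent += 1
    return inconsistent / len(steps)
```

A higher score means more steps failed the consistency check, so the answer is a stronger candidate for review. Swapping in real model calls only requires supplying the four callables.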

Problem Statement

LLMs can produce plausible but incorrect medical answers (hallucinations). Existing uncertainty signals (token probabilities, entropy, simple self-assessment) can be misleading in medicine or unavailable for black-box models. We need a practical, model-agnostic way to detect when an answer is likely wrong.

Main Contribution

Systematic benchmark of popular UE methods (lexical similarity; predictive, length-normalized, and semantic entropy; self-checking; and Chain-of-Verification, CoVe) on PubMedQA, MedQA, and MedMCQA using Llama 2 Chat (7b, 13b).

Two-phase Verification: a probability-free verification chain that answers verification questions twice (independent vs. with statement) and uses bidirectional entailment to quantify inconsistency.
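To make the bidirectional entailment idea concrete, here is a minimal sketch. Two answers count as consistent only if each entails the other. The `entails` argument is a hypothetical one-directional predicate; in practice it would wrap an NLI model, while `toy_entails` below is a word-overlap stand-in used only so the example runs on its own.

```python
from typing import Callable


def bidirectional_entailment(
    a: str, b: str, entails: Callable[[str, str], bool]
) -> bool:
    """True only if a entails b AND b entails a."""
    return entails(a, b) and entails(b, a)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder, NOT a real NLI model: every hypothesis word appears in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return hypothesis_words <= premise_words


# Same content, different word order: consistent under the toy predicate.
same = bidirectional_entailment(
    "aspirin reduces fever", "fever aspirin reduces", toy_entails
)

# One answer adds a detail the other lacks: entailment holds in one direction only.
extra_detail = bidirectional_entailment(
    "aspirin reduces fever and pain", "aspirin reduces fever", toy_entails
)
```

The second case shows why extra or missing details in one answer can trigger an inconsistency flag even when neither answer is wrong.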

Empirical finding that Two-phase Verification gives the best overall average AUROC and the most stable results across datasets and model sizes in the tests conducted, although CoVe scores slightly higher on the 7b model (0.5745 vs. 0.5663).

Analysis of failure modes and practical limits: verification-question quality and model-domain knowledge limit performance; dense retrieval from generic sources often has low relevance.

Key Findings

Two-phase Verification achieved the highest overall average AUROC among methods tested.

Numbers: Overall average AUROC = 0.5858; 13b average = 0.6053

Entropy and lexical similarity methods perform inconsistently and often poorly in medical QA.

Numbers: Lexical Similarity overall avg AUROC = 0.4662; Semantic Entropy = 0.5331

Larger models improved UE performance in these experiments.

Numbers: Two-phase avg AUROC: 7b = 0.5663, 13b = 0.6053 (increase ≈ 0.039)

Two-phase Verification produced more stable results across datasets.

Numbers: Overall SD for Two-phase = 0.0411 (lowest among methods)

Results

Overall average AUROC (all datasets, both sizes)

Value: 0.5858 (Two-phase)

Baseline: 0.5670 (CoVe)

Average AUROC (Llama 2 Chat 13b)

Value: 0.6053 (Two-phase)

Baseline: 0.5595 (CoVe)

Average AUROC (Llama 2 Chat 7b)

Value: 0.5663 (Two-phase)

Baseline: 0.5745 (CoVe)

Stability (overall SD)

Value: 0.0411 (Two-phase)

Baseline: 0.0694 (CoVe)

Who Should Care

What To Try In 7 Days

Run Two-phase Verification on a small set of real medical prompts to compare flagged vs. known incorrect answers.

Compare Two-phase uncertainty scores against simple entropy or lexical checks to see practical improvement.

Use few-shot templates for verification question generation and inspect failures to improve prompts quickly.
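For the comparisons suggested above, AUROC can be computed directly from uncertainty scores and known-correct/incorrect labels. The sketch below uses the pairwise definition and the standard library only. Treating "answer is wrong" as the positive class is an assumption made here; the function name and inputs are illustrative.

```python
from typing import Sequence


def auroc(scores: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Probability that a wrong answer gets a higher uncertainty score
    than a correct one, with ties counted as half."""
    wrong = [s for s, w in zip(scores, is_wrong) if w]
    correct = [s for s, w in zip(scores, is_wrong) if not w]
    if not wrong or not correct:
        raise ValueError("need at least one wrong and one correct answer")

    wins = 0.0
    for w_score in wrong:
        for c_score in correct:
            if w_score > c_score:
                wins += 1.0
            elif w_score == c_score:
                wins += 0.5
    return wins / (len(wrong) * len(correct))
```

A value of 0.5 means the score ranks wrong answers no better than chance, which is the reference point for the AUROC figures reported in this summary. The pairwise loop is quadratic, which is fine for a small pilot set.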

Reproducibility

Data Urls

  • PubMedQA, MedQA, MedMCQA (referenced datasets)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Verification questions can omit needed context or depend on pronouns that are unresolved outside the original explanation.
  • Method performance depends on the model's medical knowledge; general models may lack depth.
  • Experiments limited to two Llama 2 Chat sizes and three datasets; broader generalization is untested.
  • Dense retrieval from generic sources (Wikipedia) often returned low-relevance evidence.

When Not To Use

  • If the base model has very weak domain knowledge (small LMs), since verification answers may be unreliable.
  • When you cannot form per-step verification questions (very short or ambiguous explanations).
  • If low latency is required, because Two-phase answers every verification question twice and adds entailment checks.

Failure Modes

  • The model gives consistent answers to a verification question, but both answers are wrong (false negative).
  • Independent answer introduces extra or missing details, causing false inconsistency flags.
  • Entailment model misclassifies paraphrases as inconsistent, producing false positives.
  • Poor retrieval yields irrelevant context, degrading independent verification quality.

Core Entities

Models

  • Llama 2 Chat (7b)
  • Llama 2 Chat (13b)
  • DeBERTa-large (used for bidirectional entailment check)

Metrics

  • AUROC

Datasets

  • PubMedQA
  • MedQA
  • MedMCQA