Two-phase Verification: a probability-free check to detect hallucinations in medical QA

July 11, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical and model-agnostic, but it was tested only on Llama 2 Chat and three datasets. Results are promising but need wider validation and better domain retrieval before the method is safe for production.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Jiaxin Wu, Yizhou Yu, Hong-Yu Zhou

Why It Matters For Business

Medical LLM outputs can be confidently wrong; adding a verification chain reduces risk by flagging uncertain answers before they reach users.

Who Should Care

Summary TLDR

This paper benchmarks common uncertainty estimation (UE) methods on three medical QA datasets with Llama 2 Chat models (7b, 13b). It finds that existing entropy and lexical methods perform weakly in medical QA and that larger models help. The authors propose Two-phase Verification: generate an explanation, produce per-step verification questions, answer each question twice (once independently and once with the statement as context), then flag inconsistencies. Two-phase Verification outperforms the baselines in average AUROC (overall 0.5858; 13b avg 0.6053) and shows the lowest variability in these experiments.

Problem Statement

LLMs can produce plausible but incorrect medical answers (hallucinations). Existing uncertainty signals (token probabilities, entropy, simple self-assessment) can be misleading in medicine or unavailable for black-box models. We need a practical, model-agnostic way to detect when an answer is likely wrong.

Main Contribution

Systematic benchmark of popular UE methods (lexical similarity; semantic, predictive, and length-normalized entropy; self-checking; and CoVe) on PubMedQA, MedQA, and MedMCQA using Llama 2 Chat (7b, 13b).

Two-phase Verification: a probability-free verification chain that answers verification questions twice (independent vs. with statement) and uses bidirectional entailment to quantify inconsistency.
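
The verification chain described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ask` and `entails` are hypothetical callables standing in for any chat LLM and any NLI classifier, the prompt wording is invented, and scoring uncertainty as the fraction of inconsistent steps is an assumption (the summary only says that bidirectional entailment is used to quantify inconsistency).

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
#   ask(prompt) -> text             wraps any chat LLM
#   entails(premise, hypothesis)    wraps any NLI classifier
Ask = Callable[[str], str]
Entails = Callable[[str, str], bool]


def bidirectionally_equivalent(a: str, b: str, entails: Entails) -> bool:
    """Treat two answers as consistent only if each entails the other."""
    return entails(a, b) and entails(b, a)


def two_phase_uncertainty(question: str, ask: Ask, entails: Entails) -> float:
    """Return the fraction of explanation steps whose two answers disagree."""
    explanation = ask(
        f"Answer step by step, one step per line.\nQuestion: {question}"
    )
    steps = [s.strip() for s in explanation.splitlines() if s.strip()]
    if not steps:
        raise ValueError("no explanation steps to verify")

    inconsistent = 0
    for step in steps:
        # One verification question per explanation step.
        vq = ask(f"Write one question that checks this statement: {step}")
        # Phase 1: answer the question with no extra context.
        independent = ask(vq)
        # Phase 2: answer again with the statement supplied as context.
        grounded = ask(f"Statement: {step}\nQuestion: {vq}")
        if not bidirectionally_equivalent(independent, grounded, entails):
            inconsistent += 1
    return inconsistent / len(steps)
```

Because both callables are injected, the same sketch works with a black-box API model, which is the point of a probability-free check: no token probabilities are read at any step.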

Key Findings

Two-phase Verification achieved the highest overall average AUROC among methods tested.

Numbers: Overall average AUROC = 0.5858; 13b average = 0.6053

Practical Use: Use Two-phase Verification to get a more reliable uncertainty signal than simple entropy or lexical checks on medical QA tasks.

Evidence Ref: Table 1 (Overall average and Llama 2 Chat 13b averages)

Entropy and lexical similarity methods perform inconsistently and often poorly in medical QA.

Numbers: Lexical Similarity overall avg AUROC = 0.4662; Semantic Entropy = 0.5331

Practical Use: Do not rely on token/lexical-only uncertainty measures for medical outputs; add semantic or verification-based checks.

Evidence Ref: Table 1 (overall averages for LS and SE)
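
One way to see why lexical signals can mislead is a toy overlap score. The sketch below uses mean pairwise Jaccard token overlap as a stand-in; the exact lexical similarity measure used in the paper is not specified in this summary, so treat the metric and the sample sentences as illustrative assumptions.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers (1.0 = identical vocabulary)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_overlap(answers: List[str]) -> float:
    """Average Jaccard overlap over all pairs of sampled answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two sampled answers")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Two sampled answers that contradict each other but share almost every token.
samples = [
    "metformin is contraindicated in severe renal failure",
    "metformin is not contraindicated in severe renal failure",
]
print(mean_pairwise_overlap(samples))  # 0.875
```

The two samples disagree on the clinical claim, yet the overlap is high, so a lexical measure would read them as agreeing and report low uncertainty.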

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Overall average AUROC (all datasets, both sizes) | 0.5858 (Two-phase) | 0.5670 (CoVe) | +0.0188 | All datasets, 7b+13b aggregate | Table 1 overall average row | Table 1
Average AUROC (Llama 2 Chat 13b) | 0.6053 (Two-phase) | 0.5595 (CoVe) | +0.0458 | Average over PubMedQA, MedQA, MedMCQA for 13b | Table 1 Llama 2 Chat (13b) average row | Table 1

What To Try In 7 Days

Run Two-phase Verification on a small set of real medical prompts to compare flagged vs. known incorrect answers.

Compare Two-phase uncertainty scores against simple entropy or lexical checks to see practical improvement.

Use few-shot templates for verification question generation and inspect failures to improve prompts quickly.
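
For the few-shot template step, a prompt builder along these lines may be a reasonable starting point. The example pairs, the instruction wording, and the function name `build_verification_prompt` are all hypothetical; the instruction to avoid pronouns is one possible response to the pronoun-resolution limitation noted for this method.

```python
from typing import List, Tuple

# Hypothetical few-shot pairs: (explanation step, verification question).
FEW_SHOT: List[Tuple[str, str]] = [
    ("Warfarin is a vitamin K antagonist.",
     "What is the mechanism of action of warfarin?"),
    ("Type 1 diabetes is caused by autoimmune destruction of beta cells.",
     "What causes type 1 diabetes?"),
]


def build_verification_prompt(
    step: str, examples: List[Tuple[str, str]] = FEW_SHOT
) -> str:
    """Assemble a few-shot prompt asking for one self-contained question."""
    lines = [
        "Write one verification question for the statement.",
        "Name every entity explicitly; do not use pronouns.",
        "",
    ]
    for statement, question in examples:
        lines += [f"Statement: {statement}", f"Question: {question}", ""]
    # The final pair is left open so the model completes the question.
    lines += [f"Statement: {step}", "Question:"]
    return "\n".join(lines)


print(build_verification_prompt("Aspirin irreversibly inhibits COX-1."))
```

Keeping the template in one function makes it easy to swap examples after inspecting failures, then rerun the same prompts.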

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

PubMedQA, MedQA, MedMCQA (referenced datasets)

Risks & Boundaries

Limitations

Verification questions vary in quality: they can omit needed context or depend on pronoun resolution.

Method performance depends on the model's medical knowledge; general models may lack depth.

When Not To Use

When the base model has very weak domain knowledge (e.g., small LMs), because its verification answers may be unreliable.

When you cannot form per-step verification questions (very short or ambiguous explanations).

Failure Modes

The model answers a verification question consistently, but both answers are wrong, so the error goes unflagged (false negative).

The independent answer adds or omits details, causing a false inconsistency flag (false positive).

Core Entities

Models

Llama 2 Chat (7b), Llama 2 Chat (13b), DeBERTa-large (used for bidirectional entailment check)

Metrics

AUROC
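
As a reminder of what the metric measures, AUROC can be computed directly from its rank interpretation: the probability that a wrong answer receives a higher uncertainty score than a correct one. The sketch below assumes wrong answers are the positive class and uses a simple pairwise loop, which is fine for small evaluation sets; the function name and labeling convention are assumptions, not taken from the paper.

```python
from typing import Sequence


def auroc(uncertainty: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Probability that a wrong answer outscores a correct one.

    Ties count as half a win, matching the Mann-Whitney U formulation.
    """
    pos = [u for u, w in zip(uncertainty, is_wrong) if w]
    neg = [u for u, w in zip(uncertainty, is_wrong) if not w]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one correct answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Perfect separation: every wrong answer has a higher score.
print(auroc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # 1.0
```

A score of 0.5 corresponds to chance-level ranking, which is the reference point for reading the values reported here.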

Datasets

PubMedQA, MedQA, MedMCQA