Two-phase Verification: a probability-free check to detect hallucinations in medical QA

Overview

Decision Snapshot: Needs Validation

The method is practical and model-agnostic, but it was tested only on Llama 2 Chat and three datasets. The results are promising but need wider validation and better domain retrieval before the method is safe for production.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 60%

Authors

Jiaxin Wu, Yizhou Yu, Hong-Yu Zhou

Links

Abstract / PDF / Data

Why It Matters For Business

Medical LLM outputs can be confidently wrong; adding a verification chain reduces risk by flagging uncertain answers before they reach users.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO

Summary TLDR

This paper benchmarks common uncertainty estimation (UE) methods on three medical QA datasets using Llama 2 Chat models (7b, 13b). It finds that existing entropy and lexical methods perform poorly in medical QA and that larger models help. The authors propose Two-phase Verification: generate an explanation, produce a verification question for each step, answer each question twice (once independently and once with the statement as context), then flag inconsistencies between the two answers. Two-phase Verification outperforms the baselines in average AUROC (overall 0.5858; 13b average 0.6053) and shows the lowest variability in these experiments.

Problem Statement

LLMs can produce plausible but incorrect medical answers (hallucinations). Existing uncertainty signals (token probabilities, entropy, simple self-assessment) can be misleading in medicine or unavailable for black-box models. We need a practical, model-agnostic way to detect when an answer is likely wrong.

Main Contribution

Systematic benchmark of popular UE methods (lexical/semantic/predictive/length-normalized entropy, self-checking and CoVe) on PubMedQA, MedQA, MedMCQA using Llama 2 Chat (7b, 13b).

Two-phase Verification: a probability-free verification chain that answers verification questions twice (independent vs. with statement) and uses bidirectional entailment to quantify inconsistency.
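The verification chain described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `ask` and `entails` callables, the prompt wording, and the scoring rule (fraction of steps whose two answers disagree) are all assumptions introduced here. In the paper the LLM is Llama 2 Chat and the entailment check uses DeBERTa-large.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   ask(prompt) -> str               wraps any black-box LLM
#   entails(premise, hyp) -> bool    wraps any NLI model
Ask = Callable[[str], str]
Entails = Callable[[str, str], bool]


def equivalent(a: str, b: str, entails: Entails) -> bool:
    """Bidirectional entailment: a entails b AND b entails a."""
    return entails(a, b) and entails(b, a)


def two_phase_uncertainty(question: str, ask: Ask, entails: Entails) -> float:
    """Return the fraction of explanation steps whose two answers disagree."""
    # Generate a step-by-step explanation, one step per line.
    explanation = ask(f"Answer step by step, one step per line:\n{question}")
    steps: List[str] = [s.strip() for s in explanation.splitlines() if s.strip()]
    if not steps:
        raise ValueError("no explanation steps to verify")

    inconsistent = 0
    for step in steps:
        # One verification question per step.
        vq = ask(f"Write a question that checks this statement:\n{step}")
        # Phase 1: answer independently, without seeing the statement.
        independent = ask(vq)
        # Phase 2: answer again with the statement given as context.
        grounded = ask(f"Context: {step}\nQuestion: {vq}")
        if not equivalent(independent, grounded, entails):
            inconsistent += 1
    return inconsistent / len(steps)
```

A higher score means more disagreement between the two phases, so answers above a chosen threshold would be flagged for review. Because the sketch only needs text in and text out, it does not require token probabilities from the model.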

Key Findings

Two-phase Verification achieved the highest overall average AUROC among methods tested.

Numbers: Overall average AUROC = 0.5858; 13b average = 0.6053

Practical Use: Use Two-phase Verification to get a more reliable uncertainty signal than simple entropy or lexical checks on medical QA tasks.

Evidence Ref: Table 1 (Overall average and Llama 2 Chat 13b averages)

Entropy and lexical similarity methods perform inconsistently and often poorly in medical QA.

Numbers: Lexical Similarity overall avg AUROC = 0.4662; Semantic Entropy = 0.5331

Practical Use: Do not rely on token/lexical-only uncertainty measures for medical outputs; add semantic or verification-based checks.

Evidence Ref: Table 1 (overall averages for LS and SE)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall average AUROC (all datasets, both sizes)	0.5858 (Two-phase)	0.5670 (CoVe)	+0.0188	All datasets, 7b+13b aggregate	Table 1 overall average row	Table 1
Average AUROC (Llama 2 Chat 13b)	0.6053 (Two-phase)	0.5595 (CoVe)	+0.0458	Average over PubMedQA, MedQA, MedMCQA for 13b	Table 1 Llama 2 Chat (13b) average row	Table 1

What To Try In 7 Days

Run Two-phase Verification on a small set of real medical prompts to compare flagged vs. known incorrect answers.

Compare Two-phase uncertainty scores against simple entropy or lexical checks to see practical improvement.

Use few-shot templates for verification question generation and inspect failures to improve prompts quickly.
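For the comparison steps above, AUROC can be computed directly from uncertainty scores and correctness labels. The sketch below uses a plain pairwise definition; the scores and labels are made-up placeholders for illustration, not results from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Probability that a wrong answer gets a higher uncertainty score
    than a correct one (ties count as half)."""
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    if not pos or not neg:
        raise ValueError("need at least one wrong and one correct answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, for illustration only.
is_wrong = [True, False, True, False, False]
two_phase_scores = [0.8, 0.2, 0.5, 0.5, 0.0]
entropy_scores = [0.3, 0.6, 0.4, 0.1, 0.5]

print("two-phase AUROC:", auroc(two_phase_scores, is_wrong))
print("entropy AUROC:  ", auroc(entropy_scores, is_wrong))
```

An AUROC of 0.5 corresponds to a score that ranks wrong and correct answers no better than chance, which is the reference point for reading the values reported in Table 1.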

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

PubMedQA, MedQA, MedMCQA (referenced datasets)

Risks & Boundaries

Limitations

Verification questions can be of poor quality: they may omit needed context or depend on pronouns that cannot be resolved without the original statement.

Method performance depends on the model's medical knowledge; general models may lack depth.

When Not To Use

When the base model has very weak domain knowledge (e.g., small LMs), because its verification answers may be unreliable.

When you cannot form per-step verification questions (very short or ambiguous explanations).

Failure Modes

The model answers verification questions consistently but both answers are jointly wrong (false negative).

The independent answer adds or omits details relative to the context-conditioned answer, causing false inconsistency flags.
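One way to see how often each failure mode occurs on a labeled sample is to tally outcomes at a fixed flagging threshold. The function name, the outcome labels, and the threshold rule below are assumptions made for illustration; the paper itself reports AUROC rather than thresholded counts.

```python
from collections import Counter
from typing import Sequence


def audit_flags(scores: Sequence[float], is_wrong: Sequence[bool],
                threshold: float) -> Counter:
    """Count outcomes when answers with score >= threshold are flagged."""
    counts: Counter = Counter()
    for score, wrong in zip(scores, is_wrong):
        flagged = score >= threshold
        if flagged and wrong:
            counts["caught_error"] += 1
        elif flagged:
            counts["false_flag"] += 1      # inconsistent answers, but correct
        elif wrong:
            counts["missed_error"] += 1    # consistent answers, jointly wrong
        else:
            counts["passed_correct"] += 1
    return counts
```

Here `missed_error` corresponds to the first failure mode (consistent but wrong) and `false_flag` to the second (inconsistency caused by extra or missing details).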

Core Entities

Models

Llama 2 Chat (7b), Llama 2 Chat (13b), DeBERTa-large (used for the bidirectional entailment check)

Metrics

AUROC

Datasets

PubMedQA, MedQA, MedMCQA

You May Also Want to Read

PanCanBench: 282 real patient questions + 3,130 expert rubrics to test LLM clinical completeness and factuality

Hallucinations in LLMs are diverse, theoretically inevitable, and must be managed with grounding and human oversight

Bi'an: a bilingual RAG hallucination benchmark plus small fine-tuned judge models

LLMs misjudge mixed-context hallucinations: external retrieval helps but factual cases remain hard

MultiHal: a multilingual, Wikidata-grounded benchmark that uses KG paths to evaluate and reduce LLM hallucinations