Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
Medical LLM outputs can be confidently wrong; adding a verification chain reduces risk by flagging uncertain answers before they reach users.
Summary TLDR
This paper benchmarks common uncertainty estimation (UE) methods on three medical QA datasets using Llama 2 Chat models (7b, 13b). It finds that existing entropy and lexical methods perform weakly in medical QA and that larger models help. The authors propose Two-phase Verification: generate an explanation, produce verification questions for each step, answer each question twice (once independently and once with the statement as context), then flag inconsistencies. Two-phase Verification outperforms the baselines in average AUROC (overall 0.5858; 13b average 0.6053) and shows the lowest variability in these experiments.
Problem Statement
LLMs can produce plausible but incorrect medical answers (hallucinations). Existing uncertainty signals (token probabilities, entropy, simple self-assessment) can be misleading in medicine or unavailable for black-box models. We need a practical, model-agnostic way to detect when an answer is likely wrong.
Main Contribution
Systematic benchmark of popular UE methods (lexical similarity; semantic, predictive, and length-normalized entropy; self-checking; and CoVe) on PubMedQA, MedQA, and MedMCQA using Llama 2 Chat (7b, 13b).
Two-phase Verification: a probability-free verification chain that answers each verification question twice (once independently and once with the statement as context) and uses bidirectional entailment to quantify inconsistency.
Empirical finding that Two-phase Verification gives the best average AUROC and the most stable results across datasets and model sizes in the tests conducted.
Analysis of failure modes and practical limits: verification-question quality and model-domain knowledge limit performance; dense retrieval from generic sources often has low relevance.
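The Two-phase Verification steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm` and `entails` callables, the prompt wording, the one-question-per-step simplification, and the choice to score uncertainty as the fraction of inconsistent steps are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interfaces: plug in any chat model and any entailment checker.
LLM = Callable[[str], str]
Entails = Callable[[str, str], bool]  # entails(premise, hypothesis)


def consistent(a: str, b: str, entails: Entails) -> bool:
    """Bidirectional entailment: each answer must entail the other."""
    return entails(a, b) and entails(b, a)


def two_phase_uncertainty(question: str, llm: LLM, entails: Entails) -> float:
    """Return the fraction of explanation steps whose two answers disagree."""
    # Step 1: ask for a step-by-step explanation (prompt wording is illustrative).
    explanation = llm(f"Answer step by step, one step per line.\nQuestion: {question}")
    steps = [s.strip() for s in explanation.splitlines() if s.strip()]
    if not steps:
        return 1.0  # assumption: nothing to verify counts as maximally uncertain

    inconsistent = 0
    for step in steps:
        # Step 2: one verification question per step (a simplification).
        vq = llm(f"Write one question that verifies this statement: {step}")
        # Phase 1: answer without seeing the statement.
        independent = llm(vq)
        # Phase 2: answer with the statement supplied as context.
        grounded = llm(f"Statement: {step}\nQuestion: {vq}")
        # Step 3: count the step as inconsistent if the answers do not agree.
        if not consistent(independent, grounded, entails):
            inconsistent += 1
    return inconsistent / len(steps)
```

In practice `entails` would wrap a natural language inference model (the paper uses DeBERTa-large for this check); an exact-match lambda is enough to exercise the control flow.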
Key Findings
Two-phase Verification achieved the highest overall average AUROC among methods tested.
Entropy and lexical similarity methods perform inconsistently and often poorly in medical QA.
Larger models improved UE performance in these experiments.
Two-phase Verification produced more stable results across datasets.
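For contrast with the probability-free approach, the entropy baselines need token log-probabilities from the model. The sketch below shows one common Monte Carlo formulation of predictive entropy and its length-normalized variant; it is an assumption that this matches the exact variants benchmarked in the paper, and the function and parameter names are hypothetical.

```python
from typing import List


def predictive_entropy(
    sample_token_logprobs: List[List[float]],
    length_normalize: bool = False,
) -> float:
    """Estimate entropy as the negative mean sequence log-probability.

    Each inner list holds the token log-probabilities of one sampled answer.
    With length_normalize=True, each sequence log-probability is divided by
    its token count so that long answers are not penalized for length alone.
    """
    if not sample_token_logprobs:
        raise ValueError("need at least one sampled answer")
    total = 0.0
    for token_logprobs in sample_token_logprobs:
        seq_logprob = sum(token_logprobs)
        if length_normalize:
            seq_logprob /= max(len(token_logprobs), 1)
        total += seq_logprob
    return -total / len(sample_token_logprobs)
```

Because this score depends on access to token probabilities, it cannot be computed for black-box models that return text only, which is one motivation for the verification-based method.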
Results
Overall average AUROC (all datasets, both sizes): 0.5858 (Two-phase Verification)
Average AUROC (Llama 2 Chat 13b): 0.6053 (Two-phase Verification)
Average AUROC (Llama 2 Chat 7b)
Stability (overall SD)
Who Should Care
What To Try In 7 Days
Run Two-phase Verification on a small set of real medical prompts to compare flagged vs. known incorrect answers.
Compare Two-phase uncertainty scores against simple entropy or lexical checks to see practical improvement.
Use few-shot templates for verification question generation and inspect failures to improve prompts quickly.
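A small harness along these lines may help with the first two trials: it compares the answers flagged by any uncertainty score against answers already known to be incorrect. The function name, the fixed threshold, and the convention that higher scores mean more uncertainty are assumptions for illustration.

```python
from typing import List, Tuple


def flag_metrics(
    scores: List[float],
    is_incorrect: List[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (precision, recall) of flags, where a flag means score >= threshold."""
    if len(scores) != len(is_incorrect):
        raise ValueError("scores and labels must have the same length")
    flagged = [s >= threshold for s in scores]
    true_pos = sum(f and y for f, y in zip(flagged, is_incorrect))
    n_flagged = sum(flagged)
    n_incorrect = sum(is_incorrect)
    # Define the ratio as 0.0 when its denominator is empty (a convention, not a rule).
    precision = true_pos / n_flagged if n_flagged else 0.0
    recall = true_pos / n_incorrect if n_incorrect else 0.0
    return precision, recall
```

Running the same labelled prompts through Two-phase Verification and through an entropy or lexical baseline, then calling `flag_metrics` on each set of scores, gives a quick side-by-side view at a chosen threshold.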
Reproducibility
Data Urls
- PubMedQA, MedQA, MedMCQA (referenced datasets)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Verification questions can omit needed context or depend on unresolved pronouns, which lowers their quality.
- Method performance depends on the model's medical knowledge; general models may lack depth.
- Experiments limited to two Llama 2 Chat sizes and three datasets; broader generalization is untested.
- Dense retrieval from generic sources (Wikipedia) often returned low-relevance evidence.
When Not To Use
- If the base model has very weak domain knowledge (small LMs), since verification answers may be unreliable.
- When you cannot form per-step verification questions (very short or ambiguous explanations).
- If low latency is required, because Two-phase Verification answers every verification question twice and adds entailment checks.
Failure Modes
- The model gives consistent answers to a verification question, but both answers are wrong (false negative).
- Independent answer introduces extra or missing details, causing false inconsistency flags.
- Entailment model misclassifies paraphrases as inconsistent, producing false positives.
- Poor retrieval yields irrelevant context, degrading independent verification quality.
Core Entities
Models
- Llama 2 Chat (7b)
- Llama 2 Chat (13b)
- DeBERTa-large (used for bidirectional entailment check)
Metrics
- AUROC
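In this setting AUROC measures how well an uncertainty score ranks incorrect answers above correct ones: 0.5 is chance and 1.0 is perfect separation. A minimal pairwise sketch, assuming higher scores mean more uncertainty (the function name is hypothetical):

```python
from typing import List


def auroc(scores: List[float], is_incorrect: List[bool]) -> float:
    """Probability that a random incorrect answer scores higher than a random
    correct one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, is_incorrect) if y]
    neg = [s for s, y in zip(scores, is_incorrect) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of answers, which is fine for small evaluation sets; a library routine such as scikit-learn's `roc_auc_score` is the usual choice at scale.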
Datasets
- PubMedQA
- MedQA
- MedMCQA

