Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
If you use search-based prompt optimization to improve LLM accuracy, standard generation-confidence scores can steer the search toward diverse but wrong outputs. Invest in correctness-specific checks or new uncertainty estimators.
Summary TLDR
The paper builds a benchmark that measures how well common text-generation uncertainty metrics (token-likelihood and verbalized confidence) estimate four uncertainty types needed for prompt optimization: Answer, Correctness, Aleatoric, and Epistemic. Using tree-structured sampling with GPT-3.5-Turbo and Llama-3.1-8B on GSM8K and StrategyQA, the study finds current metrics track answer diversity (Answer Uncertainty) and related aleatoric/epistemic signals, but systematically fail to estimate correctness uncertainty. The gap implies you should not rely on standard generation-confidence scores to guide prompt-search algorithms aimed at finding correct answers.
Problem Statement
Search-based prompt optimization (MCTS, bandits, gradient search) needs uncertainty estimates that reflect the search objective (e.g., correctness). Existing NLG uncertainty metrics focus on token/sentence likelihood or verbalized confidence and mainly measure output diversity, which may not guide prompt search toward correct answers.
Main Contribution
Define four target uncertainties for prompt optimization: Answer, Correctness, Aleatoric, Epistemic, with simple formulas and practical roles.
Introduce a benchmarking pipeline that builds tree-structured reasoning traces by perturbing inputs and sampling many outputs to produce ground-truth uncertainty per node.
Evaluate four common black-box NLG metrics (NPE, LNPE, Top-DISP, and Intra, the verbalized-confidence metric) on GPT-3.5-Turbo and Meta-Llama-3.1-8B across GSM8K and StrategyQA, and report their correlations with the target uncertainties.
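As a rough illustration of the first two targets, the sketch below estimates them from a set of sampled final answers to one prompt. It assumes Answer Uncertainty is the Shannon entropy of the empirical answer distribution and Correctness Uncertainty is the binary entropy of the empirical correct rate. This summary does not reproduce the paper's formulas, so these definitions and all function names are assumptions.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms are skipped."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def answer_uncertainty(answers):
    """Assumed definition: entropy over the distinct sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return entropy(c / n for c in counts.values())

def correctness_uncertainty(answers, gold):
    """Assumed definition: binary entropy of the fraction of correct samples."""
    p = sum(a == gold for a in answers) / len(answers)
    return entropy([p, 1 - p])

# Toy samples: four different answers, none matching the gold answer "18".
diverse_wrong = ["12", "15", "9", "20"]
print(answer_uncertainty(diverse_wrong))             # high: answers disagree
print(correctness_uncertainty(diverse_wrong, "18"))  # zero: all samples are wrong
```

Under these assumed definitions the two quantities can diverge sharply: a node whose samples are all different and all wrong has maximal answer entropy but no uncertainty about correctness.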
Key Findings
Token-likelihood and similar metrics correlate well with Answer Uncertainty (they measure answer diversity and model output variability).
The same metrics fail to estimate Correctness Uncertainty (uncertainty about whether the answer is correct).
Token-likelihood metrics are highly inter-correlated; verbalized-confidence (Intra) behaves differently.
Results
Correlation to Answer Uncertainty (AnsU)
Correlation to Correctness Uncertainty (CU)
Mutual correlation among token-likelihood metrics
Who Should Care
What To Try In 7 Days
Run the paper's sampling pipeline on a handful of your task prompts to check whether your uncertainty metric correlates with correctness.
Avoid using token-likelihood metrics alone to drive search when accuracy matters; add verification (e.g., unit checks, external validators).
When exploring prompt-space, measure both answer diversity and correctness separately and log their correlations.
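The logging step in the last item can be sketched with a rank correlation computed per prompt. The example below is a minimal standard-library Spearman implementation; the choice of Spearman, the function names, and the toy numbers are assumptions for illustration, not values or methods taken from the paper.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up per-prompt values, one entry per prompt.
metric_score = [0.2, 0.9, 0.5, 0.7]      # your uncertainty metric
answer_diversity = [0.1, 1.2, 0.6, 0.8]  # measured answer diversity
correctness_unc = [0.6, 0.0, 0.7, 0.1]   # measured correctness uncertainty

print("metric vs. answer diversity:", spearman(metric_score, answer_diversity))
print("metric vs. correctness:", spearman(metric_score, correctness_unc))
```

Logging both correlations side by side makes it visible when a metric follows answer diversity closely while carrying little or no signal about correctness.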
Reproducibility
Code Urls
- github link
Data Urls
- github link
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations run on two reasoning datasets (GSM8K, StrategyQA) only; results may differ on other tasks.
- Ground-truth uncertainties depend on the sampling density (M, K); the ground-truth estimates become more accurate with more samples.
- Only four black-box metrics were tested; proprietary methods and white-box methods that use model internals were not evaluated.
When Not To Use
- Don't use the evaluated NLG metrics alone when your prompt optimizer needs to find correct answers.
- Don't assume high generation confidence implies accuracy in domains where correctness matters (e.g., medical or legal).
Failure Modes
- Optimization driven by answer-diversity metrics can prefer varied but incorrect outputs.
- High inter-correlation among token-likelihood metrics can give a false sense of metric diversity.
- Verbalized confidence may not align with likelihood-based signals and can mislead hybrid strategies.
Core Entities
Models
- gpt-3.5-turbo
- meta-llama-3.1-8b-instruct
Metrics
- Normalized Predictive Entropy (NPE)
- Length-Normalized Predictive Entropy (LNPE)
- TopK-Token Disparity (Top-DISP)
- Intra-Sample Similarity / verbalized confidence (Intra)
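To make the token-likelihood family concrete, here is a minimal sketch of a Monte Carlo predictive-entropy estimate with and without length normalization. It assumes you already have per-token log-probabilities for each sampled output. The function names are hypothetical, and the exact NPE and LNPE definitions used in the paper may differ from these generic forms.

```python
def predictive_entropy(samples):
    """Mean negative sequence log-likelihood over sampled outputs.

    samples: list of sampled outputs, each a list of token log-probs.
    """
    return -sum(sum(tokens) for tokens in samples) / len(samples)

def length_normalized_predictive_entropy(samples):
    """Same estimate, but each sequence log-likelihood is divided by its length."""
    return -sum(sum(tokens) / len(tokens) for tokens in samples) / len(samples)

# Toy log-probs for two sampled outputs of different lengths.
samples = [
    [-0.5, -0.5],              # short output
    [-1.0, -1.0, -1.0, -1.0],  # longer output
]
print(predictive_entropy(samples))
print(length_normalized_predictive_entropy(samples))
```

Neither function ever sees a reference answer. Both scores are computed from the model's own likelihoods alone, so they describe how the model's outputs vary rather than whether those outputs are correct.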
Datasets
- GSM8K
- StrategyQA

