Overview
Production Readiness
0.4
Novelty Score
0.45
Cost Impact Score
0.55
Citation Count
49
Why It Matters For Business
When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when they are wrong. Use sampling plus aggregation, and validate calibration on your own data before trusting outputs.
Summary TLDR
The authors define a three-part black-box framework (prompting, sampling, aggregation) to get LLMs to verbalize confidence or infer it from response variance. They evaluate many strategies across eight QA-style datasets and five models (GPT-3, GPT-3.5, GPT-4, Vicuna, LLaMA2). Main findings: LLMs verbalize high confidence (often 80–100%) and are overconfident; scaling improves calibration and failure prediction but not enough; sampling multiple answers (M≈5) plus aggregation (Avg-Conf or Pair-Rank) greatly improves failure detection for some tasks (e.g., arithmetic); white-box methods still beat black-box but by a modest margin. Authors recommend Top-K + Self-Random + Avg-Conf/Pair-Rank as a
Problem Statement
Closed-source LLMs and API-only models do not expose logits or embeddings, so we need practical black-box ways to estimate model confidence for calibration and failure detection without fine-tuning or internal access.
Main Contribution
A unified black-box framework with three components: prompting, sampling, and aggregation for confidence elicitation.
Systematic benchmark across eight datasets (commonsense, arithmetic, symbolic, ethics, professional knowledge) and five LLMs including GPT-4 and LLaMA2.
Empirical findings: pervasive overconfidence, scaling partly helps, sampling+aggregation improves failure detection, and a small white-box vs black-box gap.
Practical guidance and open-source code (Top-K + Self-Random + Avg-Conf/Pair-Rank recommended).
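The aggregation component can be illustrated with a small sketch. It assumes each of the M sampled responses has already been parsed into an (answer, verbalized confidence) pair with confidence in [0, 1]. The function name `avg_conf`, the input layout, and the zero-confidence fallback are assumptions for illustration; the authors' released code may handle answer normalization and tie-breaking differently.

```python
from collections import defaultdict


def avg_conf(samples):
    """Aggregate (answer, confidence) pairs into (final_answer, confidence).

    The final answer is the one with the largest summed verbalized
    confidence; its score is that sum divided by the summed confidence of
    all samples, so agreement and stated confidence both contribute.
    """
    if not samples:
        raise ValueError("need at least one sample")

    totals = defaultdict(float)
    for answer, conf in samples:
        totals[answer] += conf

    grand_total = sum(totals.values())
    if grand_total == 0:
        # Assumed fallback: every confidence is zero, so use plain
        # agreement frequency instead of a confidence-weighted score.
        counts = defaultdict(int)
        for answer, _ in samples:
            counts[answer] += 1
        best = max(counts, key=counts.get)
        return best, counts[best] / len(samples)

    best = max(totals, key=totals.get)
    return best, totals[best] / grand_total


# Hypothetical M=5 samples for a single arithmetic question.
samples = [("42", 0.9), ("42", 0.8), ("41", 0.9), ("42", 1.0), ("40", 0.6)]
answer, confidence = avg_conf(samples)
```

In this made-up example three of five samples agree, so the aggregated confidence comes out well below the 0.8 to 1.0 values the individual samples verbalized, which is the effect the sampling and aggregation steps are meant to produce.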
Key Findings
LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.
Model size/capability improves calibration and failure prediction but remains imperfect.
Sampling multiple responses plus aggregation sharply improves failure prediction on some tasks (e.g., arithmetic).
Combining verbalized confidence with response agreement helps; different aggregators suit different goals.
White-box logit-based methods outperform black-box methods but the gap is modest.
Results
Vanilla verbalized confidence (calibration)
Failure prediction (sampling vs single)
Aggregation effect (calibration)
White-box vs black-box (AUROC)
Who Should Care
What To Try In 7 Days
Run Top-K prompt + Self-Random sampling (M=5) + Avg-Conf aggregation on a representative subset of your QA tasks.
Measure ECE and AUROC and plot confidence histograms to spot overconfidence.
If logits are available, benchmark a logit-based white-box method to compare performance.
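For the measurement step, a minimal dependency-free sketch of ECE and a confidence histogram might look like the following. It assumes confidences are in [0, 1] and uses ten equal-width bins, with 1.0 placed in the top bin; the bin count and function names are illustrative choices, not values taken from the paper.

```python
def _bin_index(conf, n_bins):
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the top bin.
    return min(int(conf * n_bins), n_bins - 1)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over non-empty bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[_bin_index(conf, n_bins)].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(mean_conf - accuracy)
    return ece


def confidence_histogram(confidences, n_bins=10):
    """Count predictions per confidence bin to make skew toward high values visible."""
    counts = [0] * n_bins
    for conf in confidences:
        counts[_bin_index(conf, n_bins)] += 1
    return counts


# Hypothetical results: confidences cluster high, accuracy does not.
confs = [0.95, 0.95, 0.85, 0.85, 0.55]
right = [True, False, True, False, True]
ece = expected_calibration_error(confs, right)
hist = confidence_histogram(confs)
```

A histogram with most of its mass in the top two bins, paired with a large ECE, is the overconfidence pattern described in the findings.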
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses on fixed-answer QA tasks; open-ended generation and summarization are not tested.
- Black-box methods remain worse than white-box; conclusions depend on sampled tasks and models.
- Verbalized confidence mirrors human phrasing and can be biased by training data; not inherently trustworthy.
When Not To Use
- Do not rely on raw verbalized confidence for safety-critical decisions.
- Avoid for open-ended or multi-answer tasks where ground truth is ambiguous.
- Avoid when cost prevents multiple queries (sampling M>1 required for best results).
Failure Modes
- High verbalized confidence for wrong answers (many incorrect samples at 100%).
- Poor failure prediction on tasks requiring professional knowledge (e.g., law).
- Increasing the number of sampled responses M gives diminishing returns while query cost grows linearly.
- Prompt wording can change calibration but no single prompt consistently wins.
Core Entities
Models
- GPT-3
- GPT-3.5-turbo
- GPT-4
- Vicuna-13B
- LLaMA 2 70B
Metrics
- Expected Calibration Error (ECE)
- AUROC
- AUPRC-Positive
- AUPRC-Negative
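For failure prediction, AUROC can be read as the probability that a randomly chosen correct answer receives higher confidence than a randomly chosen incorrect one, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The function name and inputs are illustrative, and the quadratic loop is only meant for small evaluation subsets; a library routine is the better choice at scale.

```python
def auroc(confidences, correct):
    """Pairwise AUROC: P(conf of a correct answer > conf of an incorrect one)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical confidences for two correct and two incorrect answers.
score = auroc([0.9, 0.8, 0.8, 0.6], [True, True, False, False])
```

A score near 0.5 means confidence carries almost no information about correctness, which is what happens when nearly every answer is verbalized at the same high value.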
Datasets
- GSM8K
- SVAMP
- StrategyQA
- SportUND
- DateUnd
- ObjectCounting
- Prf-Law (MMLU)
- Biz-Ethics (MMLU)
Benchmarks
- BigBench
- MMLU

