Overview
The paper runs controlled experiments across many models and public datasets and provides reproducible code; results are empirical and show practical trade-offs but do not fully solve failure detection.
Citations: 49
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 55%
Production readiness: 40%
Novelty: 45%
Why It Matters For Business
Naive verbalized confidence is unsafe in deployed LLMs: models often claim 80–100% confidence even when they are wrong. Sample several answers, aggregate them into a confidence score, and validate calibration on your own data before trusting outputs.
Summary TLDR
The authors define a three-part black-box framework (prompting, sampling, aggregation) to get LLMs to verbalize confidence or infer it from response variance. They evaluate many strategies across eight QA-style datasets and five models (GPT-3, GPT-3.5, GPT-4, Vicuna, LLaMA2). Main findings: LLMs verbalize high confidence (often 80–100%) and are overconfident; scaling improves calibration and failure prediction but not enough; sampling multiple answers (M≈5) plus aggregation (Avg-Conf or Pair-Rank) greatly improves failure detection for some tasks (e.g., arithmetic); white-box methods still beat black-box but by a modest margin. Authors recommend Top-K + Self-Random + Avg-Conf/Pair-Rank as a
Problem Statement
Closed-source LLMs and API-only models do not expose logits or embeddings, so we need practical black-box ways to estimate model confidence for calibration and failure detection without fine-tuning or internal access.
Main Contribution
A unified black-box framework with three components: prompting, sampling, and aggregation for confidence elicitation.
Systematic benchmark across eight datasets (commonsense, arithmetic, symbolic, ethics, professional knowledge) and five LLMs including GPT-4 and LLaMA2.
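The aggregation component can be illustrated with a minimal sketch. It assumes each of the M sampled responses has already been parsed into an answer string and a verbalized confidence in [0, 1]. The function names and the exact weighting are illustrative assumptions, not the authors' released code: `consistency_confidence` scores agreement among samples, and `avg_conf` weights that agreement by the verbalized confidences.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


def avg_conf(answers, confidences):
    """Confidence-weighted agreement (an assumed form of Avg-Conf).

    Each answer accumulates the verbalized confidence of the samples that
    produced it. The answer with the largest share of the total confidence
    mass is returned together with that share.
    """
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and aligned")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one confidence must be positive")
    mass = {}
    for answer, conf in zip(answers, confidences):
        mass[answer] = mass.get(answer, 0.0) + conf
    best = max(mass, key=mass.get)
    return best, mass[best] / total


# Hypothetical M=5 samples for one arithmetic question.
sampled_answers = ["18", "18", "20", "18", "20"]
sampled_confidences = [0.9, 0.8, 1.0, 0.9, 0.6]

print(consistency_confidence(sampled_answers))
print(avg_conf(sampled_answers, sampled_confidences))
```

Both scores fall when the samples disagree, which is the signal that the verbalized confidence of a single response does not provide.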
Key Findings
LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.
Larger, more capable models are better calibrated and better at failure prediction, but both remain imperfect.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Vanilla verbalized confidence (calibration) | ECE avg ≈ 0.52 (GPT-3) → 0.18 (GPT-4) across tasks | — | — | Table 2 (various datasets) | Table 2 shows high ECE for smaller models and lower ECE for GPT-4 | Table 2 |
| Failure prediction (sampling vs single) | GSM8K AUROC 0.548 (CoT M=1) → 0.927 (Self-Random M=5) | CoT M=1 | +0.379 AUROC | GSM8K | Table 3 shows consistency-based aggregation with M=5 greatly improves AUROC | Table 3 |
What To Try In 7 Days
Run Top-K prompt + Self-Random sampling (M=5) + Avg-Conf aggregation on a representative subset of your QA tasks.
Measure ECE and AUROC and plot confidence histograms to spot overconfidence.
If logits are available, benchmark a logit-based white-box method to compare performance.
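The two metrics named above can be computed without extra dependencies. The sketch below assumes you have one confidence in [0, 1] and one 0/1 correctness label per question. The ten equal-width bins and the pairwise AUROC formulation are standard choices made here for illustration and may differ in detail from the paper's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted mean gap between accuracy and confidence per bin."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and aligned")
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, label))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(label for _, label in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer outranks an incorrect one (ties count 0.5)."""
    right = [c for c, y in zip(confidences, correct) if y]
    wrong = [c for c, y in zip(confidences, correct) if not y]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if r > w else 0.5 if r == w else 0.0
        for r in right
        for w in wrong
    )
    return wins / (len(right) * len(wrong))


# Hypothetical scores: the model says 0.9 every time but is right once in four.
flat_conf = [0.9, 0.9, 0.9, 0.9]
flat_correct = [1, 0, 0, 0]
print(expected_calibration_error(flat_conf, flat_correct))
print(auroc(flat_conf, flat_correct))
```

In the hypothetical example, a uniformly high confidence yields a large ECE and an AUROC of 0.5, meaning the score carries no information about which answers are wrong. The pairwise AUROC is quadratic in the number of questions, which is fine for a pilot subset.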
Risks & Boundaries
Limitations
Focuses on fixed-answer QA tasks; open-ended generation and summarization are not tested.
Black-box methods remain worse than white-box; conclusions depend on sampled tasks and models.
When Not To Use
Do not rely on raw verbalized confidence for safety-critical decisions.
Avoid for open-ended or multi-answer tasks where ground truth is ambiguous.
Failure Modes
High verbalized confidence for wrong answers (many incorrect answers are stated with 100% confidence).
Poor failure prediction on tasks requiring professional knowledge (e.g., law).

