Black-box prompts plus sampling help, but LLMs stay overconfident and struggle to predict failures

Overview

Decision Snapshot: Needs Validation

The paper runs controlled experiments across many models and public datasets and provides reproducible code; results are empirical and show practical trade-offs but do not fully solve failure detection.

Citations: 49

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 40%

Novelty: 45%

Authors

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi

Links

Abstract / PDF / Code

Why It Matters For Business

When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when wrong, so use sampling + aggregation and validate calibration before trusting outputs.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The authors define a three-part black-box framework (prompting, sampling, aggregation) to get LLMs to verbalize confidence or infer it from response variance. They evaluate many strategies across eight QA-style datasets and five models (GPT-3, GPT-3.5, GPT-4, Vicuna, LLaMA2). Main findings: LLMs verbalize high confidence (often 80–100%) and are overconfident; scaling improves calibration and failure prediction but not enough; sampling multiple answers (M≈5) plus aggregation (Avg-Conf or Pair-Rank) greatly improves failure detection for some tasks (e.g., arithmetic); white-box methods still beat black-box but by a modest margin. Authors recommend Top-K + Self-Random + Avg-Conf/Pair-Rank as a

Problem Statement

Closed-source LLMs and API-only models do not expose logits or embeddings, so we need practical black-box ways to estimate model confidence for calibration and failure detection without fine-tuning or internal access.

Main Contribution

A unified black-box framework with three components: prompting, sampling, and aggregation for confidence elicitation.

Systematic benchmark across eight datasets (commonsense, arithmetic, symbolic, ethics, professional knowledge) and five LLMs including GPT-4 and LLaMA2.

Key Findings

LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.

Numbers: Confidence values fall mostly in the 80–100% range; many are expressed in multiples of 5.

Practical Use: Do not trust raw verbalized confidences in production; plot confidence histograms and recalibrate before acting on them.

Evidence Ref: Figure 2; Figure 5; Table 2
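A quick way to check for the skew described in this finding is to bin the verbalized confidences and look at how much mass sits at 80% or above. The following is a minimal sketch, not the authors' code: the function names and the sample values are made up for illustration, and it assumes confidences are reported on a 0–100 scale.

```python
from collections import Counter


def confidence_histogram(confidences, bin_width=10):
    """Count confidences (0-100 scale) per bin; a value of 100 lands in the top bin."""
    counts = Counter()
    top = 100 - bin_width
    for c in confidences:
        if not 0 <= c <= 100:
            raise ValueError(f"confidence out of range: {c}")
        lower_edge = min(int(c // bin_width) * bin_width, top)
        counts[lower_edge] += 1
    # Return every bin, including empty ones, in ascending order.
    return {lo: counts.get(lo, 0) for lo in range(0, 100, bin_width)}


def share_at_or_above(confidences, threshold=80):
    """Fraction of confidences at or above the threshold."""
    return sum(c >= threshold for c in confidences) / len(confidences)


# Hypothetical verbalized confidences, NOT data from the paper.
sample = [95, 90, 100, 85, 80, 90, 60, 100, 95, 70]
hist = confidence_histogram(sample)
for lo, n in hist.items():
    print(f"{lo:3d}-{lo + 10:3d} | {'#' * n}")
print("share >= 80:", share_at_or_above(sample))
```

If most of the mass sits in the top two bins regardless of whether answers are correct, the raw values carry little signal and need recalibration or a sampling-based estimate.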

Model size/capability improves calibration and failure prediction but remains imperfect.

Numbers: Average AUROC improved from ~0.513 (GPT-3) to 0.627 (GPT-4); average ECE fell from 0.52 (GPT-3) to 0.18 (GPT-4).

Practical Use: Prefer stronger models if available, but still validate confidence behavior; scaling helps but does not solve failure detection.

Evidence Ref: Table 2 (average AUROC and ECE across models)
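For readers who want to reproduce the calibration number on their own data, here is a minimal sketch of Expected Calibration Error using equal-width bins. The binning convention (10 bins, left-closed intervals) is an assumption of this sketch; the paper's exact setup may differ, so values are not directly comparable to Table 2 without checking its configuration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Each bin contributes |accuracy - mean confidence| weighted by its share of samples.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Left-closed bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical example: the model says 90% four times but is right only twice.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(round(overconfident, 3))
```

A lower value is better: 0 means stated confidence matches observed accuracy in every bin.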

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Vanilla verbalized confidence (calibration)	ECE avg ≈ 0.52 (GPT-3) → 0.18 (GPT-4) across tasks	—	—	Table 2 (various datasets)	Table 2 shows high ECE for smaller models and lower ECE for GPT-4	Table 2
Failure prediction (sampling vs single)	GSM8K AUROC 0.548 (CoT M=1) → 0.927 (Self-Random M=5)	CoT M=1	+0.379 AUROC	GSM8K	Table 3 shows consistency-based aggregation with M=5 greatly improves AUROC	Table 3
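The delta in the GSM8K row is simply 0.927 − 0.548 = 0.379. AUROC itself measures how well confidence separates correct from incorrect answers: it is the probability that a randomly chosen correct answer receives higher confidence than a randomly chosen incorrect one, with ties counted as half. A minimal pure-Python sketch of that definition follows; the function name and the example values are illustrative only, and for large evaluation sets a library implementation would be faster than this pairwise loop.

```python
def failure_prediction_auroc(confidences, correct):
    """AUROC of confidence as a predictor of correctness (pairwise definition)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical confidences, NOT data from the paper.
score = failure_prediction_auroc([0.9, 0.4, 0.6, 0.5], [True, False, False, True])
print(score)
```

A value of 0.5 means confidence is no better than chance at flagging failures, which is roughly where the single-sample baseline in the table sits.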

What To Try In 7 Days

Run Top-K prompt + Self-Random sampling (M=5) + Avg-Conf aggregation on a representative subset of your QA tasks.

Measure ECE and AUROC and plot confidence histograms to spot overconfidence.

If logits are available, benchmark a logit-based white-box method to compare performance.
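The aggregation step in the first item can be sketched as follows, assuming you already have M sampled (answer, verbalized confidence) pairs from your model. This is one reading of the two aggregators, not the authors' implementation: consistency is taken as the share of samples that agree with the chosen answer, and Avg-Conf as the confidence-weighted share. Choosing the final answer by majority vote is an assumption of this sketch; the linked repository is the reference for the exact definitions.

```python
from collections import defaultdict


def aggregate(samples):
    """Aggregate M sampled (answer, confidence) pairs, confidence in [0, 1].

    Returns (answer, consistency, avg_conf).
    """
    if not samples:
        raise ValueError("need at least one sample")
    weight = defaultdict(float)  # summed verbalized confidence per answer
    votes = defaultdict(int)     # number of samples per answer
    for answer, conf in samples:
        weight[answer] += conf
        votes[answer] += 1
    # Assumption: pick the majority answer, breaking ties by summed confidence.
    chosen = max(votes, key=lambda a: (votes[a], weight[a]))
    consistency = votes[chosen] / len(samples)
    total_weight = sum(weight.values())
    # Fall back to plain consistency if every sample reported zero confidence.
    avg_conf = weight[chosen] / total_weight if total_weight > 0 else consistency
    return chosen, consistency, avg_conf


# Hypothetical M=5 samples for one arithmetic question, NOT data from the paper.
samples = [("18", 0.9), ("18", 0.8), ("20", 0.9), ("18", 1.0), ("21", 0.4)]
answer, consistency, avg_conf = aggregate(samples)
print(answer, round(consistency, 3), round(avg_conf, 3))
```

Note how the aggregated scores (0.6 and about 0.675 here) are lower than any single verbalized confidence for the chosen answer, because disagreement across samples pulls them down. Feed these scores, rather than the raw verbalized values, into the ECE and AUROC measurements in the second item.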

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/MiaoXiong2320/llm-uncertainty

Risks & Boundaries

Limitations

Focuses on fixed-answer QA tasks; open-ended generation and summarization are not tested.

Black-box methods remain worse than white-box; conclusions depend on sampled tasks and models.

When Not To Use

Do not rely on raw verbalized confidence for safety-critical decisions.

Avoid for open-ended or multi-answer tasks where ground truth is ambiguous.

Failure Modes

High verbalized confidence for wrong answers (many incorrect samples at 100%).

Poor failure prediction on tasks requiring professional knowledge (e.g., law).

Core Entities

Models

GPT-3, GPT-3.5-turbo, GPT-4, Vicuna-13B, LLaMA 2 70B

Metrics

Expected Calibration Error (ECE), AUROC, AUPRC-Positive, AUPRC-Negative

Datasets

GSM8K, SVAMP, StrategyQA, SportUND, DateUnd, ObjectCounting, Prf-Law (MMLU), Biz-Ethics (MMLU)

Benchmarks

BigBench, MMLU
