Black-box prompts plus sampling help, but LLMs stay overconfident and struggle to predict failures

June 22, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper runs controlled experiments across many models and public datasets and provides reproducible code; results are empirical and show practical trade-offs but do not fully solve failure detection.

Citations: 49

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 55%

Production readiness: 40%

Novelty: 45%

Authors

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi

Why It Matters For Business

When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when they are wrong. Use sampling plus aggregation, and validate calibration before trusting outputs.

Summary TLDR

The authors define a three-part black-box framework (prompting, sampling, aggregation) that gets LLMs to verbalize confidence or infers it from the variance across responses. They evaluate many strategies across eight QA-style datasets and five models (GPT-3, GPT-3.5, GPT-4, Vicuna, LLaMA 2). Main findings: LLMs verbalize high confidence (often 80–100%) and are overconfident; scaling improves calibration and failure prediction, but not enough; sampling multiple answers (M≈5) plus aggregation (Avg-Conf or Pair-Rank) greatly improves failure detection on some tasks (e.g., arithmetic); white-box methods still beat black-box ones, but by a modest margin. The authors recommend combining Top-K prompting, Self-Random sampling, and Avg-Conf or Pair-Rank aggregation.

Problem Statement

Closed-source LLMs and API-only models do not expose logits or embeddings, so we need practical black-box ways to estimate model confidence for calibration and failure detection without fine-tuning or internal access.

Main Contribution

A unified black-box framework with three components: prompting, sampling, and aggregation for confidence elicitation.

Systematic benchmark across eight datasets (commonsense, arithmetic, symbolic, ethics, professional knowledge) and five LLMs including GPT-4 and LLaMA2.
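
To make the aggregation component concrete, here is a minimal Python sketch of two ways to turn M sampled (answer, confidence) pairs into one answer and one score. It assumes that a consistency score is the share of samples agreeing with the majority answer, and that Avg-Conf weights each vote by its verbalized confidence. The paper's exact definitions may differ, Pair-Rank is not shown, and the function names and sample values are illustrative only.

```python
from collections import defaultdict


def consistency(samples):
    """Majority answer and the fraction of samples that agree with it.

    `samples` is a list of (answer, confidence) pairs; confidence is ignored here.
    """
    counts = defaultdict(int)
    for answer, _ in samples:
        counts[answer] += 1
    best = max(counts, key=counts.get)
    return best, counts[best] / len(samples)


def avg_conf(samples):
    """Confidence-weighted vote (assumed reading of Avg-Conf).

    Each answer's score is its share of the total verbalized confidence.
    """
    weights = defaultdict(float)
    for answer, conf in samples:
        weights[answer] += conf
    total = sum(weights.values())
    if total == 0:
        # No usable confidences: fall back to plain agreement.
        return consistency(samples)
    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Made-up example: M = 5 samples for one arithmetic question.
samples = [("42", 0.9), ("42", 0.8), ("41", 0.95), ("42", 0.85), ("40", 0.9)]
print(consistency(samples))
print(avg_conf(samples))
```

In this made-up example every sample verbalizes 80% or more, yet only three of five agree, so both aggregated scores land well below any single verbalized value.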

Key Findings

LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.

Numbers: confidence values mostly in the 80–100% range; many expressed in multiples of 5

Practical Use: Do not trust raw verbalized confidences in production; plot confidence histograms and recalibrate before action.

Evidence Ref: Figure 2; Figure 5; Table 2
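
A quick way to check for this skew in your own logs is a plain text histogram. The sketch below assumes confidences are stored as percentages from 0 to 100; the function names and the sample values are made up for illustration and are not taken from the paper.

```python
def confidence_report(confidences, bin_width=10):
    """Bin percentage confidences and summarize how skewed they are.

    Returns (bins, share of values >= 80, share of values that are multiples of 5).
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    n_bins = 100 // bin_width
    bins = [0] * n_bins
    for c in confidences:
        # A value of exactly 100 goes into the last bin.
        idx = min(int(c // bin_width), n_bins - 1)
        bins[idx] += 1
    n = len(confidences)
    high_share = sum(1 for c in confidences if c >= 80) / n
    mult5_share = sum(1 for c in confidences if c % 5 == 0) / n
    return bins, high_share, mult5_share


def print_histogram(bins, bin_width=10):
    """Print one row per bin with a bar of '#' characters."""
    for i, count in enumerate(bins):
        lo = i * bin_width
        print(f"{lo:3d}-{lo + bin_width:3d}% | {'#' * count} {count}")


# Hypothetical verbalized confidences collected from model replies.
confs = [95, 90, 100, 85, 80, 70, 95, 100, 90, 60]
bins, high_share, mult5_share = confidence_report(confs)
print_histogram(bins)
print(f"share in 80-100%: {high_share:.2f}, multiples of 5: {mult5_share:.2f}")
```

If most of the mass sits in the top two bins regardless of task difficulty, the raw values are unlikely to separate right from wrong answers without recalibration.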

Model size/capability improves calibration and failure prediction but remains imperfect.

Numbers: Average AUROC improved from ~0.513 (GPT-3) to 0.627 (GPT-4); average ECE fell from 0.52 (GPT-3) to 0.18 (GPT-4)

Practical Use: Prefer stronger models if available, but still validate confidence behavior; scaling helps but doesn't solve failure detection.

Evidence Ref: Table 2 (average AUROC and ECE across models)
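
For readers who want to reproduce these two metrics on their own data, the following is a minimal, dependency-free sketch. It uses the standard equal-width-bin definition of ECE and the pairwise-ranking definition of AUROC; the bin count of 10 and the example values are assumptions, and the paper's evaluation code may be configured differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    Sums, over non-empty bins, (bin size / N) * |accuracy - mean confidence|.
    """
    n = len(confidences)
    totals = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        totals[idx] += 1
        conf_sums[idx] += c
        hits[idx] += int(bool(y))
    ece = 0.0
    for t, s, h in zip(totals, conf_sums, hits):
        if t:
            ece += (t / n) * abs(h / t - s / t)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than a wrong one.

    Ties count as 0.5. O(P * N) pairwise version, fine for small evaluation sets.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: confidence in [0, 1] and whether each answer was correct.
confs = [0.95, 0.95, 0.95, 0.95, 0.65]
correct = [1, 0, 0, 1, 0]
print(round(expected_calibration_error(confs, correct), 3))
print(round(auroc(confs, correct), 3))
```

An AUROC near 0.5 means confidence is no better than chance at flagging wrong answers, which is roughly where the GPT-3 average above sits.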

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Vanilla verbalized confidence (calibration) | ECE avg ≈ 0.52 (GPT-3) → 0.18 (GPT-4) across tasks | - | - | Table 2 (various datasets) | Table 2 shows high ECE for smaller models and lower ECE for GPT-4 | Table 2
Failure prediction (sampling vs single) | GSM8K AUROC 0.548 (CoT M=1) → 0.927 (Self-Random M=5) | CoT M=1 | +0.379 AUROC | GSM8K | Table 3 shows consistency-based aggregation with M=5 greatly improves AUROC | Table 3

What To Try In 7 Days

Run Top-K prompt + Self-Random sampling (M=5) + Avg-Conf aggregation on a representative subset of your QA tasks.

Measure ECE and AUROC and plot confidence histograms to spot overconfidence.

If logits are available, benchmark a logit-based white-box method to compare performance.
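
As a starting point for the first step, the sketch below shows one way to phrase a Top-K style prompt and parse the reply into (answer, probability) pairs. The template wording, the `G1: ..., P1: ...` reply format, and the names `TOP_K_TEMPLATE` and `parse_top_k` are assumptions made for illustration; they are not the paper's exact prompt, and the call to the model itself is left out.

```python
import re

# Hypothetical prompt template; adjust the wording to your own task.
TOP_K_TEMPLATE = (
    "Provide your {k} best guesses and the probability (0-100) that each is correct.\n"
    "Use exactly this format, one guess per line:\n"
    "G1: <answer>, P1: <probability>\n"
    "Question: {question}"
)

# Matches lines such as "G2: 20, P2: 10%"; the backreference \1 requires
# the guess index and the probability index to agree.
_LINE = re.compile(r"G(\d+):\s*(.+?)\s*,\s*P\1:\s*(\d+(?:\.\d+)?)\s*%?")


def parse_top_k(reply):
    """Extract (answer, probability in [0, 1]) pairs from a model reply."""
    guesses = []
    for m in _LINE.finditer(reply):
        prob = min(max(float(m.group(3)), 0.0), 100.0) / 100.0
        guesses.append((m.group(2), prob))
    return guesses


prompt = TOP_K_TEMPLATE.format(k=2, question="What is 6 * 3?")
reply = "G1: 18, P1: 90%\nG2: 20, P2: 10%"  # stand-in for a real model reply
print(parse_top_k(reply))
```

Repeating this call M=5 times at a non-zero temperature and keeping the top guess from each reply yields the (answer, confidence) samples that an aggregation step such as Avg-Conf needs. Replies that do not follow the format parse to an empty list, so count those separately rather than treating them as zero confidence.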

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focuses on fixed-answer QA tasks; open-ended generation and summarization are not tested.

Black-box methods remain worse than white-box; conclusions depend on sampled tasks and models.

When Not To Use

Do not rely on raw verbalized confidence for safety-critical decisions.

Avoid for open-ended or multi-answer tasks where ground truth is ambiguous.

Failure Modes

High verbalized confidence for wrong answers (many incorrect samples at 100%).

Poor failure prediction on tasks requiring professional knowledge (e.g., law).

Core Entities

Models

GPT-3, GPT-3.5-turbo, GPT-4, Vicuna-13B, LLaMA 2 70B

Metrics

Expected Calibration Error (ECE), AUROC, AUPRC-Positive, AUPRC-Negative

Datasets

GSM8K, SVAMP, StrategyQA, SportUND, DateUnd, ObjectCounting, Prf-Law (MMLU), Biz-Ethics (MMLU)

Benchmarks

BigBench, MMLU