Black-box prompts plus sampling help, but LLMs stay overconfident and struggle to predict failures

June 22, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.45

Cost Impact Score

0.55

Citation Count

49

Authors

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi

Why It Matters For Business

When deploying LLMs, naive verbalized confidence is unsafe: models often claim 80–100% confidence even when wrong, so use sampling + aggregation and validate calibration before trusting outputs.

Summary TLDR

The authors define a three-part black-box framework (prompting, sampling, aggregation) to get LLMs to verbalize confidence or infer it from response variance. They evaluate many strategies across eight QA-style datasets and five models (GPT-3, GPT-3.5, GPT-4, Vicuna, LLaMA2). Main findings: LLMs verbalize high confidence (often 80–100%) and are overconfident; scaling improves calibration and failure prediction but not enough; sampling multiple answers (M≈5) plus aggregation (Avg-Conf or Pair-Rank) greatly improves failure detection for some tasks (e.g., arithmetic); white-box methods still beat black-box but by a modest margin. Authors recommend Top-K + Self-Random + Avg-Conf/Pair-Rank as a

Problem Statement

Closed-source LLMs and API-only models do not expose logits or embeddings, so we need practical black-box ways to estimate model confidence for calibration and failure detection without fine-tuning or internal access.

Main Contribution

A unified black-box framework with three components: prompting, sampling, and aggregation for confidence elicitation.

Systematic benchmark across eight datasets (commonsense, arithmetic, symbolic, ethics, professional knowledge) and five LLMs including GPT-4 and LLaMA2.

Empirical findings: pervasive overconfidence, scaling partly helps, sampling+aggregation improves failure detection, and a small white-box vs black-box gap.

Practical guidance and open-source code (Top-K + Self-Random + Avg-Conf/Pair-Rank recommended).
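The sampling and aggregation components above can be sketched as follows. This is a minimal illustration and not the authors' released code. `query_fn` is a hypothetical callable standing in for whatever API call returns one (answer, verbalized confidence) pair. The `avg_conf` function is one plausible reading of confidence-weighted agreement, and picking the answer with the largest summed confidence is an assumption. Pair-Rank is not sketched here.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

# One sampled response: (answer string, verbalized confidence in [0, 1]).
Sample = Tuple[str, float]


def sample_responses(query_fn: Callable[[str], Sample], prompt: str, m: int = 5) -> List[Sample]:
    """Query the model m times with the same prompt (query_fn is hypothetical)."""
    return [query_fn(prompt) for _ in range(m)]


def consistency(samples: List[Sample]) -> Tuple[str, float]:
    """Majority answer and the fraction of samples that agree with it."""
    counts = Counter(answer for answer, _ in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def avg_conf(samples: List[Sample]) -> Tuple[str, float]:
    """Agreement weighted by verbalized confidence (assumed formulation)."""
    weight = defaultdict(float)
    for answer, conf in samples:
        weight[answer] += conf
    total = sum(weight.values())
    if total == 0:
        # All confidences are zero: fall back to plain agreement.
        return consistency(samples)
    answer = max(weight, key=weight.get)
    return answer, weight[answer] / total
```

With five samples where three agree, `consistency` reports 0.6 for the majority answer, while `avg_conf` shifts that score up or down depending on how confident the agreeing samples claimed to be.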

Key Findings

LLMs output verbalized confidences heavily skewed to high values (80–100%), causing overconfidence.

Numbers: confidence values mostly in the 80–100% range; many expressed in multiples of 5
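A quick way to check whether your own model shows the same skew is to compute two shares over a batch of elicited confidences. The sketch below assumes confidences have already been parsed into percentages (0 to 100). The function name and the returned keys are illustrative, not from the paper.

```python
from typing import Dict, Sequence


def confidence_skew(confidences_pct: Sequence[float]) -> Dict[str, float]:
    """Share of confidences in the 80-100 range and share that are multiples of 5."""
    if not confidences_pct:
        raise ValueError("need at least one confidence value")
    n = len(confidences_pct)
    high = sum(1 for c in confidences_pct if 80 <= c <= 100)
    multiple_of_5 = sum(1 for c in confidences_pct if c % 5 == 0)
    return {
        "share_80_100": high / n,
        "share_multiple_of_5": multiple_of_5 / n,
    }
```

If both shares are close to 1.0, the raw verbalized scores carry little ranking information, which is the situation the finding above describes.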

Model size/capability improves calibration and failure prediction but remains imperfect.

Numbers: average AUROC improved from ~0.513 (GPT-3) to 0.627 (GPT-4); average ECE fell from 0.52 (GPT-3) to 0.18 (GPT-4)

Sampling multiple responses plus aggregation sharply improves failure prediction on some tasks.

Numbers: GSM8K AUROC rose from 0.548 with CoT (M=1) to 0.927 with Self-Random (M=5)

Combining verbalized confidence with response agreement helps; different aggregators suit different goals.

Numbers: Pair-Rank reduced ECE to ~0.028 (best calibration); Avg-Conf gave the best AUROC across datasets

White-box logit-based methods outperform black-box methods but the gap is modest.

Numbers: example AUROC gap of 0.522 (black-box) vs 0.605 (white-box) on some datasets

Results

Vanilla verbalized confidence (calibration)

Value: average ECE ≈ 0.52 (GPT-3) → 0.18 (GPT-4) across tasks

Failure prediction (sampling vs single)

Value: GSM8K AUROC 0.548 (CoT, M=1) → 0.927 (Self-Random, M=5)

Baseline: CoT, M=1

Aggregation effect (calibration)

Value: Pair-Rank mean ECE ≈ 0.069; Consistency mean ECE ≈ 0.12 (GPT-4)

Baseline: Consistency

White-box vs black-box (AUROC)

Value: example AUROC 0.522 (black-box) → 0.605 (white-box)

Baseline: black-box verbalized confidence

Who Should Care

What To Try In 7 Days

Run Top-K prompt + Self-Random sampling (M=5) + Avg-Conf aggregation on a representative subset of your QA tasks.

Measure ECE and AUROC and plot confidence histograms to spot overconfidence.

If logits are available, benchmark a logit-based white-box method to compare performance.
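For the measurement step, a dependency-free sketch of both metrics is shown below. It assumes confidences are in [0, 1] and that correctness labels come from comparing answers against ground truth. The equal-width binning with 10 bins is a common default and an assumption here, not a setting confirmed by the source.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between accuracy and mean confidence over equal-width bins."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    # Pairwise comparison; ties count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of examples, which is fine for a small pilot subset; for larger evaluations a library routine would be the more practical choice.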

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses on fixed-answer QA tasks; open-ended generation and summarization are not tested.
  • Black-box methods remain worse than white-box; conclusions depend on sampled tasks and models.
  • Verbalized confidence mirrors human phrasing and can be biased by training data; not inherently trustworthy.

When Not To Use

  • Do not rely on raw verbalized confidence for safety-critical decisions.
  • Avoid for open-ended or multi-answer tasks where ground truth is ambiguous.
  • Avoid when cost prevents multiple queries (sampling M>1 required for best results).

Failure Modes

  • High verbalized confidence for wrong answers (many incorrect samples at 100%).
  • Poor failure prediction on tasks requiring professional knowledge (e.g., law).
  • Diminishing returns against linearly growing query cost as the number of sampled responses M increases.
  • Prompt wording can change calibration but no single prompt consistently wins.

Core Entities

Models

  • GPT-3
  • GPT-3.5-turbo
  • GPT-4
  • Vicuna-13B
  • LLaMA 2 70B

Metrics

  • Expected Calibration Error (ECE)
  • AUROC
  • AUPRC-Positive
  • AUPRC-Negative
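
For reference, ECE is commonly defined with the standard binned formulation below. The symbols are introduced here for illustration: $N$ predictions are split into $B$ confidence bins $S_1, \dots, S_B$, $\mathrm{acc}(S_b)$ is the fraction of correct answers in bin $b$, and $\mathrm{conf}(S_b)$ is the mean confidence in that bin. The source does not state which binning scheme the authors used.

```latex
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{|S_b|}{N}\,
  \bigl|\, \mathrm{acc}(S_b) - \mathrm{conf}(S_b) \,\bigr|
```

A perfectly calibrated model has $\mathrm{acc}(S_b) = \mathrm{conf}(S_b)$ in every bin, giving an ECE of 0; a model that always claims 100% confidence at 50% accuracy has an ECE of 0.5.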

Datasets

  • GSM8K
  • SVAMP
  • StrategyQA
  • SportUnd
  • DateUnd
  • ObjectCounting
  • Prf-Law (MMLU)
  • Biz-Ethics (MMLU)

Benchmarks

  • BigBench
  • MMLU