Overview
Large-scale, multi-model experiments provide strong empirical evidence, but findings are limited to structured scientific QA, specific prompts, and open-weight models.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 45%
Why It Matters For Business
If you show model confidence to users or gate answers by uncertainty, current off-the-shelf signals (token probabilities, verbalized self-reports) can be misleading for multi-step scientific answers. Sampling-based consistency is more dependable but costly.
Who Should Care
Summary TLDR
The authors build an open benchmarking framework and run 685,000 long-form LLM answers over seven science and math QA datasets to test uncertainty quantification (UQ) and calibration. They report five findings: token-level probabilities become polarized after instruction tuning; ECE can be misleading; verbalized self-reports and P(True) correlate poorly with correctness; Claim-Conditioned Probability (CCP) collapses on long outputs; and Frequency-of-Answer (sampling plus semantic clustering) gives the most reliable sequence-level signal, but at high cost.
Problem Statement
Current UQ methods for large language models are weakly validated on long-form, reasoning-heavy scientific QA. We need to know which uncertainty scores actually track correctness for answers that involve multi-step reasoning.
Main Contribution
A large open benchmark and reproducible framework for calibration-focused UQ in long-form scientific QA.
Systematic comparison of token-level, verbalized, semantic-consistency, and CCP methods across up to 20 models and seven datasets (685,000 responses).
Key Findings
Token-level probabilities become highly polarized after instruction tuning, concentrating nearly all probability on one label.
Expected Calibration Error (ECE) can mislead when models report consistently high confidences.
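The ECE caveat above can be made concrete with toy numbers. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes equal-width binning for ECE and a pairwise definition of AUROC, and all data are invented for the example. A model that always reports 0.9 and is right 90% of the time gets a near-zero ECE even though its score cannot separate correct from incorrect answers (AUROC 0.5), while a score that ranks the answers perfectly gets a worse ECE.

```python
from typing import List, Sequence


def expected_calibration_error(conf: Sequence[float], correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(conf):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append(i)
    n = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(correct[i] for i in members) / len(members)
        avg_conf = sum(conf[i] for i in members) / len(members)
        ece += (len(members) / n) * abs(acc - avg_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a random correct answer outscores a random incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented): nine correct answers, one incorrect.
correct = [1] * 9 + [0]

# Collapsed score: the model says 0.9 for everything.
collapsed = [0.9] * 10
ece_collapsed = expected_calibration_error(collapsed, correct)
auroc_collapsed = auroc(collapsed, correct)

# Informative score: lower confidence on the one wrong answer.
informative = [0.95] * 9 + [0.55]
ece_informative = expected_calibration_error(informative, correct)
auroc_informative = auroc(informative, correct)

print(f"collapsed:   ECE={ece_collapsed:.3f}  AUROC={auroc_collapsed:.2f}")
print(f"informative: ECE={ece_informative:.3f}  AUROC={auroc_informative:.2f}")
```

Reporting a ranking metric next to ECE exposes the difference that ECE alone hides in this toy case.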
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total long-form responses evaluated | 685,000 | — | — | all datasets and models | Sections 2, 5.1, A.5 | Sections 2, A.5 |
| Responses per model (sequence experiment) | 57,500 | — | — | per model (10 samples × subsampled items) | Sections 7.2, A.5 | Section 7.2 |
What To Try In 7 Days
Add calibration plots and AUROC alongside ECE when evaluating model confidences.
Implement small-sample frequency-of-answer (5–10 draws) for high-stakes queries and cluster answers semantically.
Avoid relying on verbalized self-confidence or single-token label probabilities for instruction-tuned models.
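A small-sample frequency-of-answer check could be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the `normalize` function is a hypothetical stand-in for semantic clustering (a real setup would group answers by meaning, for example with an entailment or embedding model), and the sampled answers are invented.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering:
    trim whitespace, lowercase, and drop trailing periods."""
    return answer.strip().lower().rstrip(".")


def frequency_of_answer(samples: List[str],
                        cluster_key: Callable[[str], str] = normalize
                        ) -> Tuple[str, float]:
    """Return the largest answer cluster and the share of samples that fall in it."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_key(s) for s in samples)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(samples)


# Ten invented draws for a single question.
samples = ["42", " 42.", "42", "41", "42", "forty-two", "42", "42", "43", "42"]
answer, confidence = frequency_of_answer(samples)
print(answer, confidence)
```

Note that the naive key treats "forty-two" as a different answer from "42", which is exactly the kind of miss that proper semantic clustering is meant to avoid.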
Risks & Boundaries
Limitations
Focus on structured scientific QA and multiple-choice/arithmetic tasks limits generalization to open-ended generation and other domains.
Only normalized, sequence-level UQ methods were benchmarked; many unnormalized or ensemble methods were excluded.
When Not To Use
Do not use token-level probabilities as per-instance uncertainty for instruction-tuned models.
Avoid CCP multiplicative aggregation for long answers without reworked aggregation.
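The length sensitivity of multiplicative aggregation is simple arithmetic. The sketch below assumes the sequence-level score is the plain product of per-claim probabilities, and uses invented per-claim values of 0.95. The geometric mean is shown only as one hypothetical length-normalized alternative; it is not presented here as the reworked aggregation the source has in mind.

```python
import math
from typing import Sequence


def product_aggregate(claim_probs: Sequence[float]) -> float:
    """Multiply per-claim probabilities (computed in log space for stability)."""
    if not claim_probs or any(not 0.0 < p <= 1.0 for p in claim_probs):
        raise ValueError("need a non-empty sequence of probabilities in (0, 1]")
    return math.exp(sum(math.log(p) for p in claim_probs))


def geometric_mean_aggregate(claim_probs: Sequence[float]) -> float:
    """Hypothetical alternative: length-normalized product (geometric mean)."""
    if not claim_probs or any(not 0.0 < p <= 1.0 for p in claim_probs):
        raise ValueError("need a non-empty sequence of probabilities in (0, 1]")
    return math.exp(sum(math.log(p) for p in claim_probs) / len(claim_probs))


# Invented example: every claim is individually rated 0.95.
short_answer = [0.95] * 3    # a brief answer with 3 claims
long_answer = [0.95] * 60    # a long-form answer with 60 claims

print(f"product, 3 claims:   {product_aggregate(short_answer):.3f}")
print(f"product, 60 claims:  {product_aggregate(long_answer):.3f}")
print(f"geo-mean, 60 claims: {geometric_mean_aggregate(long_answer):.3f}")
```

With identical per-claim quality, the product falls from roughly 0.86 for three claims to under 0.05 for sixty, so the score ends up tracking answer length rather than correctness.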
Failure Modes
Token-probability polarization: nearly all probability mass on a single token hides uncertainty.
ECE confounding: high accuracy or collapsed scores can mask poor instance-level calibration.

