Large-scale benchmark shows common LLM uncertainty signals are unreliable for long scientific answers

January 30, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

Large-scale, multi-model experiments provide strong empirical evidence, but findings are limited to structured scientific QA, specific prompts, and open-weight models.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 45%

Authors

Philip Müller, Nicholas Popovič, Michael Färber, Peter Steinbach

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you show model confidence to users or gate answers on uncertainty, current off-the-shelf signals (token probabilities, verbalized self-reports) can be misleading for multi-step scientific answers. Sampling-based consistency is more dependable but costly.

Who Should Care

Summary TLDR

The authors build an open benchmarking framework and evaluate 685,000 long-form LLM answers across seven science and math QA datasets to test uncertainty quantification (UQ) and calibration. They find that token-level probabilities become polarized after instruction tuning, that ECE can be misleading, that verbalized self-reports and P(True) correlate poorly with correctness, and that Claim-Conditioned Probability (CCP) collapses on long outputs. Frequency-of-Answer (sampling plus semantic clustering) gives the most reliable sequence-level signal, but at high cost.

Problem Statement

Current UQ methods for large language models are weakly validated on long-form, reasoning-heavy scientific QA. We need to know which uncertainty scores actually track correctness for answers that involve multi-step reasoning.

Main Contribution

A large open benchmark and reproducible framework for calibration-focused UQ in long-form scientific QA.

Systematic comparison of token-level, verbalized, semantic-consistency, and CCP methods across up to 20 models and seven datasets (685,000 responses).

Key Findings

Token-level probabilities become highly polarized after instruction tuning, concentrating nearly all probability on one label.

Numbers: Task-level runs: 685,000 responses; prompt task-comprehension mean up to 0.9912 for instruct models

Practical Use: Do not treat raw token probabilities from instruction-tuned models as reliable per-instance uncertainty; rely on sequence-level checks or re-normalization.

Evidence Ref: Sections 6, A.8.2; Figures 1 and A.2
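To make "polarized" concrete, the following minimal sketch renormalizes logits over the answer labels and reports the top probability and the entropy of the resulting distribution. The logit values are invented for illustration, and `label_distribution` and `entropy_bits` are hypothetical helpers, not part of the authors' framework.

```python
import math

def label_distribution(logits):
    """Softmax restricted to the answer-label logits (e.g. A, B, C, D)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; values near 0 mean almost all mass is on one label."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented logits for four labels; these are NOT values from the paper.
base_like = [2.0, 1.5, 1.0, 0.5]
instruct_like = [12.0, 1.5, 1.0, 0.5]

for name, logits in [("base-like", base_like), ("instruct-like", instruct_like)]:
    probs = label_distribution(logits)
    print(name, round(max(probs), 4), round(entropy_bits(probs), 4))
```

When one logit dominates, the top probability sits near 1 and the entropy near 0 whether or not the answer is correct, so the score carries little per-instance information.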

Expected Calibration Error (ECE) can mislead when models report consistently high confidences.

Numbers: In one example, ECE is 0.1198 while AUROC is only 0.6489, so a moderate ECE coexists with weak instance-level alignment between confidence and correctness.

Practical Use: Always pair ECE with calibration plots and complementary metrics (AUROC, visual bins) instead of using ECE alone.

Evidence Ref: Section 6.3; Figure 2
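The sketch below illustrates how ECE and AUROC can disagree. It assumes equal-width binning for ECE and a pairwise definition of AUROC; the paper's exact binning scheme is not stated here, and the data are invented. Both function names are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

def auroc(confidences, correct):
    """Chance that a correct answer gets higher confidence than an incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both correct and incorrect answers")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented data: 9 of 10 answers are correct and the model always reports 0.9.
conf = [0.9] * 10
correct = [1] * 9 + [0]
print("ECE:", expected_calibration_error(conf, correct))
print("AUROC:", auroc(conf, correct))
```

In this toy case ECE is essentially zero because average confidence matches accuracy, yet AUROC is 0.5 because the constant score cannot separate the wrong answer from the right ones.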

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Total long-form responses evaluated | 685,000 | | | all datasets and models | Section 2, 5.1, A.5 | Sections 2, A.5 |
| Responses per model (sequence experiment) | 57,500 | | | per model (10 samples × subsampled items) | Section 7.2, A.5 | Section 7.2 |

What To Try In 7 Days

Add calibration plots and AUROC alongside ECE when evaluating model confidences.

Implement small-sample frequency-of-answer (5–10 draws) for high-stakes queries and cluster answers semantically.

Avoid relying on verbalized self-confidence or single-token label probabilities for instruction-tuned models.
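A small-sample Frequency-of-Answer check could be prototyped roughly as follows. This is a sketch under stated assumptions: `normalize_answer` is a crude, hypothetical stand-in for real semantic clustering, the sampled answers are invented, and the authors' framework may implement the method differently.

```python
from collections import Counter

def normalize_answer(text):
    """Crude stand-in for semantic clustering: trim whitespace, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".")

def frequency_of_answer(samples, cluster_key=normalize_answer):
    """Return (majority answer cluster, share of samples that fall in that cluster)."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_key(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Invented final answers extracted from 5 sampled generations for one question.
samples = ["42", " 42.", "42", "41", "42"]
answer, confidence = frequency_of_answer(samples)
print(answer, confidence)  # 42 0.8
```

The share of samples in the majority cluster serves as the sequence-level confidence. Cost grows linearly with the number of draws, which is why the 5 to 10 range is suggested only for high-stakes queries.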

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Focus on structured scientific QA and multiple-choice/arithmetic tasks limits generalization to open-ended generation and other domains.

Only normalized, sequence-level UQ methods were benchmarked; many unnormalized or ensemble methods were excluded.

When Not To Use

Do not use token-level probabilities as per-instance uncertainty for instruction-tuned models.

Avoid CCP's multiplicative aggregation for long answers unless the aggregation is reworked.
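The arithmetic behind the aggregation warning can be sketched as follows, assuming a sequence-level score formed by multiplying per-claim confidences. The per-claim value of 0.95 is invented, and the geometric mean is shown only as one possible length-normalized contrast, not as the authors' proposed fix.

```python
import math

def product_aggregate(claim_probs):
    """Multiply per-claim confidences into a single sequence-level score."""
    return math.prod(claim_probs)

def geometric_mean(claim_probs):
    """Length-normalized alternative, included only for contrast."""
    return math.exp(sum(math.log(p) for p in claim_probs) / len(claim_probs))

for n_claims in (1, 10, 50, 100):
    probs = [0.95] * n_claims  # invented per-claim confidence
    print(n_claims, round(product_aggregate(probs), 4), round(geometric_mean(probs), 4))
```

Even with every claim at 0.95, the product drops below 0.01 by 100 claims, so long answers receive near-zero scores regardless of correctness, while the length-normalized score stays at 0.95.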

Failure Modes

Token-probability polarization: nearly all probability mass on a single token hides uncertainty.

ECE confounding: high accuracy or collapsed scores can mask poor instance-level calibration.

Core Entities

Models

gpt-oss-20b, gpt-oss-120b, Llama-3.1-70B, Llama-3.3-70B-Instruct, Mistral-Nemo-Base-2407, Mistral-Nemo-Instruct-2407, Mistral-Small-3.1-24B-Base-2503, Mistral-Small-3.2-24B-Instruct-2506, Magistral-Small-2507, Qwen3-30B-A3B-Base, Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, gemma-3-27b-it

Metrics

Expected Calibration Error (ECE), AUROC, Frequency of Answer, Verbalized Uncertainty, P(True), Claim-Conditioned Probability (CCP)

Datasets

MMLU, ARC (Easy and Challenge), SciQ, GPQA, GSM8K, GSM-MC, SVAMP, SciBench

Benchmarks

llm-uncertainty-bench (repository benchmark)