Large-scale benchmark shows common LLM uncertainty signals are unreliable for long scientific answers

Overview

Decision Snapshot: Needs Validation

Large-scale, multi-model experiments provide strong empirical evidence, but findings are limited to structured scientific QA, specific prompts, and open-weight models.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 45%

Authors

Philip Müller, Nicholas Popovič, Michael Färber, Peter Steinbach

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you show model confidence to users or gate answers by uncertainty, current off-the-shelf signals (token probs, verbal reports) can be misleading for multi-step scientific answers; sampling-based consistency is more dependable but costly.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The authors build an open benchmarking framework and evaluate 685,000 long-form LLM answers across seven science and math QA datasets to test uncertainty quantification (UQ) and calibration. They find that token-level probabilities become polarized after instruction tuning, that ECE can be misleading, that verbalized self-reports and P(True) correlate poorly with correctness, and that Claim-Conditioned Probability (CCP) collapses on long outputs. Frequency-of-Answer (sampling plus semantic clustering) gives the most reliable sequence-level signal, but at high cost.

Problem Statement

Current UQ methods for large language models are weakly validated on long-form, reasoning-heavy scientific QA. We need to know which uncertainty scores actually track correctness for answers that involve multi-step reasoning.

Main Contribution

A large open benchmark and reproducible framework for calibration-focused UQ in long-form scientific QA.

Systematic comparison of token-level, verbalized, semantic-consistency, and CCP methods across up to 20 models and seven datasets (685,000 responses).

Key Findings

Token-level probabilities become highly polarized after instruction tuning, concentrating nearly all probability on one label.

Numbers: Task-level runs: 685,000 responses; prompt task-comprehension mean up to 0.9912 for instruct models

Practical Use: Do not treat raw token probabilities from instruction-tuned models as reliable per-instance uncertainty; rely on sequence-level checks or re-normalization.

Evidence Ref: Sections 6, A.8.2; Figure 1 and A.2
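To make the polarization finding concrete, the following is a minimal sketch that re-normalizes log-probabilities over a set of answer labels and reports the entropy of the result. The function names and the example log-probability values are hypothetical and chosen only for illustration; they are not taken from the paper or its repository.

```python
import math
from typing import Dict


def renormalize(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the label log-probabilities so they sum to 1."""
    m = max(label_logprobs.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def entropy_bits(probs: Dict[str, float]) -> float:
    """Shannon entropy in bits; close to 0 means nearly all mass on one label."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical log-probabilities for labels A-D (illustrative values only).
polarized = {"A": -0.001, "B": -8.0, "C": -9.0, "D": -10.0}
spread = {"A": -1.2, "B": -1.4, "C": -1.5, "D": -1.6}

for name, logprobs in [("polarized", polarized), ("spread", spread)]:
    probs = renormalize(logprobs)
    print(name, round(max(probs.values()), 4), round(entropy_bits(probs), 4))
```

When the entropy is close to zero for almost every question, the label distribution carries little per-instance information, which is the situation the finding warns about.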

Expected Calibration Error (ECE) can mislead when models report consistently high confidences.

Numbers: Example: an ECE of 0.1198 alongside an AUROC of only 0.6489, indicating weak instance-level alignment

Practical Use: Always pair ECE with calibration plots and complementary metrics (AUROC, visual bins) instead of using ECE alone.

Evidence Ref: Section 6.3; Figure 2
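The gap between ECE and AUROC can be reproduced with a toy example. The sketch below uses only the standard library and assumes equal-width binning for ECE and a pairwise (Mann-Whitney style) definition of AUROC; the paper's exact binning scheme is not stated here, so treat both functions as illustrative rather than as the authors' implementation.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(conf: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(conf)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(acc - mean_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct answer outranks an incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# A model that always reports 0.9 and is right 9 times out of 10.
flat_conf = [0.9] * 10
flat_correct = [1] * 9 + [0]
print(expected_calibration_error(flat_conf, flat_correct), auroc(flat_conf, flat_correct))
```

In this toy case the ECE is essentially zero while the AUROC sits at chance level, because a constant confidence cannot separate correct from incorrect answers. That is why the two metrics are worth reporting together.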

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total long-form responses evaluated	685,000	—	—	all datasets and models	Section 2, 5.1, A.5	Sections 2, A.5
Responses per model (sequence experiment)	57,500	—	—	per model (10 samples × subsampled items)	Section 7.2, A.5	Section 7.2

What To Try In 7 Days

Add calibration plots and AUROC alongside ECE when evaluating model confidences.

Implement small-sample frequency-of-answer (5–10 draws) for high-stakes queries and cluster answers semantically.

Avoid relying on verbalized self-confidence or single-token label probabilities for instruction-tuned models.
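The frequency-of-answer suggestion above can be sketched as follows. The sampling step is omitted, and `normalize` is a hypothetical stand-in for semantic clustering (it only lowercases and strips punctuation), so a real setup would swap in an embedding- or entailment-based grouping. None of these names come from the paper's code.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def normalize(answer: str) -> str:
    """Hypothetical cluster key: trim, lowercase, drop a trailing period."""
    return answer.strip().lower().rstrip(".")


def frequency_of_answer(samples: Sequence[str],
                        cluster_key: Callable[[str], str] = normalize
                        ) -> Tuple[str, float]:
    """Return the largest answer cluster and its share of all samples."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_key(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Five final answers extracted from five sampled generations (made-up values).
draws = ["42", " 42.", "42", "41", "42"]
print(frequency_of_answer(draws))
```

The returned share acts as the confidence score: the more samples agree on one cluster, the higher the score. The cost scales with the number of draws, which is the trade-off the summary notes.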

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/muelphil/llm-uncertainty-bench

Data URLs

https://github.com/muelphil/llm-uncertainty-bench (raw scores and visualizations)

Risks & Boundaries

Limitations

Focus on structured scientific QA and multiple-choice/arithmetic tasks limits generalization to open-ended generation and other domains.

Only normalized, sequence-level UQ methods were benchmarked; many unnormalized or ensemble methods were excluded.

When Not To Use

Do not use token-level probabilities as per-instance uncertainty for instruction-tuned models.

Avoid CCP's multiplicative aggregation for long answers unless the aggregation step is reworked.
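A short numeric sketch shows why multiplying per-claim scores collapses on long answers. The per-claim value of 0.95 is made up, and the geometric mean is included only as one illustrative length-normalized alternative; the source does not say which reworked aggregation, if any, the authors recommend.

```python
import math
from typing import Sequence


def product_aggregate(claim_probs: Sequence[float]) -> float:
    """Multiply per-claim probabilities; shrinks toward 0 as claims are added."""
    return math.prod(claim_probs)


def geometric_mean_aggregate(claim_probs: Sequence[float]) -> float:
    """Length-normalized variant (illustrative only); requires every p > 0."""
    return math.exp(sum(math.log(p) for p in claim_probs) / len(claim_probs))


# Every claim is individually confident, yet the product falls with length.
for n_claims in (1, 10, 50):
    probs = [0.95] * n_claims
    print(n_claims,
          round(product_aggregate(probs), 4),
          round(geometric_mean_aggregate(probs), 4))
```

With the product, an answer's score depends mostly on how many claims it contains rather than on how reliable each claim is, so long and short answers are not comparable on the same scale.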

Failure Modes

Token-probability polarization: nearly all probability mass on a single token hides uncertainty.

ECE confounding: high accuracy or collapsed scores can mask poor instance-level calibration.

Core Entities

Models

gpt-oss-20b, gpt-oss-120b, Llama-3.1-70B, Llama-3.3-70B-Instruct, Mistral-Nemo-Base-2407, Mistral-Nemo-Instruct-2407, Mistral-Small-3.1-24B-Base-2503, Mistral-Small-3.2-24B-Instruct-2506, Magistral-Small-2507, Qwen3-30B-A3B-Base, Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507, DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, gemma-3-27b-it

Metrics

Expected Calibration Error (ECE), AUROC, Frequency of Answer, Verbalized Uncertainty, P(True), Claim-Conditioned Probability (CCP)

Datasets

MMLU, ARC (Easy and Challenge), SciQ, GPQA, GSM8K, GSM-MC, SVAMP, SciBench

Benchmarks

llm-uncertainty-bench (repository benchmark)
