Overview
Production Readiness
0.6
Novelty Score
0.35
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
You can reduce harmful or low-quality outputs by having models score their own answers and abstain when confidence is low; this improves trust without expensive human labels.
Summary TLDR
Sequence-level probabilities from LLMs are poor indicators of free-form answer quality. The paper converts open-ended outputs into token-level self-evaluation tasks (multiple-choice or true/false) and proposes hybrid scoring: select the best candidate, then evaluate it pointwise. On TRUTHFULQA and TL;DR with PaLM-2 and GPT-3, token-level self-evaluation scores substantially improve ranking calibration (Calibration-AUC) and selective generation (abstaining on low-confidence outputs to raise the quality of what is returned) compared to sequence likelihood. Adding self-critique and revision boosts results further. Expect 1–2× extra inference cost.
Problem Statement
Sequence-level likelihoods of LLM outputs do not reliably indicate output quality for free-form generation, so models need a practical way to score their own outputs and decide when to abstain or return "I don't know".
Main Contribution
Reformulate free-form generation scoring as token-level self-evaluation tasks to leverage LLM token calibration.
Define Sample and Select (multi-choice), Sample and Eval (pointwise true/false), and a Hybrid method combining both.
Add a 'NONE OF THE ABOVE' (nota) option to reduce overconfidence when no sampled candidate is correct.
Introduce calibration-focused evaluation for selective generation: Calibration-AUC and Selective-AUC.
Empirically show large gains in calibration and selective AUC on TRUTHFULQA and TL;DR using PaLM-2 and GPT-3.
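The token-level scoring idea behind these contributions can be sketched as follows. This is a minimal illustration, assuming your model API exposes log-probabilities for the option tokens. The function names, the renormalization over option tokens, and the example log-probabilities are assumptions for illustration, not the paper's implementation.

```python
import math


def softmax_over_options(option_logprobs):
    """Renormalize log-probabilities of option tokens so they sum to 1.

    option_logprobs maps an option label (e.g. "A", "B", "NOTA") to the
    log-probability the model assigned to that label's token.
    """
    m = max(option_logprobs.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def pointwise_yes_probability(yes_logprob, no_logprob):
    """Sample-and-Eval style score: p(Yes), renormalized over the Yes/No tokens."""
    return softmax_over_options({"Yes": yes_logprob, "No": no_logprob})["Yes"]


# Hypothetical log-probabilities for a multiple-choice prompt with a NOTA option.
select_probs = softmax_over_options({"A": -0.4, "B": -2.1, "C": -3.0, "NOTA": -2.5})

# Sample-and-Select style choice: the highest-probability real candidate.
best_option = max((k for k in select_probs if k != "NOTA"), key=select_probs.get)
```

Restricting the softmax to the option tokens is a design choice made here for simplicity; using the raw token probabilities is an equally plausible reading.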
Key Findings
Token-level self-evaluation greatly improves calibration versus sequence likelihood on TRUTHFULQA (PaLM-2).
Self-eval methods increase selection accuracy on TRUTHFULQA (PaLM-2).
Combining self-evaluation with self-critique+revise improves both accuracy and calibration further.
Token-level scoring also helps summarization; improvements hold on TL;DR.
Results
Accuracy
Calibration-AUC (TRUTHFULQA, PaLM-2)
Selective-AUC (TRUTHFULQA, PaLM-2)
Calibration-AUC (TL;DR, PaLM-2)
Who Should Care
What To Try In 7 Days
Sample multiple candidate outputs (n=4) from your LLM and implement pointwise self-evaluation prompts to score each candidate.
Add a 'NONE OF THE ABOVE' option so the model can signal that no candidate is correct, and lower the confidence score when the model assigns probability to that option.
Combine a selection pass (choose best candidate) with a pointwise evaluation pass (score chosen answer) to get stable confidence scores.
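The three steps above can be wired together roughly as shown here. This is a sketch under stated assumptions: `sample`, `select`, and `evaluate` are hypothetical callables you would implement against your own LLM API, the 0.5 abstention threshold is arbitrary, and subtracting the NONE OF THE ABOVE probability is just one simple way to penalize confidence.

```python
NOTA = "NONE OF THE ABOVE"


def hybrid_confidence(question, sample, select, evaluate, n=4, threshold=0.5):
    """Return (answer_or_None, score) for one question.

    sample(question, n)        -> list of n candidate answers (hypothetical)
    select(question, options)  -> dict mapping each option to a probability (hypothetical)
    evaluate(question, answer) -> pointwise p(Yes) that the answer is correct (hypothetical)
    """
    candidates = sample(question, n)
    # Selection pass: multiple choice over the candidates plus a NOTA option.
    select_probs = select(question, candidates + [NOTA])
    best = max(candidates, key=lambda c: select_probs.get(c, 0.0))
    # Pointwise pass on the chosen answer, penalized by the NOTA probability (assumption).
    score = evaluate(question, best) - select_probs.get(NOTA, 0.0)
    # Abstain (return None) when the confidence falls below the threshold.
    answer = best if score >= threshold else None
    return answer, score


# Stand-in functions so the sketch runs without a model; replace with real LLM calls.
def fake_sample(question, n):
    return [f"candidate {i}" for i in range(n)]


def fake_select(question, options):
    probs = {opt: 0.1 for opt in options}
    probs[options[1]] = 0.6  # pretend the model prefers the second candidate
    return probs


def fake_evaluate(question, answer):
    return 0.9


answer, score = hybrid_confidence("example question", fake_sample, fake_select, fake_evaluate)
```

In practice the threshold would be tuned on held-out data to match the abstention rate you can tolerate.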
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Adds ~1–2× inference cost (hybrid mode needs an extra evaluation pass).
- Evaluation uses automatic judges (GPT-judge and a reward model) rather than full human labels.
- De-biasing by averaging permutations is computationally expensive and not always used.
- GPT-3 API limitations prevented evaluating every method on that model.
When Not To Use
- When inference cost or latency budgets prohibit extra model calls.
- If you cannot trust the automatic judge or lack reliable ground-truth labels.
- When you need a single-shot low-cost output without sampling multiple candidates.
Failure Modes
- Position bias: candidate ordering can change multi-choice scores unless de-biased.
- Probability dispersion: multiple correct answers dilute softmax mass across true candidates.
- No-true-answer overconfidence: the model is forced to pick among wrong candidates unless a nota option is added.
- Judge bias: using an automatic judge (GPT-judge or reward model) can propagate its errors.
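Position bias is usually countered by averaging multiple-choice scores over candidate orderings. A minimal sketch of that averaging is given here, assuming unique candidate strings and a hypothetical `score_in_order` callable that returns one score per position for a given ordering. It enumerates every permutation, which is why the cost grows as n! (24 scoring calls for n=4).

```python
import itertools


def debiased_select_scores(candidates, score_in_order):
    """Average each candidate's multiple-choice score over all orderings.

    candidates      -> list of unique candidate answers
    score_in_order  -> hypothetical callable: takes an ordered list of candidates
                       and returns a list of scores, one per position
    """
    totals = {c: 0.0 for c in candidates}
    count = 0
    for perm in itertools.permutations(candidates):
        position_scores = score_in_order(list(perm))
        for cand, s in zip(perm, position_scores):
            totals[cand] += s
        count += 1
    return {c: t / count for c, t in totals.items()}


# A scorer with pure position bias: it always favors whatever comes first.
def position_biased_scorer(ordered):
    return [0.4, 0.3, 0.2, 0.1][: len(ordered)]


averaged = debiased_select_scores(["w", "x", "y", "z"], position_biased_scorer)
```

With the purely position-biased scorer, every candidate ends up with the same averaged score, which is the intended effect. Sampling a random subset of permutations is a cheaper variant, at the price of residual bias.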
Core Entities
Models
- PaLM-2 LARGE
- GPT-3 (text-davinci-003)
Metrics
- Accuracy
- Calibration-AUC
- Selective-AUC
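For orientation, the two AUC metrics can be approximated in a few lines. This sketch assumes Calibration-AUC is the AUROC of the confidence score as a predictor of binary correctness, and Selective-AUC is the area under the accuracy-versus-abstention-rate curve, discretized here at every possible cutoff. The exact discretization is an assumption and may differ from the paper's.

```python
def calibration_auc(scores, correct):
    """AUROC of confidence scores against binary correctness labels.

    Computed as the fraction of (correct, incorrect) pairs in which the
    correct output has the higher score; ties count as half.
    """
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selective_auc(scores, correct):
    """Mean accuracy over all abstention levels.

    Examples are ranked by confidence; at each cutoff k only the k most
    confident outputs are kept and their accuracy is recorded.
    """
    ranked = [c for _, c in sorted(zip(scores, correct), key=lambda t: t[0], reverse=True)]
    accuracies = []
    kept_correct = 0
    for k, c in enumerate(ranked, start=1):
        kept_correct += int(c)
        accuracies.append(kept_correct / k)
    return sum(accuracies) / len(accuracies)


# Toy data: the two highest-confidence outputs are the correct ones.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_correct = [1, 1, 0, 0]
toy_cal = calibration_auc(toy_scores, toy_correct)
toy_sel = selective_auc(toy_scores, toy_correct)
```

A well-ordered confidence score pushes both numbers up: Calibration-AUC reaches 1.0 when every correct output outranks every incorrect one, and Selective-AUC rises because accuracy stays high as low-confidence outputs are dropped.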
Datasets
- TRUTHFULQA
- TL;DR
Benchmarks
- Calibration-AUC
- Selective-AUC

