Ask the model to judge its own answers so you can abstain when it's likely wrong

December 14, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.35

Cost Impact Score

0.4

Citation Count

2

Authors

Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan

Why It Matters For Business

You can reduce harmful or low-quality outputs by having models score their own answers and abstain when confidence is low; this improves trust without expensive human labels.

Summary TLDR

Sequence-level probabilities from LLMs are poor indicators of free-form answer quality. The paper converts open-ended outputs into token-level self-evaluation tasks (multiple-choice or true/false) and proposes hybrid scoring (select best candidate, then pointwise-evaluate it). On TRUTHFULQA and TL;DR with PaLM-2 and GPT-3, token-level self-eval scores substantially improve ranking calibration (Calibration-AUC) and selective generation (abstain-to-improve) compared to sequence likelihood. Self-critique + revise further boosts results. Expect 1–2× extra inference cost.

Problem Statement

Sequence-level likelihoods of LLM outputs do not reliably indicate output quality for free-form generation, so we need a practical way for models to score their own outputs to decide when to abstain or return an "I don't know".
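
The abstention decision described above can be sketched as a simple threshold rule. This is a minimal illustration, not the paper's code: the function name `answer_or_abstain` and the default threshold of 0.5 are assumptions made for the example, and in practice the threshold would be tuned on held-out data.

```python
ABSTAIN_MESSAGE = "I don't know"


def answer_or_abstain(answer: str, score: float, threshold: float = 0.5) -> str:
    """Return the answer only when its self-evaluation score clears the threshold.

    `score` is whatever confidence the self-evaluation step produced;
    the 0.5 default is an illustrative assumption, not a recommended value.
    """
    return answer if score >= threshold else ABSTAIN_MESSAGE
```

Raising the threshold trades coverage for quality: fewer questions get answered, but the answers that are returned carry higher self-evaluation scores.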

Main Contribution

Reformulate free-form generation scoring as token-level self-evaluation tasks to leverage LLM token calibration.

Define Sample and Select (multi-choice), Sample and Eval (pointwise true/false), and a Hybrid method combining both.

Add a 'NONE OF THE ABOVE' (nota) option to reduce overconfidence when no sampled candidate is correct.

Introduce calibration-focused evaluation for selective generation: Calibration-AUC and Selective-AUC.

Empirically show large gains in calibration and selective AUC on TRUTHFULQA and TL;DR using PaLM-2 and GPT-3.
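
To make the two metrics concrete, here is a rough sketch of how they could be computed from per-example confidence scores and binary correctness labels. It assumes Calibration-AUC is the AUROC of the score as a predictor of correctness, and Selective-AUC is the mean accuracy over all coverage levels when the lowest-scored answers are dropped first. Both readings are assumptions based on the metric names; check the paper for the exact definitions. The function names are hypothetical.

```python
from typing import List, Sequence


def calibration_auc(scores: Sequence[float], correct: Sequence[int]) -> float:
    """AUROC: chance that a correct answer outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selective_auc(scores: Sequence[float], correct: Sequence[int]) -> float:
    """Mean accuracy over coverage levels: keep the k highest-scored answers, k = 1..n."""
    if len(scores) == 0:
        raise ValueError("need at least one example")
    # Sort correctness labels by descending confidence score.
    ranked = [c for _, c in sorted(zip(scores, correct), key=lambda sc: sc[0], reverse=True)]
    hits = 0
    accuracies: List[float] = []
    for k, c in enumerate(ranked, start=1):
        hits += c
        accuracies.append(hits / k)  # accuracy when only the top-k answers are kept
    return sum(accuracies) / len(accuracies)
```

A score that ranks every correct answer above every incorrect one gives a Calibration-AUC of 1.0, while a score unrelated to correctness lands near 0.5.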

Key Findings

Token-level self-evaluation greatly improves calibration versus sequence likelihood on TRUTHFULQA (PaLM-2).

Numbers: Calibration-AUC: sequence 39.80% -> Hybrid w/ nota 75.34%

Self-eval methods increase selection accuracy on TRUTHFULQA (PaLM-2).

Numbers: Accuracy: sequence 48.23% -> Sample and Eval 59.12% (+10.9pp)

Combining self-evaluation with self-critique+revise improves both accuracy and calibration further.

Numbers: Sample and Eval after revise: Accuracy 66.34%, Calibration-AUC 70.55% (Table 2)

Token-level scoring also helps summarization; improvements hold on TL;DR.

Numbers: TL;DR Calibration-AUC: sequence 49.75% -> Sample and Eval w/ candidates 55.19%

Results

Accuracy (TRUTHFULQA, PaLM-2)

Value: Sequence likelihood 48.23% | Sample and Eval 59.12% | Hybrid w/ nota 58.14%

Baseline: Sequence likelihood 48.23%

Calibration-AUC (TRUTHFULQA, PaLM-2)

Value: Sequence likelihood 39.80% | Sample and Eval 73.79% | Hybrid w/ nota 75.34%

Baseline: Sequence likelihood 39.80%

Selective-AUC (TRUTHFULQA, PaLM-2)

Value: Sequence likelihood 33.63% | Sample and Eval 58.19% | Hybrid w/ nota 58.10%

Baseline: Sequence likelihood 33.63%

Calibration-AUC (TL;DR, PaLM-2)

Value: Sequence likelihood 49.75% | Sample and Eval w/ candidates 55.19%

Baseline: Sequence likelihood 49.75%

Who Should Care

What To Try In 7 Days

Sample multiple candidate outputs (n=4) from your LLM and implement pointwise self-evaluation prompts to score each candidate.

Add a 'NONE OF THE ABOVE' option to let the model signal uncertainty and penalize confidence accordingly.

Combine a selection pass (choose best candidate) with a pointwise evaluation pass (score chosen answer) to get stable confidence scores.
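
The three steps above can be sketched as one function. This is a hedged illustration rather than the paper's implementation: `option_logits` and `p_true` are hypothetical callables standing in for your own LLM calls (the first returns one logit per multiple-choice option, the second returns the probability of "true" for a single answer), and subtracting the 'NONE OF THE ABOVE' probability is just one simple way to apply the penalty.

```python
import math
from typing import Callable, List, Sequence, Tuple

NOTA = "NONE OF THE ABOVE"


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over option logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_score(
    question: str,
    candidates: Sequence[str],
    option_logits: Callable[[str, Sequence[str]], Sequence[float]],
    p_true: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Select the best candidate (multi-choice with nota), then score it pointwise."""
    options = list(candidates) + [NOTA]
    probs = softmax(option_logits(question, options))
    p_nota = probs[-1]  # mass the model puts on "no candidate is correct"
    best = max(range(len(candidates)), key=lambda i: probs[i])
    # Assumption: penalize the pointwise score by subtracting the nota probability.
    score = p_true(question, candidates[best]) - p_nota
    return candidates[best], score
```

With n=4 sampled candidates this costs one selection call plus one pointwise evaluation call on top of sampling. The returned score can then be compared against an abstention threshold.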

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Adds ~1–2× inference cost (hybrid mode needs extra evaluation pass).
  • Evaluation uses automatic judges (GPT-judge and a reward model) rather than full human labels.
  • De-biasing by averaging permutations is computationally expensive and not always used.
  • Some GPT-3 API limitations prevented evaluating all methods on that model.

When Not To Use

  • When inference cost or latency budgets prohibit extra model calls.
  • If you cannot trust the automatic judge or lack reliable ground-truth labels.
  • When you need a single-shot low-cost output without sampling multiple candidates.

Failure Modes

  • Position bias: candidate ordering can change multi-choice scores unless de-biased.
  • Probability dispersion: multiple correct answers dilute softmax mass across true candidates.
  • No-true-answer overconfidence: model forced to pick among wrong candidates unless nota is added.
  • Judge bias: using an automatic judge (GPT-judge or reward model) can propagate its errors.
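
Position bias can be reduced by averaging multi-choice scores over candidate orderings. The sketch below shows one way to do that under stated assumptions: `select_probs` is a hypothetical callable wrapping your LLM that returns one probability per position for a given ordering, and candidates are assumed to be distinct strings. It enumerates every ordering, so it needs n! calls (24 for n=4), which is why this step is expensive.

```python
from itertools import permutations
from typing import Callable, Dict, Sequence


def debiased_select_probs(
    question: str,
    candidates: Sequence[str],
    select_probs: Callable[[str, Sequence[str]], Sequence[float]],
) -> Dict[str, float]:
    """Average each candidate's multi-choice probability over every ordering."""
    totals = {c: 0.0 for c in candidates}
    n_orderings = 0
    for ordering in permutations(candidates):
        probs = select_probs(question, ordering)
        # Credit each probability to the candidate sitting at that position.
        for cand, p in zip(ordering, probs):
            totals[cand] += p
        n_orderings += 1
    return {c: t / n_orderings for c, t in totals.items()}
```

If the scorer depends only on position and not on content, the averaged probabilities come out uniform, which shows the positional preference has been cancelled. A cheaper variant would sample a few random orderings instead of enumerating all of them.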

Core Entities

Models

  • PaLM-2 LARGE
  • GPT-3 (text-davinci-003)

Metrics

  • Accuracy
  • Calibration-AUC
  • Selective-AUC

Datasets

  • TRUTHFULQA
  • TL;DR

Benchmarks

  • Calibration-AUC
  • Selective-AUC