Ask the model to judge its own answers so you can abstain when it's likely wrong

December 14, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is simple to implement with existing LLM APIs and shows consistent calibration gains on two public benchmarks, but it raises inference cost and depends on automatic judges for labels.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 35%

Authors

Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan

Why It Matters For Business

You can reduce harmful or low-quality outputs by having models score their own answers and abstain when confidence is low; this improves trust without expensive human labels.

Summary TLDR

Sequence-level probabilities from LLMs are poor indicators of free-form answer quality. The paper converts open-ended outputs into token-level self-evaluation tasks (multiple-choice or true/false) and proposes hybrid scoring (select best candidate, then pointwise-evaluate it). On TRUTHFULQA and TL;DR with PaLM-2 and GPT-3, token-level self-eval scores substantially improve ranking calibration (Calibration-AUC) and selective generation (abstain-to-improve) compared to sequence likelihood. Self-critique + revise further boosts results. Expect 1–2× extra inference cost.

Problem Statement

Sequence-level likelihoods of LLM outputs do not reliably indicate output quality for free-form generation, so we need a practical way for models to score their own outputs to decide when to abstain or return an "I don't know".
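The abstain decision itself can be a plain threshold on whatever confidence score the model produces. The sketch below is illustrative only: the function name, the fallback wording, and the default threshold of 0.5 are assumptions, not values from the paper, and the threshold would need tuning on held-out data.

```python
ABSTAIN_MESSAGE = "I don't know"  # fallback wording is an assumption


def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.5) -> str:
    """Return the answer only when its confidence score clears the threshold."""
    if confidence >= threshold:
        return answer
    return ABSTAIN_MESSAGE


# Usage: a high-confidence answer is returned, a low-confidence one is withheld.
print(answer_or_abstain("Paris", 0.92))
print(answer_or_abstain("Paris", 0.18))
```

Raising the threshold trades coverage (how often the system answers) for quality on the answers it does return.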

Main Contribution

Reformulate free-form generation scoring as token-level self-evaluation tasks to leverage LLM token calibration.

Define Sample and Select (multi-choice), Sample and Eval (pointwise true/false), and a Hybrid method combining both.
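To make the two task formats concrete, here is a minimal sketch of how the prompts might be built. The template wording, function names, and option lettering are hypothetical; the paper's actual prompts may differ. The point is that each format ends where a single token (a letter) carries the model's judgment, so its token probability can serve as the score.

```python
import string
from typing import List, Tuple

NOTA = "NONE OF THE ABOVE"


def build_select_prompt(
    question: str, candidates: List[str], add_nota: bool = False
) -> Tuple[str, List[str]]:
    """Multi-choice format (Sample and Select): the model picks one letter."""
    options = list(candidates)
    if add_nota:
        options.append(NOTA)
    letters = list(string.ascii_uppercase[: len(options)])
    lines = [f"Question: {question}", "Candidate answers:"]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Which candidate is the best answer? Reply with a single letter.")
    lines.append("Answer:")
    return "\n".join(lines), letters


def build_eval_prompt(question: str, candidate: str) -> str:
    """Pointwise format (Sample and Eval): the model answers Yes or No."""
    return "\n".join(
        [
            f"Question: {question}",
            f"Proposed answer: {candidate}",
            "Is the proposed answer correct?",
            "A. Yes",
            "B. No",
            "Answer:",
        ]
    )
```

With these templates, the Sample and Select score for a candidate would be the probability of its letter, and the Sample and Eval score would be the probability of the letter for "Yes".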

Key Findings

Token-level self-evaluation greatly improves calibration versus sequence likelihood on TRUTHFULQA (PaLM-2).

Numbers: Calibration-AUC: sequence 39.80% -> Hybrid w/ nota 75.34%

Practical Use: Use self-evaluation (hybrid + 'NONE OF THE ABOVE') to rank outputs better and abstain on low-confidence answers.

Evidence Ref: Table 1 (TRUTHFULQA, PaLM-2)
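For checking a number like this on your own data, the sketch below assumes Calibration-AUC can be read as the AUROC of confidence scores against binary correctness labels; the paper's exact definition should be confirmed before comparing results. The function name is illustrative, and the pairwise loop is quadratic, which is fine for small evaluation sets.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """Probability that a correct answer outscores an incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Usage: confidence scores paired with 1 (correct) or 0 (incorrect) labels.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

A value near 0.5 means the score ranks correct and incorrect answers no better than chance.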

Self-eval methods increase selection accuracy on TRUTHFULQA (PaLM-2).

Numbers: Accuracy: sequence 48.23% -> Sample and Eval 59.12% (+10.89pp)

Practical Use: Sampling multiple candidates and scoring them with a pointwise self-eval yields more correct chosen answers than sequence likelihood.

Evidence Ref: Table 1 (TRUTHFULQA, PaLM-2)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | Sequence likelihood 48.23%; Sample and Eval 59.12%; Hybrid w/ nota 58.14% | Sequence likelihood 48.23% | Sample and Eval +10.89pp vs sequence | TRUTHFULQA (validation) | Table 1 (TRUTHFULQA PALM-2) | Table 1 |
| Calibration-AUC (TRUTHFULQA, PaLM-2) | Sequence likelihood 39.80%; Sample and Eval 73.79%; Hybrid w/ nota 75.34% | Sequence likelihood 39.80% | Hybrid w/ nota +35.54pp vs sequence | TRUTHFULQA (validation) | Table 1 (TRUTHFULQA PALM-2) | Table 1 |

What To Try In 7 Days

Sample multiple candidate outputs (n=4) from your LLM and implement pointwise self-evaluation prompts to score each candidate.

Add a 'NONE OF THE ABOVE' option so the model can signal that no candidate is good, and lower the final confidence score when it puts high probability on that option.

Combine a selection pass (choose best candidate) with a pointwise evaluation pass (score chosen answer) to get stable confidence scores.
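The three steps above can be wired together as follows. This is a sketch under stated assumptions: `select_probs` and `eval_yes_prob` are hypothetical wrappers around your LLM API that return token probabilities, and subtracting the 'NONE OF THE ABOVE' probability is one simple way to apply the penalty, not necessarily the paper's exact formula.

```python
from typing import Callable, List, Tuple

NOTA = "NONE OF THE ABOVE"


def hybrid_score(
    question: str,
    candidates: List[str],
    select_probs: Callable[[str, List[str]], List[float]],  # hypothetical LLM wrapper
    eval_yes_prob: Callable[[str, str], float],  # hypothetical LLM wrapper
    use_nota: bool = True,
) -> Tuple[str, float]:
    """Pick the best candidate with a multi-choice pass, then score it pointwise."""
    if not candidates:
        raise ValueError("need at least one candidate")
    options = (candidates + [NOTA]) if use_nota else list(candidates)
    probs = select_probs(question, options)  # one probability per option, same order
    # Selection pass: choose among the real candidates only, never NOTA itself.
    best_idx = max(range(len(candidates)), key=lambda i: probs[i])
    best = candidates[best_idx]
    # Evaluation pass: probability that the chosen answer is judged correct.
    score = eval_yes_prob(question, best)
    if use_nota:
        # Assumed penalty: reduce confidence by the mass placed on NOTA.
        score -= probs[-1]
    return best, score
```

The returned score can then be compared against an abstention threshold tuned on held-out data.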

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Adds ~1–2× inference cost (hybrid mode needs extra evaluation pass).

Evaluation uses automatic judges (GPT-judge and a reward model) rather than full human labels.

When Not To Use

When inference cost or latency budgets prohibit extra model calls.

If you cannot trust the automatic judge or lack reliable ground-truth labels.

Failure Modes

Position bias: candidate ordering can change multi-choice scores unless de-biased.

Probability dispersion: when several candidates are correct, the multi-choice softmax splits probability mass among them, so each one looks less confident than it should.
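One common way to reduce position bias is to score the candidates under several random orderings and average each candidate's probability. The sketch below assumes a hypothetical `select_probs` wrapper that returns one probability per option in the order given; the number of shuffles and the seed are arbitrary choices, and each shuffle costs one extra model call.

```python
import random
from typing import Callable, List


def debiased_select_probs(
    question: str,
    candidates: List[str],
    select_probs: Callable[[str, List[str]], List[float]],  # hypothetical LLM wrapper
    n_shuffles: int = 8,
    seed: int = 0,
) -> List[float]:
    """Average each candidate's multi-choice probability over shuffled orderings."""
    rng = random.Random(seed)
    totals = [0.0] * len(candidates)
    for _ in range(n_shuffles):
        order = list(range(len(candidates)))
        rng.shuffle(order)
        shuffled = [candidates[i] for i in order]
        probs = select_probs(question, shuffled)
        # Map each position's probability back to the candidate's original index.
        for position, original_index in enumerate(order):
            totals[original_index] += probs[position]
    return [total / n_shuffles for total in totals]
```

Averaging addresses ordering effects only; it does not fix dispersion, which is the case a pointwise evaluation pass on the selected answer is meant to handle.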

Core Entities

Models

PaLM-2 LARGE, GPT-3 (text-davinci-003)

Metrics

Accuracy, Calibration-AUC, Selective-AUC

Datasets

TRUTHFULQA, TL;DR

Benchmarks

Calibration-AUC, Selective-AUC