Ask the model to judge its own answers so you can abstain when it's likely wrong

Overview

Decision Snapshot: Needs Validation

The approach is simple to implement with existing LLM APIs and shows consistent calibration gains on two public benchmarks, but it raises inference cost and depends on automatic judges for labels.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 35%

Authors

Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan

Why It Matters For Business

You can reduce harmful or low-quality outputs by having models score their own answers and abstain when confidence is low; this improves trust without expensive human labels.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

Sequence-level probabilities from LLMs are poor indicators of free-form answer quality. The paper converts open-ended outputs into token-level self-evaluation tasks (multiple-choice or true/false) and proposes hybrid scoring (select best candidate, then pointwise-evaluate it). On TRUTHFULQA and TL;DR with PaLM-2 and GPT-3, token-level self-eval scores substantially improve ranking calibration (Calibration-AUC) and selective generation (abstain-to-improve) compared to sequence likelihood. Self-critique + revise further boosts results. Expect 1–2× extra inference cost.

Problem Statement

Sequence-level likelihoods of LLM outputs do not reliably indicate output quality for free-form generation, so we need a practical way for models to score their own outputs to decide when to abstain or return an "I don't know".
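To make the problem concrete, here is a minimal sketch of sequence-level scoring, assuming you already have per-token log-probabilities for each candidate answer. The numbers are invented for illustration and do not come from the paper. It shows one well-known weakness of the raw score: the summed log-probability falls as an answer gets longer, independent of whether the answer is correct.

```python
import math

def sequence_logprob(token_logprobs):
    """Sequence-level score: the sum of per-token log-probabilities."""
    return sum(token_logprobs)

def length_normalized_logprob(token_logprobs):
    """A common variant: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical token log-probs for two candidate answers to one question.
short_answer = [math.log(0.6), math.log(0.5)]  # 2 tokens, moderately confident
long_answer = [math.log(0.8)] * 6              # 6 tokens, each more confident

short_seq = sequence_logprob(short_answer)
long_seq = sequence_logprob(long_answer)

# The raw sum ranks the short answer higher purely because it is shorter,
# while the per-token average reverses the ranking.
print("raw sum:", short_seq, long_seq)
print("per-token:", length_normalized_logprob(short_answer),
      length_normalized_logprob(long_answer))
```

Neither score asks the model whether the answer is right, which is the gap that self-evaluation is meant to fill.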

Main Contribution

Reformulate free-form generation scoring as token-level self-evaluation tasks to leverage LLM token calibration.

Define Sample and Select (multi-choice), Sample and Eval (pointwise true/false), and a Hybrid method combining both.
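The three scoring modes can be sketched as plain functions over token logits. This is an illustrative sketch, not the authors' implementation: the function names, the option labels, and the assumption that your LLM API exposes logits for the option tokens and for "Yes"/"No" are all hypothetical.

```python
import math

def softmax(logits):
    """Normalize a dict of logits into probabilities."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def sample_and_select(option_logits):
    """Multi-choice: probability of each option token (A, B, C, ...)."""
    return softmax(option_logits)

def sample_and_eval(yes_logit, no_logit):
    """Pointwise true/false: probability mass on 'Yes' versus 'No'."""
    return softmax({"yes": yes_logit, "no": no_logit})["yes"]

def hybrid(option_logits, eval_logits):
    """Select the best candidate by multi-choice, then score it pointwise.

    eval_logits maps each option label to a (yes_logit, no_logit) pair.
    """
    probs = sample_and_select(option_logits)
    best = max(probs, key=probs.get)
    return best, sample_and_eval(*eval_logits[best])

# Hypothetical logits for three sampled candidates.
options = {"A": 2.0, "B": 0.5, "C": 0.1}
pointwise = {"A": (1.5, 0.0), "B": (0.0, 0.0), "C": (0.0, 1.0)}
print(hybrid(options, pointwise))
```

The multi-choice pass compares candidates against each other, while the pointwise pass gives a score that does not depend on which other candidates were sampled; the hybrid uses each for what it does best.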

Key Findings

Token-level self-evaluation greatly improves calibration versus sequence likelihood on TRUTHFULQA (PaLM-2).

Numbers: Calibration-AUC: sequence likelihood 39.80% -> Hybrid w/ nota 75.34% (+35.54pp)

Practical Use: Use self-evaluation (hybrid + 'NONE OF THE ABOVE') to rank outputs more reliably and abstain on low-confidence answers.

Evidence Ref: Table 1 (TRUTHFULQA, PaLM-2)

Self-eval methods increase selection accuracy on TRUTHFULQA (PaLM-2).

Numbers: Accuracy: sequence likelihood 48.23% -> Sample and Eval 59.12% (+10.89pp)

Practical Use: Sampling multiple candidates and scoring them with a pointwise self-eval selects the correct answer more often than sequence likelihood does.

Evidence Ref: Table 1 (TRUTHFULQA, PaLM-2)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (TRUTHFULQA, PaLM-2)	Sequence likelihood 48.23%; Sample and Eval 59.12%; Hybrid w/ nota 58.14%	Sequence likelihood 48.23%	Sample and Eval +10.89pp vs sequence likelihood	TRUTHFULQA (validation)	Table 1 (TRUTHFULQA, PaLM-2)	Table 1
Calibration-AUC (TRUTHFULQA, PaLM-2)	Sequence likelihood 39.80%; Sample and Eval 73.79%; Hybrid w/ nota 75.34%	Sequence likelihood 39.80%	Hybrid w/ nota +35.54pp vs sequence likelihood	TRUTHFULQA (validation)	Table 1 (TRUTHFULQA, PaLM-2)	Table 1
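If you want to track similar numbers on your own data, the sketch below shows one way to compute them. It assumes Calibration-AUC is the area under the ROC curve of the confidence score against binary correctness; that is an assumption, since this summary does not define the metric. The `selective_accuracy` helper and the toy scores and labels are hypothetical.

```python
def auroc(scores, labels):
    """Chance that a random correct answer outscores a random incorrect one.

    Ties count as half a win. Assumed here to stand in for Calibration-AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def selective_accuracy(scores, labels, abstain_fraction):
    """Accuracy on answers kept after abstaining on the lowest-scored fraction."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    keep = max(1, round(len(ranked) * (1 - abstain_fraction)))
    kept = ranked[:keep]
    return sum(y for _, y in kept) / len(kept)

# Toy data: confidence scores and whether each answer was judged correct.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]
print(auroc(scores, labels))
print(selective_accuracy(scores, labels, abstain_fraction=0.5))
```

Sweeping `abstain_fraction` from 0 towards 1 traces the accuracy-versus-abstention curve; a better confidence score makes that curve rise faster.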

What To Try In 7 Days

Sample multiple candidate outputs (n=4) from your LLM and implement pointwise self-evaluation prompts to score each candidate.

Add a 'NONE OF THE ABOVE' option to let the model signal uncertainty and penalize confidence accordingly.

Combine a selection pass (choose best candidate) with a pointwise evaluation pass (score chosen answer) to get stable confidence scores.
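The three steps above can be wired together roughly as follows. This is a sketch under stated assumptions: `TokenProbs` is a hypothetical adapter around whatever LLM API you use, the prompt wording is invented, and subtracting the 'NONE OF THE ABOVE' probability from the pointwise score is one plausible way to "penalize confidence", not a formula confirmed by this summary. Candidate sampling (for example n=4) is assumed to happen before this code runs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: given a prompt and the allowed next tokens,
# return a probability for each allowed token (summing to 1).
TokenProbs = Callable[[str, List[str]], Dict[str, float]]

def select_prompt(question: str, candidates: List[str]) -> Tuple[str, List[str]]:
    """Build the multi-choice prompt; the last letter is NONE OF THE ABOVE."""
    letters = [chr(ord("A") + i) for i in range(len(candidates) + 1)]
    lines = [f"Question: {question}", "Candidate answers:"]
    lines += [f"{letter}. {cand}" for letter, cand in zip(letters, candidates)]
    lines.append(f"{letters[-1]}. NONE OF THE ABOVE")
    lines.append("Which candidate is best? Answer with a single letter:")
    return "\n".join(lines), letters

def eval_prompt(question: str, answer: str) -> str:
    """Build the pointwise true/false prompt for one answer."""
    return (f"Question: {question}\nProposed answer: {answer}\n"
            "Is the proposed answer correct? Answer Yes or No:")

def hybrid_with_nota(question: str, candidates: List[str],
                     token_probs: TokenProbs,
                     threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Select the best candidate, score it pointwise, abstain below threshold."""
    prompt, letters = select_prompt(question, candidates)
    choice = token_probs(prompt, letters)
    best_idx = max(range(len(candidates)), key=lambda i: choice[letters[i]])
    best = candidates[best_idx]
    p_yes = token_probs(eval_prompt(question, best), ["Yes", "No"])["Yes"]
    score = p_yes - choice[letters[-1]]  # assumption: subtract NOTA probability
    return (best if score >= threshold else None), score

def fake_token_probs(prompt: str, allowed: List[str]) -> Dict[str, float]:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if allowed == ["Yes", "No"]:
        return {"Yes": 0.9, "No": 0.1}
    rest = 0.3 / (len(allowed) - 1)
    return {tok: (0.7 if i == 0 else rest) for i, tok in enumerate(allowed)}

answer, score = hybrid_with_nota(
    "What is the capital of France?",
    ["Paris", "Lyon", "Marseille", "Nice"],
    fake_token_probs,
)
print(answer, score)
```

Returning `None` is the abstain signal; the right `threshold` depends on how much coverage you are willing to trade for accuracy and should be tuned on held-out data.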

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Adds ~1–2× inference cost (hybrid mode needs extra evaluation pass).

Evaluation uses automatic judges (GPT-judge and a reward model) rather than full human labels.

When Not To Use

When inference cost or latency budgets prohibit extra model calls.

If you cannot trust the automatic judge or lack reliable ground-truth labels.

Failure Modes

Position bias: candidate ordering can change multi-choice scores unless de-biased.

Probability dispersion: multiple correct answers dilute softmax mass across true candidates.
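One common mitigation for position bias is to score the same candidates under several orderings and average the result per candidate. The sketch below uses cyclic rotations so that every candidate occupies every position exactly once; it is a generic technique and not necessarily the paper's exact procedure. `score_ordering` is a hypothetical wrapper around your multi-choice call, and candidates are assumed to be distinct strings.

```python
from typing import Callable, Dict, List

def rotations(items: List[str]) -> List[List[str]]:
    """All cyclic shifts of the list: each item visits each position once."""
    return [items[i:] + items[:i] for i in range(len(items))]

def debiased_choice_probs(
    candidates: List[str],
    score_ordering: Callable[[List[str]], List[float]],
) -> Dict[str, float]:
    """Average each candidate's multi-choice probability across rotations.

    score_ordering(ordered) must return probabilities aligned with `ordered`.
    """
    totals = {cand: 0.0 for cand in candidates}
    orderings = rotations(candidates)
    for ordering in orderings:
        for cand, prob in zip(ordering, score_ordering(ordering)):
            totals[cand] += prob
    return {cand: total / len(orderings) for cand, total in totals.items()}

def position_biased_scorer(ordered: List[str]) -> List[float]:
    """Stand-in scorer that always favors whatever is listed first."""
    rest = 0.3 / (len(ordered) - 1)
    return [0.7] + [rest] * (len(ordered) - 1)

probs = debiased_choice_probs(["w", "x", "y", "z"], position_biased_scorer)
print(probs)
```

With a scorer that only cares about position, the averaged probabilities come out equal, which is the desired behavior. The price is one multi-choice call per rotation, on top of the extra inference cost already noted under Limitations.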

Core Entities

Models

PaLM-2 LARGE, GPT-3 (text-davinci-003)

Metrics

Accuracy, Calibration-AUC, Selective-AUC

Datasets

TRUTHFULQA, TL;DR

Benchmarks

Calibration-AUC, Selective-AUC
