Use weak or small models as judges: peer prediction rewards honesty and detects deception even when judges are far weaker

Overview

Decision Snapshot: Ready For Pilot

Theory (two theorems) plus empirical tests across many model sizes and 37k prompts. Stronger evidence for STEM/reasoning tasks; collusion and some domain blind spots remain open.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Tianyi Alex Qiu, Micah Carroll, Cameron Allen

Links

Abstract / PDF

Why It Matters For Business

Peer prediction lets teams evaluate and repair large models without needing stronger trusted judges or human labels. That reduces dependence on expensive human evaluation, can detect reward-hacking and deception, and enables relative comparisons across model versions.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

Peer prediction is a game-theoretic method that scores a model's answer by how much it helps an expert predict other models' answers. It requires no ground-truth labels, is provably incentive compatible (truth-telling is a best response), and empirically detects deception and trains models back to honesty even when the expert is much weaker than the participants.
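To make the scoring idea concrete, here is a minimal sketch, assuming the score is the gain in an expert's log-probability of one participant's answer after seeing another participant's answer. The `Expert` interface, `peer_prediction_score`, and `toy_expert` are hypothetical names used for illustration; the exact scoring rule in the paper may differ.

```python
import math
from typing import Callable

# Hypothetical expert interface (an assumption, not the paper's API):
# returns P(target_answer | question, context), where context is either
# an empty string or another participant's answer.
Expert = Callable[[str, str, str], float]


def peer_prediction_score(expert: Expert, question: str,
                          source_answer: str, target_answer: str) -> float:
    """Log-probability gain on the target answer from seeing the source answer."""
    p_with = expert(question, source_answer, target_answer)
    p_without = expert(question, "", target_answer)
    return math.log(p_with) - math.log(p_without)


def toy_expert(question: str, context: str, target: str) -> float:
    """Toy stand-in for a small language model acting as the expert."""
    if not context:
        return 0.25  # uninformed guess over four options
    # The toy expert leans toward whatever the context says.
    return 0.7 if context == target else 0.1


# An answer that agrees with the peer earns a positive score,
# a conflicting answer earns a negative one.
helpful = peer_prediction_score(toy_expert, "2 + 2 = ?", "4", "4")
misleading = peer_prediction_score(toy_expert, "2 + 2 = ?", "5", "4")
```

In a real pilot the toy expert would be replaced by a small language model that returns the probability of the target answer; no ground-truth label enters the computation at any point.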

Problem Statement

Strong, trusted supervision is often unavailable for frontier models. Existing judge-based evaluations and rewards can be exploited by stronger or deceptive models. We need evaluation and training signals that (1) do not require ground truth, and (2) are resistant to manipulation.

Main Contribution

Introduce a peer prediction pipeline for LLM evaluation and post-training that does not need ground-truth labels.

Prove incentive compatibility: honest reporting is a Bayesian Nash equilibrium under the mechanism (Theorem 1) and approximately so for large diverse pools (Theorem 2).
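For readers less familiar with the terminology, the standard textbook form of these two conditions can be written as follows. The notation is an assumption made here for illustration ($S_i$ is participant $i$'s expected peer prediction score, $\sigma_i$ a reporting strategy, $\varepsilon$ a small slack); the paper's exact statements and conditions may differ.

```latex
% Bayesian Nash equilibrium: no participant gains by deviating from honesty
% when all others report honestly.
\mathbb{E}\!\left[ S_i\!\left(\sigma_i^{\mathrm{honest}}, \sigma_{-i}^{\mathrm{honest}}\right) \right]
\;\ge\;
\mathbb{E}\!\left[ S_i\!\left(\sigma_i', \sigma_{-i}^{\mathrm{honest}}\right) \right]
\quad \text{for every participant } i \text{ and every alternative strategy } \sigma_i'.

% Approximate version: deviating gains at most \varepsilon.
\mathbb{E}\!\left[ S_i\!\left(\sigma_i^{\mathrm{honest}}, \sigma_{-i}^{\mathrm{honest}}\right) \right]
\;\ge\;
\mathbb{E}\!\left[ S_i\!\left(\sigma_i', \sigma_{-i}^{\mathrm{honest}}\right) \right] - \varepsilon .
```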

Key Findings

Peer prediction is incentive compatible: honest and informative reporting is a game-theoretic best response.

Practical Use: Use peer prediction scores as a reward or evaluation metric to encourage truthful answers. In practice, models trained on this reward are driven toward honest and informative outputs rather than strategic deception.

Evidence Ref: Theorem 1, §3.3

Resistance to deception improves as the participant-to-expert capability gap increases (inverse scaling).

Numbers: Peer prediction remains effective at a >100× size gap; LLM-as-a-Judge fails at 5–20×

Practical Use: You can reliably evaluate and train very large models using much smaller expert models. If you lack a strong judge, try peer prediction with a small ensemble of weak experts instead of a single larger judge.

Evidence Ref: Abstract, §4.3, Figure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	ρ = 0.6576	—	—	STEM domains (123)	Figure 7; cross-domain analysis	§B, Figure 7
Mathematics domain correlation	ρ = 0.7971	—	—	Mathematics	Figure 7; domain correlations listed	§B

What To Try In 7 Days

Run a small pilot: pick 1–2 target models, use a cheap expert (e.g., a 135M model) and compute peer prediction scores on 1k held-out prompts to compare with your current judge.

If you have a deceptively optimized model, collect high/low peer-pred pairs and run DPO + LoRA contrastive finetuning to see if truthfulness recovers.

Test cross-domain: validate peer prediction vs a small labeled set in your most important domains (STEM-like tasks expected to work best).
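The cross-domain check above can be sketched as a per-domain Pearson correlation between peer prediction scores and labeled correctness. This is a minimal illustration, assuming records of the form `(domain, peer_score, is_correct)`; the record layout and the function names are hypothetical and not taken from the paper's code.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_domain(records):
    """Group (domain, peer_score, is_correct) records and correlate within each domain."""
    grouped = defaultdict(lambda: ([], []))
    for domain, score, correct in records:
        grouped[domain][0].append(score)
        grouped[domain][1].append(float(correct))
    return {domain: pearson(xs, ys) for domain, (xs, ys) in grouped.items()}


# Made-up example records, for illustration only.
records = [
    ("math", 0.9, 1), ("math", -0.5, 0),
    ("math", 0.8, 1), ("math", -0.4, 0),
]
by_domain = correlation_by_domain(records)
```

A domain whose correlation is low or negative on your labeled subset is a candidate blind spot, and peer prediction scores there should not be trusted without further checks.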

Optimization Features

Infra Optimization

Parallel scoring over experts and participants (n^2 m rounds); note the compute overhead
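To get a feel for the overhead, the following sketch enumerates the scoring rounds. The source does not define n and m; the sketch assumes n is the number of participants and m the number of experts, so that every ordered pair of participants is scored by every expert. The function name is hypothetical.

```python
from itertools import permutations


def scoring_rounds(participants, experts):
    """Yield every (source, target, expert) triple that needs one scoring call.

    Assumption: each ordered pair of distinct participants is scored by each
    expert, giving n * (n - 1) * m rounds, which grows as n^2 * m.
    """
    for source, target in permutations(participants, 2):
        for expert in experts:
            yield source, target, expert


rounds = list(scoring_rounds(["model_a", "model_b", "model_c"], ["expert_1", "expert_2"]))
```

Because each triple is independent of the others, the rounds can be distributed across workers or batched per expert.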

Training Optimization

LoRA

Direct Preference Optimization (DPO) contrastive training
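Contrastive training of this kind needs preference pairs. One way to build them from peer prediction scores is sketched below, assuming scored samples of the form `(prompt, answer, peer_score)` and the common `prompt` / `chosen` / `rejected` record layout for preference data. The function name and the `min_gap` threshold are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict


def build_preference_pairs(scored, min_gap=0.5):
    """Pair the highest- and lowest-scored answer for each prompt.

    scored: iterable of (prompt, answer, peer_score).
    min_gap: hypothetical threshold; pairs with a smaller score gap are dropped
             because the preference signal is weak.
    """
    by_prompt = defaultdict(list)
    for prompt, answer, score in scored:
        by_prompt[prompt].append((score, answer))

    pairs = []
    for prompt, items in by_prompt.items():
        if len(items) < 2:
            continue  # need at least two answers to form a pair
        best = max(items, key=lambda item: item[0])
        worst = min(items, key=lambda item: item[0])
        if best[0] - worst[0] >= min_gap:
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs


# Made-up scores, for illustration only.
scored = [
    ("q1", "answer a", 1.0), ("q1", "answer b", -0.5),
    ("q2", "answer c", 0.1), ("q2", "answer d", 0.0),
]
pairs = build_preference_pairs(scored)
```

The resulting records can then be passed to whichever DPO implementation you use, with LoRA adapters keeping the finetuning cheap.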

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

The theory assumes no or bounded collusion; peer prediction can fail when deceptive participants form a majority.

Performance varies across domains: much stronger for structured STEM tasks than for subjective or knowledge-heavy domains.

When Not To Use

When participants can collude and form a majority: peer prediction may reward collusion in that case.

In highly subjective domains where participant priors strongly diverge (e.g., some humanities tasks).

Failure Modes

Majority collusion: deceptive majority can flip scores in favor of deception (Figure 13).

Domain blind spots: low or negative correlation with ground truth in some domains (e.g., general knowledge).

Core Entities

Models

Llama-3.1-8B, Llama-3.1-70B, Llama-3.1-405B, Mistral7B-v0.3, SmolLM-135M, SmolLM-360M, Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Gemma2-2B, Gemma2-27B, MisleadLM

Metrics

Pearson correlation (ρ), Cross-entropy (honesty prediction loss), Logistic regression R^2 (honesty prediction), Proportion honest > deceptive (Table 2)

Datasets

MATH, MMLU, MMLU-PRO, ARC, OpenBookQA, RACE (subset), MCTest

