Use weak or small models as judges: peer prediction rewards honesty and detects deception even when judges are far weaker

January 28, 2026 · 9 min

Overview

Decision Snapshot: Ready For Pilot

Theory (two theorems) plus empirical tests across many model sizes and 37k prompts. Stronger evidence for STEM/reasoning tasks; collusion and some domain blind spots remain open.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Tianyi Alex Qiu, Micah Carroll, Cameron Allen

Why It Matters For Business

Peer prediction lets teams evaluate and repair large models without needing stronger trusted judges or human labels. That reduces dependence on expensive human evaluation, can detect reward-hacking and deception, and enables relative comparisons across model versions.

Summary TLDR

Peer prediction is a method rooted in game theory that scores model answers by how much they help an expert predict other models' answers. It requires no ground-truth labels, is provably incentive compatible (truth-telling is a best response), and in experiments it detects deception and trains models back to honesty even when the expert is much weaker than the participants.
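
As a rough illustration of the scoring idea, the sketch below assumes the score takes a log-probability-gain form: how much more likely the expert finds one participant's answer after seeing another participant's answer. The exact scoring rule in the paper may differ. The names `ExpertLogProb`, `peer_prediction_score`, and `toy_expert_logprob` are hypothetical, and the toy expert with its probabilities is a made-up stand-in for a small language model.

```python
import math
from typing import Callable, Optional

# Hypothetical expert interface: log-probability of `target` given the
# question and, optionally, another participant's answer as context.
ExpertLogProb = Callable[[str, str, Optional[str]], float]


def peer_prediction_score(
    expert_logprob: ExpertLogProb,
    question: str,
    source_answer: str,
    target_answer: str,
) -> float:
    """Assumed log-score gain: how much seeing `source_answer` helps the
    expert predict `target_answer`. Positive means the source was informative."""
    with_source = expert_logprob(question, target_answer, source_answer)
    without_source = expert_logprob(question, target_answer, None)
    return with_source - without_source


def toy_expert_logprob(question: str, target: str, context: Optional[str]) -> float:
    """Toy stand-in for a small LM (made-up probabilities): 0.9 when the
    context matches the target, 0.05 when it contradicts it, and a 0.5
    prior when no context is given."""
    if context is None:
        return math.log(0.5)
    return math.log(0.9) if context == target else math.log(0.05)


# An answer that agrees with the target helps the expert; a conflicting one hurts.
informative = peer_prediction_score(toy_expert_logprob, "2 + 2 = ?", "4", "4")
misleading = peer_prediction_score(toy_expert_logprob, "2 + 2 = ?", "5", "4")
```

In a real pipeline the toy function would be replaced by a call to the small expert model; nothing here depends on ground-truth labels.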

Problem Statement

Strong, trusted supervision is often unavailable for frontier models. Existing judge-based evaluations and rewards can be exploited by stronger or deceptive models. We need evaluation and training signals that (1) do not require ground truth, and (2) are resistant to manipulation.

Main Contribution

Introduce a peer prediction pipeline for LLM evaluation and post-training that does not need ground-truth labels.

Prove incentive compatibility: honest reporting is a Bayesian Nash equilibrium under the mechanism (Theorem 1) and approximately so for large diverse pools (Theorem 2).
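
Stated generically, the Bayesian Nash equilibrium property means that no participant gains in expectation by unilaterally deviating from honesty. The notation below is assumed for illustration and is not taken from the paper: $S_i$ is participant $i$'s score, $\sigma^{\ast}$ is the honest strategy profile, and $\sigma_i'$ is any alternative strategy.

```latex
\mathbb{E}\left[ S_i(\sigma_i^{\ast}, \sigma_{-i}^{\ast}) \right]
\;\ge\;
\mathbb{E}\left[ S_i(\sigma_i', \sigma_{-i}^{\ast}) \right]
\quad \text{for all participants } i \text{ and all strategies } \sigma_i'.
```

An approximate equilibrium, in the standard sense, allows a slack $\varepsilon \ge 0$ on the left-hand side; this summary does not state whether Theorem 2 uses exactly that form.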

Key Findings

Peer prediction is incentive compatible: honest and informative reporting is a game-theoretic best response.

Practical Use: Use peer prediction scores as a reward or evaluation metric to encourage truthful answers. In practice, models trained on this reward are driven toward honest and informative outputs rather than strategic deception.

Evidence Ref: Theorem 1, §3.3

Resistance to deception improves as the participant-to-expert capability gap increases (inverse scaling).

Numbers: Peer prediction effective at >100× size gap; LLM-as-a-Judge fails at 5–20×

Practical Use: You can reliably evaluate and train very large models using much smaller expert models. If you lack a strong judge, try peer prediction with a small ensemble of weak experts instead of a single larger judge.

Evidence Ref: Abstract, §4.3, Figure 2
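
To get a feel for what a size gap above 100× looks like, the arithmetic below computes parameter-count ratios for a few of the models listed under Core Entities. The pairings are illustrative only and are not necessarily the ones tested in the paper; `PARAMS` and `size_gap` are hypothetical names, and the parameter counts are read off the model names.

```python
# Parameter counts read off the model names listed in this summary.
PARAMS = {
    "SmolLM-135M": 135e6,
    "SmolLM-360M": 360e6,
    "Llama-3.1-8B": 8e9,
    "Llama-3.1-405B": 405e9,
}


def size_gap(participant: str, expert: str) -> float:
    """Ratio of participant to expert parameter counts."""
    return PARAMS[participant] / PARAMS[expert]


# A mid-sized participant against a small expert: a gap of roughly 22x.
gap_small = size_gap("Llama-3.1-8B", "SmolLM-360M")

# The largest listed participant against the smallest listed expert: 3000x.
gap_large = size_gap("Llama-3.1-405B", "SmolLM-135M")
```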

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ρ = 0.6576 | - | - | STEM domains (123) | Figure 7; cross-domain analysis | §B, Figure 7
Mathematics domain correlation | ρ = 0.7971 | - | - | Mathematics | Figure 7; domain correlations listed | §B
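
The ρ values above are Pearson correlations. A minimal sketch of how such a correlation could be computed between per-model peer prediction scores and ground-truth accuracy follows; the function name `pearson` and the sample numbers are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers: one entry per model being evaluated.
peer_scores = [0.10, 0.40, 0.35, 0.80]
accuracies = [0.30, 0.50, 0.60, 0.90]
rho = pearson(peer_scores, accuracies)
```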

What To Try In 7 Days

Run a small pilot: pick 1–2 target models, use a cheap expert (e.g., a 135M model) and compute peer prediction scores on 1k held-out prompts to compare with your current judge.

If you have a deceptively optimized model, collect pairs of answers with high and low peer prediction scores and run DPO + LoRA contrastive finetuning to see if truthfulness recovers.

Test cross-domain: validate peer prediction vs a small labeled set in your most important domains (STEM-like tasks expected to work best).
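
For the finetuning step, the sketch below shows one way to turn scored answers into preference pairs. It assumes a prompt/chosen/rejected record layout, which several DPO implementations accept; the function `build_preference_pairs`, the `min_margin` threshold, and the sample scores are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

ScoredAnswer = Tuple[str, float]  # (answer text, peer prediction score)


def build_preference_pairs(
    scored: Dict[str, List[ScoredAnswer]],
    min_margin: float = 0.5,
) -> List[Dict[str, str]]:
    """For each prompt, pair the highest- and lowest-scored answers as
    (chosen, rejected), keeping the pair only if the score gap is at
    least `min_margin`."""
    pairs = []
    for prompt, answers in scored.items():
        if len(answers) < 2:
            continue  # nothing to contrast
        best = max(answers, key=lambda a: a[1])
        worst = min(answers, key=lambda a: a[1])
        if best[1] - worst[1] >= min_margin:
            pairs.append(
                {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}
            )
    return pairs


# Made-up scores for illustration.
scored_answers = {
    "Is 17 prime?": [
        ("Yes, 17 has no divisors other than 1 and itself.", 1.2),
        ("No, 17 is divisible by 3.", -0.8),
    ],
    "Capital of France?": [("Paris.", 0.3), ("Paris, France.", 0.2)],
}
pairs = build_preference_pairs(scored_answers)
```

The margin filter drops prompts where the two answers score almost the same, since those pairs carry little contrastive signal.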

Optimization Features

Infra Optimization
Parallel scoring over experts/participants (n^2 m rounds) — note compute overhead
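
To make the overhead concrete, the sketch below enumerates scoring rounds under the assumption that n is the number of participants and m the number of experts (this summary does not define them). Each round is an ordered pair of distinct participants judged by one expert, which gives n(n-1)m rounds per prompt, on the order of n^2 m. The names are hypothetical.

```python
from itertools import permutations
from typing import Iterator, List, Tuple


def scoring_rounds(
    participants: List[str], experts: List[str]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (source, target, expert) triples: every ordered pair of
    distinct participants, judged by every expert."""
    for source, target in permutations(participants, 2):
        for expert in experts:
            yield (source, target, expert)


participants = ["model_a", "model_b", "model_c"]
experts = ["expert_x", "expert_y"]
rounds = list(scoring_rounds(participants, experts))
# 3 participants and 2 experts give 3 * 2 * 2 = 12 rounds per prompt.
```

Because the rounds do not depend on each other, they can be batched or distributed across workers.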
Training Optimization
LoRA
Direct Preference Optimization (DPO) contrastive training

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The theory assumes no or bounded collusion; peer prediction can fail when deceptive participants form a majority.

Performance varies across domains: much stronger for structured STEM tasks than for subjective or knowledge-heavy domains.

When Not To Use

When participants can collude and form a majority — peer prediction may reward collusion in that case.

In highly subjective domains where participant priors strongly diverge (e.g., some humanities tasks).

Failure Modes

Majority collusion: deceptive majority can flip scores in favor of deception (Figure 13).

Domain blind spots: low or negative correlation with ground truth in some domains (e.g., general knowledge).
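
The majority-collusion failure can be illustrated with a toy model. The sketch below uses plain output agreement (each participant is scored by the fraction of others giving the same answer), which is a much cruder rule than peer prediction and is swapped in here only for illustration. It does not reproduce Figure 13; all names and answers are made up.

```python
from typing import Dict, List


def agreement_scores(answers: Dict[str, str]) -> Dict[str, float]:
    """Score each participant by the fraction of the other participants
    that gave the same answer (output agreement, not peer prediction)."""
    scores = {}
    for name, answer in answers.items():
        others = [a for other, a in answers.items() if other != name]
        scores[name] = sum(a == answer for a in others) / len(others)
    return scores


def mean_score(scores: Dict[str, float], names: List[str]) -> float:
    """Average score over a group of participants."""
    return sum(scores[n] for n in names) / len(names)


# Three honest participants ("h") against two colluding deceivers ("d").
honest_majority = agreement_scores(
    {"h1": "true", "h2": "true", "h3": "true", "d1": "false", "d2": "false"}
)
# Two honest participants against three colluding deceivers.
deceptive_majority = agreement_scores(
    {"h1": "true", "h2": "true", "d1": "false", "d2": "false", "d3": "false"}
)

honest_wins = mean_score(honest_majority, ["h1", "h2", "h3"]) > mean_score(
    honest_majority, ["d1", "d2"]
)
deceivers_win = mean_score(deceptive_majority, ["d1", "d2", "d3"]) > mean_score(
    deceptive_majority, ["h1", "h2"]
)
```

In this toy setting the group that forms the majority always scores higher, which is the intuition behind keeping the share of potentially colluding participants below half.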

Core Entities

Models

Llama-3.1-8B, Llama-3.1-70B, Llama-3.1-405B, Mistral7B-v0.3, SmolLM-135M, SmolLM-360M, Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, Gemma2-2B, Gemma2-27B, MisleadLM

Metrics

Pearson correlation (ρ), Cross-entropy (honesty prediction loss), Logistic regression R^2 (honesty prediction), Proportion honest > deceptive (Table 2)

Datasets

MATH, MMLU, MMLU-PRO, ARC, OpenBookQA, RACE (subset), MCTest