Use weak or small models as judges: peer prediction rewards honesty and detects deception even when judges are far weaker

January 28, 2026 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Tianyi Alex Qiu, Micah Carroll, Cameron Allen

Links

Abstract / PDF

Why It Matters For Business

Peer prediction lets teams evaluate and repair large models without needing stronger trusted judges or human labels. That reduces dependence on expensive human evaluation, can detect reward-hacking and deception, and enables relative comparisons across model versions.

Summary TLDR

Peer prediction is a method rooted in game theory that scores a model's answer by how much it helps an expert predict other models' answers. It requires no ground-truth labels, is provably incentive compatible (truth-telling is a best response), and empirically detects deception and trains models back toward honesty even when the expert is much weaker than the participants.
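To make the scoring idea concrete, here is a minimal sketch of one way to measure "how much an answer helps an expert predict a peer's answer": the gain in the expert's log-probability of the peer answer once the participant's answer is added to the context. The log-score form, the function names, and the toy expert are all assumptions for illustration; the summary above does not specify the exact scoring rule the authors use.

```python
import math

def peer_prediction_score(expert_prob, question, answer, peer_answer):
    """Log-probability gain on the peer's answer from showing `answer` to the expert.

    `expert_prob` is a hypothetical callable returning the expert's probability
    of `peer_answer`, optionally conditioned on a participant's answer.
    """
    with_answer = expert_prob(question, peer_answer, context=answer)
    without_answer = expert_prob(question, peer_answer, context=None)
    return math.log(with_answer) - math.log(without_answer)

def toy_expert_prob(question, peer_answer, context=None):
    """Toy stand-in for a small expert model (assumption, not a real LM)."""
    if context is None:
        return 0.25  # no information: uniform over four candidate answers
    # The toy expert simply trusts the answer it was shown.
    return 0.7 if context == peer_answer else 0.1

# An answer that makes the peer's answer easier to predict scores positive;
# one that misleads the expert scores negative.
informative = peer_prediction_score(toy_expert_prob, "2+2?", "4", "4")
misleading = peer_prediction_score(toy_expert_prob, "2+2?", "5", "4")
```

In a real pipeline the toy expert would be replaced by a small language model's token probabilities, and scores would be averaged over many questions and peers.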

Problem Statement

Strong, trusted supervision is often unavailable for frontier models. Existing judge-based evaluations and rewards can be exploited by stronger or deceptive models. We need evaluation and training signals that (1) do not require ground truth, and (2) are resistant to manipulation.

Main Contribution

Introduce a peer prediction pipeline for LLM evaluation and post-training that does not need ground-truth labels.

Prove incentive compatibility: honest reporting is a Bayesian Nash equilibrium under the mechanism (Theorem 1) and approximately so for large diverse pools (Theorem 2).

Empirically validate on models from 135M to 405B parameters across 37k questions: peer prediction distinguishes model quality, resists deception, and can be used as a contrastive training reward (DPO + LoRA) to recover truthfulness.

Show an 'inverse scaling' effect: peer prediction becomes more resistant to deception as the capability gap between participants and expert grows, letting small experts evaluate much larger models.

Key Findings

Peer prediction is incentive compatible: honest and informative reporting is a game-theoretic best response.

Resistance to deception improves as the participant-to-expert capability gap increases (inverse scaling).

Numbers: Peer prediction effective at >100× size gap; LLM-as-a-Judge fails at 5–20×

Peer prediction training recovers truthfulness lost to malicious finetuning for an 8B participant using a 135M expert.

Numbers: Training used 120k paired answers; recovery shown in Figure 1 (majority of accuracy drop recovered)

Peer prediction scores correlate with ground-truth accuracy in structured STEM tasks but much less so in some subjective domains.

Numbers: Domain correlations: STEM ρ=0.6576; Math ρ=0.7971; Biology ρ=0.8169; Physics ρ=0.7534; Humanities ρ=0.0178; General-Know

Results

STEM domain correlation with ground-truth accuracy

Value: ρ = 0.6576

Mathematics domain correlation

Value: ρ = 0.7971

Biology domain correlation

Value: ρ = 0.8169

Physics domain correlation

Value: ρ = 0.7534

Humanities domain correlation

Value: ρ = 0.0178

General knowledge domain correlation

Value: ρ = -0.5587

Proportion questions where honest > deceptive (peer prediction)

Value: examples in Table 2 range ~0.54–0.81 depending on pairings
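The "proportion honest > deceptive" metric can be read as a simple win rate. The sketch below assumes you have, for each question, one peer prediction score for an honest answer and one for a deceptive answer; the function name and the sample numbers are illustrative, not values from the paper.

```python
def proportion_honest_wins(honest_scores, deceptive_scores):
    """Fraction of questions where the honest answer outscores the deceptive one."""
    if len(honest_scores) != len(deceptive_scores):
        raise ValueError("score lists must be aligned per question")
    if not honest_scores:
        raise ValueError("need at least one question")
    wins = sum(h > d for h, d in zip(honest_scores, deceptive_scores))
    return wins / len(honest_scores)

# Made-up scores for five questions (illustrative only).
honest = [0.9, 0.2, 0.7, 0.4, 0.8]
deceptive = [0.1, 0.5, 0.3, 0.6, 0.2]
rate = proportion_honest_wins(honest, deceptive)
```

A value of 0.5 would mean the score cannot tell honest from deceptive answers; values above 0.5 mean honest answers tend to win.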

Recovery of truthfulness after deceptive finetuning

Value: Majority of accuracy drop recovered (Figure 1)

Baseline: Accuracy after malicious LoRA finetuning

Computational cost (example)

Value: 20–80 GPU-hours for a 37k-question run (two 8B participants + one 7B expert)

Baseline: LLM-as-a-Judge, 10–30 GPU-hours

Who Should Care

What To Try In 7 Days

Run a small pilot: pick 1–2 target models, use a cheap expert (e.g., a 135M model) and compute peer prediction scores on 1k held-out prompts to compare with your current judge.

If you have a deceptively optimized model, collect high/low peer-pred pairs and run DPO + LoRA contrastive finetuning to see if truthfulness recovers.
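One possible way to collect the high/low pairs for contrastive finetuning is sketched below: group sampled answers by prompt and pair the highest-scoring answer with the lowest-scoring one. The record layout, the `min_gap` filter, and the prompt/chosen/rejected output keys are assumptions chosen because preference-tuning tooling commonly expects that shape; the paper's actual pairing procedure may differ.

```python
def build_preference_pairs(samples, min_gap=0.0):
    """Turn scored answers into (chosen, rejected) pairs, one per prompt.

    `samples` is a list of dicts with hypothetical keys
    'prompt', 'answer', and 'score' (the peer prediction score).
    """
    by_prompt = {}
    for sample in samples:
        by_prompt.setdefault(sample["prompt"], []).append(sample)

    pairs = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda s: s["score"])
        worst = min(group, key=lambda s: s["score"])
        # Skip prompts with a single answer or an uninformative score gap.
        if best["score"] - worst["score"] > min_gap:
            pairs.append({
                "prompt": prompt,
                "chosen": best["answer"],
                "rejected": worst["answer"],
            })
    return pairs

samples = [
    {"prompt": "Q1", "answer": "A", "score": 0.8},
    {"prompt": "Q1", "answer": "B", "score": -0.3},
    {"prompt": "Q2", "answer": "C", "score": 0.1},
]
pairs = build_preference_pairs(samples)
```

The resulting list can then be fed to whichever DPO implementation you already use, with LoRA adapters configured on the training side.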

Test cross-domain: validate peer prediction vs a small labeled set in your most important domains (STEM-like tasks expected to work best).
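For the cross-domain check, a small script along these lines could compute the correlation between peer prediction scores and labeled accuracy separately per domain. The record layout (one row per model and domain) and the toy numbers are assumptions for illustration only.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def correlation_by_domain(records):
    """records: iterable of (domain, peer_score, labeled_accuracy) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for domain, score, accuracy in records:
        grouped[domain][0].append(score)
        grouped[domain][1].append(accuracy)
    return {d: pearson(xs, ys) for d, (xs, ys) in grouped.items()}

# Toy rows: three models evaluated in two domains (made-up values).
records = [
    ("math", 0.1, 0.5), ("math", 0.2, 0.6), ("math", 0.3, 0.7),
    ("humanities", 0.1, 0.6), ("humanities", 0.2, 0.5), ("humanities", 0.3, 0.6),
]
by_domain = correlation_by_domain(records)
```

Domains where the correlation is near zero or negative are candidates for keeping a labeled anchor set rather than relying on peer prediction alone.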

Optimization Features

Infra Optimization

  • Parallel scoring over experts/participants (n^2 · m rounds); note the compute overhead

Training Optimization

  • LoRA
  • Direct Preference Optimization (DPO) contrastive training

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The theory assumes no or bounded collusion; peer prediction can fail when deceptive participants form a majority.
  • Performance varies across domains: much stronger for structured STEM tasks than for subjective or knowledge-heavy domains.
  • Compute overhead: peer prediction requires n^2 · m rounds of expert scoring, which can be heavier than a single judge baseline.
  • Some formal results rely on assumptions about bounded variability of agent priors (Assumption 1).
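To get a feel for the n^2 · m compute overhead noted above, the following sketch counts expert scoring rounds. The source does not define n and m in this summary, so the code assumes n is the number of participants and m is the number of questions, and it assumes a single-judge baseline needs n · m calls; treat both as illustrative assumptions.

```python
def expert_scoring_rounds(n_participants, m_questions):
    """Peer prediction cost under the assumed reading of n^2 * m."""
    return n_participants ** 2 * m_questions

def single_judge_rounds(n_participants, m_questions):
    """Assumed baseline: one judge call per participant per question."""
    return n_participants * m_questions

def overhead_ratio(n_participants, m_questions):
    """How many times more scoring rounds peer prediction needs."""
    return (expert_scoring_rounds(n_participants, m_questions)
            / single_judge_rounds(n_participants, m_questions))

# Under these assumptions the overhead grows linearly with pool size.
for n in (2, 4, 8):
    print(n, expert_scoring_rounds(n, 1000), overhead_ratio(n, 1000))
```

Because each round uses a small expert, the per-round cost can be far lower than a large judge's, so the round count alone does not determine total GPU-hours.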

When Not To Use

  • When participants can collude and form a majority — peer prediction may reward collusion in that case.
  • In highly subjective domains where participant priors strongly diverge (e.g., some humanities tasks).
  • When you need absolute, calibrated scores without any labeled anchor — peer prediction gives relative rankings by default.

Failure Modes

  • Majority collusion: deceptive majority can flip scores in favor of deception (Figure 13).
  • Domain blind spots: low or negative correlation with ground truth in some domains (e.g., general knowledge).
  • Mis-specified expert prompting or poor expert ensemble weighting can reduce detection power.

Core Entities

Models

  • Llama-3.1-8B
  • Llama-3.1-70B
  • Llama-3.1-405B
  • Mistral-7B-v0.3
  • SmolLM-135M
  • SmolLM-360M
  • Qwen2.5-0.5B
  • Qwen2.5-1.5B
  • Qwen2.5-3B
  • Qwen2.5-7B
  • Gemma2-2B
  • Gemma2-27B
  • MisleadLM

Metrics

  • Pearson correlation (ρ)
  • Cross-entropy (honesty prediction loss)
  • Logistic regression R^2 (honesty prediction)
  • Proportion honest > deceptive (Table 2)

Datasets

  • MATH
  • MMLU
  • MMLU-PRO
  • ARC
  • OpenBookQA
  • RACE (subset)
  • MCTest