Have LLMs judge and train themselves: iterative self-rewards boost instruction-following and the model's own evaluator.

January 18, 2024 (7 min)

Overview

Decision Snapshot: Ready For Pilot

The method reliably improves instruction-following and judge alignment in three iterations on Llama 2 70B with Open Assistant seeds, but gains vary by task and safety/robustness testing is incomplete.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

Links

Abstract / PDF / Data

Why It Matters For Business

Self-rewarding training can reduce dependence on large human-preference datasets by letting an LLM generate and score its own training data, which lowers labeling cost and enables iterative improvement. It still needs monitoring for safety issues and domain gaps.

Who Should Care

Summary TLDR

This paper introduces "Self-Rewarding" LLMs that generate prompts, produce multiple candidate answers, and score those answers using the same model (LLM-as-a-Judge). The training scheme, Iterative DPO, starts from a pretrained model, applies supervised fine-tuning on IFT+EFT seed data, then repeats DPO training on self-generated preference pairs. Fine-tuning Llama 2 70B this way for three iterations produced steady gains: instruction-following win rates and the model's agreement with human rankings as a judge improved across iterations, and MT-Bench rose from 6.78 to 7.25. The method reduces reliance on large human preference datasets but needs safety checks and broader evaluation.

Problem Statement

Human preference labels and fixed reward models limit how far aligned LLMs can improve. Can an LLM act as both generator and rewarder, then iteratively train on its own judged generations to improve instruction following and its reward-modeling ability?

Main Contribution

Proposes Self-Rewarding LLMs: a single model that both generates responses and scores them via LLM-as-a-Judge prompting.

Implements an iterative pipeline (IFT+EFT → generate candidate responses → self-score → form preference pairs → DPO) called Iterative DPO for self-alignment.
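
The "self-score → form preference pairs" step of the pipeline above can be sketched as follows. This is a minimal illustration, not the authors' code: the `PreferencePair` type, the `build_preference_pair` helper, and the rule of skipping prompts whose best and worst scores tie are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container for this sketch)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(prompt, scored_candidates):
    """Turn self-scored candidates into a chosen/rejected pair.

    scored_candidates is a list of (response_text, self_score) tuples.
    Returns None when no usable pair can be formed.
    """
    if len(scored_candidates) < 2:
        return None
    best = max(scored_candidates, key=lambda c: c[1])
    worst = min(scored_candidates, key=lambda c: c[1])
    # Assumption: a tie carries no preference signal, so the prompt is skipped.
    if best[1] == worst[1]:
        return None
    return PreferencePair(prompt=prompt, chosen=best[0], rejected=worst[0])


candidates = [("answer A", 3.0), ("answer B", 4.7), ("answer C", 1.3), ("answer D", 4.0)]
pair = build_preference_pair("Explain DPO briefly.", candidates)
tied = build_preference_pair("Another prompt", [("x", 2.0), ("y", 2.0)])
print(pair)
print(tied)
```

Only the highest- and lowest-scored candidates are kept here, which maximizes the score gap within each pair; other pairing rules are possible.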

Key Findings

Instruction-following win rate against GPT-4 Turbo (AlpacaEval 2.0) rose across iterations.

Numbers: M1 9.94% → M2 15.38% → M3 20.44%

Practical Use: You can bootstrap better instruction behavior without huge human preference datasets by iterating self-generated preference training; expect modest but consistent leaderboard gains.

Evidence Ref: Table 1 (AlpacaEval 2.0)

Reward-model alignment with human rankings improved each iteration.

Numbers: Pairwise accuracy: SFT 65.1% → M1 78.7% → M2 80.4% → M3 81.7%

Practical Use: The model's own ability to judge its outputs gets better through self-training, so later iterations produce higher-quality preference data for further training.

Evidence Ref: Table 4 (reward modeling metrics)
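
Pairwise accuracy, the metric quoted in the second finding, measures how often the model orders two responses the same way human raters do. The sketch below shows one way to compute it; the tie handling (skipping pairs that humans rate equal, counting a model tie as a miss) and the toy scores are assumptions for illustration and may differ from the paper's exact protocol.

```python
from itertools import combinations


def pairwise_accuracy(model_scores, human_scores):
    """Fraction of response pairs that the model orders the same way as humans.

    Pairs that humans rate as tied are skipped; a model tie on a pair that
    humans did order counts as a miss. Both choices are assumptions.
    """
    if len(model_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue
        total += 1
        model_diff = model_scores[i] - model_scores[j]
        # Same sign means the model and the humans prefer the same response.
        if model_diff * human_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


human = [5, 3, 4, 1]          # toy human scores for four responses
model = [4.5, 3.5, 3.0, 1.0]  # toy self-assigned scores for the same responses
acc = pairwise_accuracy(model, human)
print(f"pairwise accuracy: {acc:.3f}")
```

In this toy case the model swaps the order of the second and third responses, so five of the six pairs agree.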

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| AlpacaEval win rate vs GPT-4 Turbo | M1 9.94%; M2 15.38%; M3 20.44% | GPT-4 Turbo | M3 +10.5 pp vs M1 | AlpacaEval 2.0 (805 prompts) | Table 1: iteration win rates on AlpacaEval 2.0 | Table 1 |
| Accuracy | SFT 65.1%; M1 78.7%; M2 80.4%; M3 81.7% | SFT Baseline (IFT-only) | M3 +16.6 pp vs SFT | Open Assistant-derived EFT evaluation set | Table 4: pairwise accuracy increases across iterations | Table 4 |

What To Try In 7 Days

Run one iteration: fine-tune a production model on a small IFT seed set, add EFT examples, generate several candidate responses per prompt, self-score them, then apply DPO using the highest- and lowest-scored responses as the chosen/rejected pair.

Test the LLM-as-a-Judge prompt (additive 5-point scoring) on held-out human-ranked data to validate judge quality before scaling.

Monitor length and task-wise gains; compare MT-Bench and a focused reasoning benchmark to spot regressions.
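
Validating the judge on held-out data first requires turning free-text judge output into a number. The sketch below assumes the judge prompt asks the model to end with a line of the form `Score: <points>` on a 0 to 5 scale, and that several sampled judgments are averaged to reduce variance. The pattern, the helper names, and the sample outputs are illustrative assumptions.

```python
import re
from statistics import mean

# Assumed output convention: the judge text contains "Score: <number>".
SCORE_PATTERN = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")


def parse_judge_score(judge_output):
    """Extract a 0-5 score from judge text; return None if absent or out of range."""
    match = SCORE_PATTERN.search(judge_output)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 5.0 else None


def average_judge_score(judge_outputs):
    """Average the parseable scores across several sampled judgments."""
    scores = [s for s in map(parse_judge_score, judge_outputs) if s is not None]
    return mean(scores) if scores else None


samples = [
    "The response is relevant and complete. Score: 4",
    "Clear, well organized, expert-level answer. Score: 5",
    "The response partially addresses the question.",  # no score line
]
avg = average_judge_score(samples)
print(avg)
```

Tracking how many judgments fail to parse is a cheap early signal that the judge prompt or the EFT seed data needs work.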

Agent Features

Memory
short-term (in-iteration generated examples used for next training step)
Planning
iterative self-alignment (generate → evaluate → DPO train)
Tool Use
LLM-as-a-Judge prompting (model scores candidates), Self-Instruct prompt generation
Frameworks
Instruction Fine-Tuning (IFT), Evaluation Fine-Tuning (EFT), Iterative DPO
Is Agentic

Yes

Architectures
Llama 2 70B (base model used)
Collaboration
single-model loop (no multi-agent coordination)

Optimization Features

Token Efficiency
SFT
Training Optimization
Iterative DPO preference tuning, resampling EFT to reduce score skew
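
For readers unfamiliar with DPO preference tuning, the per-pair objective is the negative log-sigmoid of a scaled difference between the policy-versus-reference log-ratios of the chosen and rejected responses. The sketch below computes that standard loss for a single pair from precomputed sequence log-probabilities. The function names, the toy log-probabilities, and `beta=0.1` are assumptions for illustration, not values taken from the paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the loss equals log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(loss_at_init, loss_after)
```

In an iterative setup, the reference model for each round would typically be the model produced by the previous round.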

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Open Assistant (Köpf et al., 2023) - used as IFT/EFT seed

Risks & Boundaries

Limitations

Only three iterations and one model family (Llama 2 70B) were tried; how the gains scale with more iterations or other models is unknown.

Evaluation relies on an LLM evaluator (GPT-4) while training relies on an LLM judge, so the judge and the evaluator may share biases.

When Not To Use

In safety-critical deployments before thorough safety-specific evaluation and guardrails.

When you lack reliable seed EFT examples that teach the model how to score responses.

Failure Modes

Reward-hacking: model might learn shortcuts that raise self-scores without real quality gains.

Judge–evaluator overfitting: model may learn to please its own judge format rather than humans.
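
One inexpensive check for the reward-hacking failure mode is whether self-assigned scores track a superficial feature such as response length. The sketch below computes a Pearson correlation between word count and self-score; the helper names and toy data are assumptions, and a high correlation is only a prompt for closer inspection, not proof of reward hacking.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses, self_scores):
    """Correlate response length (in words) with the model's self-scores."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, self_scores)


responses = [
    "short",
    "a bit longer answer",
    "a much longer answer with many more words",
]
self_scores = [2.0, 3.0, 5.0]
corr = length_score_correlation(responses, self_scores)
print(f"length/score correlation: {corr:.3f}")
```

Comparing this correlation across iterations, alongside human spot checks, helps separate real quality gains from a drift toward longer answers.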

Core Entities

Models

Llama 2 70B, ChatLlama 70B, GPT-4, GPT-4 Turbo, Claude 2, Gemini Pro

Metrics

Accuracy, win rate (head-to-head), MT-Bench score (out of 10), Spearman correlation, Kendall tau, exact match %, 5-best %

Datasets

Open Assistant (seed IFT/EFT), AlpacaEval 2.0 (evaluation), MT-Bench, ARC-Easy, ARC-Challenge, HellaSwag, SIQA, PIQA, GSM8K, MMLU, OBQA, Natural Questions (NQ)

Benchmarks

AlpacaEval 2.0, MT-Bench, ARC, MMLU, GSM8K, HellaSwag