Have LLMs judge and train themselves: iterative self-rewards boost instruction-following and the model's own evaluator.

January 18, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

9

Authors

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

Links

Abstract / PDF

Why It Matters For Business

Self-rewarding training can reduce dependence on large human-preference datasets by letting an LLM generate and score its own training data, which lowers labeling cost and enables iterative improvement. It still needs monitoring for safety and domain gaps.

Summary TLDR

This paper introduces "Self-Rewarding" LLMs that generate prompts, produce multiple candidate answers, and score those answers with the same model (LLM-as-a-Judge). Training follows an Iterative DPO pipeline: a pretrained model is fine-tuned on IFT+EFT seed data, then repeatedly trained on preference pairs built from its own judged generations. Fine-tuning Llama 2 70B for three iterations improved the AlpacaEval 2.0 win rate and the judge's agreement with human rankings at every iteration, and MT-Bench rose from 6.78 (M1) to 7.25 (M3). The method reduces reliance on large human preference datasets but needs safety checks and broader evaluation.

Problem Statement

Human preference labels and fixed reward models limit how far aligned LLMs can improve. Can an LLM act as both generator and rewarder, then iteratively train on its own judged generations to improve instruction following and its reward-modeling ability?

Main Contribution

Proposes Self-Rewarding LLMs: a single model that both generates responses and scores them via LLM-as-a-Judge prompting.

Implements an iterative pipeline (IFT+EFT → generate candidate responses → self-score → form preference pairs → DPO) called Iterative DPO for self-alignment.

Empirical study: Llama 2 70B fine-tuned with three iterations yields consistent gains in instruction-following (AlpacaEval) and reward-model alignment (pairwise accuracy with human rankings).
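The "self-score → form preference pairs" step can be sketched as below. This is a minimal illustration, not the authors' code: the `PreferencePair` type and `build_preference_pair` helper are hypothetical names, and dropping prompts whose best and worst scores tie is an assumption made here so that every pair carries a preference signal.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container for illustration)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    scores: Sequence[float],
) -> Optional[PreferencePair]:
    """Use the highest-scored candidate as chosen and the lowest as rejected."""
    if len(candidates) != len(scores) or len(candidates) < 2:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        # Assumption: a tie gives no preference signal, so skip this prompt.
        return None
    return PreferencePair(prompt, candidates[best], candidates[worst])


pair = build_preference_pair(
    "Explain DPO briefly.",
    ["answer A", "answer B", "answer C"],
    [2.0, 4.5, 1.0],
)
print(pair)
```

The resulting pairs would then be fed to whatever DPO trainer is in use.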

Key Findings

Instruction-following win rate against GPT-4 Turbo (AlpacaEval 2.0) rose across iterations.

Numbers: M1 9.94% → M2 15.38% → M3 20.44%

Reward-model alignment with human rankings improved each iteration.

Numbers: Pairwise accuracy: SFT 65.1% → M1 78.7% → M2 80.4% → M3 81.7%
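For readers who want to reproduce a pairwise-accuracy number on their own data, one plausible way to compute it is sketched below. The conventions are assumptions rather than the paper's exact protocol: human ranks are "lower is better", model scores are "higher is better", pairs tied in the human ranking are skipped, and a model tie counts as a miss.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(
    human_ranks: Sequence[float],
    model_scores: Sequence[float],
) -> float:
    """Fraction of response pairs where the model's order matches the human order."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_ranks)), 2):
        if human_ranks[i] == human_ranks[j]:
            continue  # assumption: ignore pairs humans did not distinguish
        total += 1
        if model_scores[i] == model_scores[j]:
            continue  # assumption: a model tie counts as a disagreement
        human_prefers_i = human_ranks[i] < human_ranks[j]
        model_prefers_i = model_scores[i] > model_scores[j]
        if human_prefers_i == model_prefers_i:
            agree += 1
    return agree / total if total else float("nan")


# Three responses to one prompt: humans rank them 1, 2, 3; the judge scores them 5, 3, 4.
print(pairwise_accuracy([1, 2, 3], [5, 3, 4]))
```

In practice the count would be accumulated over all prompts in the held-out set before dividing.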

Overall MT-Bench score increased from M1 to M3, after a slight dip from the SFT baseline to M1.

Numbers: Overall: SFT 6.85 → M1 6.78 → M2 7.01 → M3 7.25 (scale 10)

Results

AlpacaEval 2.0 win rate vs GPT-4 Turbo

Value: M1 9.94%; M2 15.38%; M3 20.44%

Baseline: GPT-4 Turbo

Pairwise accuracy with human rankings

Value: SFT 65.1%; M1 78.7%; M2 80.4%; M3 81.7%

Baseline: SFT Baseline (IFT-only)

MT-Bench overall score

Value: SFT 6.85; M1 6.78; M2 7.01; M3 7.25 (scale 10)

Baseline: SFT Baseline

Reward-model Spearman correlation with humans

Value: SFT 0.253; M1 0.279; M2 0.331; M3 0.349

Baseline: SFT Baseline
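As a reference for the Spearman figure, here is a dependency-free sketch of the standard rank correlation (Pearson correlation of average ranks). How the paper aggregates it across prompts is not stated in this summary, so the per-list usage shown is an assumption; `average_ranks` and `spearman` are illustrative helper names.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


human_scores = [1, 2, 3]   # hypothetical human quality scores
judge_scores = [1, 3, 2]   # hypothetical self-judge scores
print(spearman(human_scores, judge_scores))
```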

Who Should Care

What To Try In 7 Days

Run one iteration: fine-tune a production model on a small IFT seed, add EFT examples, generate candidate responses and self-score them, then apply DPO with the top/bottom pairs.

Test the LLM-as-a-Judge prompt (additive 5-point scoring) on held-out human-ranked data to validate judge quality before scaling.

Monitor length and task-wise gains; compare MT-Bench and a focused reasoning benchmark to spot regressions.
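To validate the judge before scaling, the free-text judge output first has to be turned into a number. The sketch below assumes the judge is prompted to end with a line of the form "Score: <points>" on a 0 to 5 scale; that output format, the helper names, and the choice to average several sampled judgments are assumptions for illustration.

```python
import re
from typing import Iterable, Optional

# Assumed output convention: the judge ends with "Score: <number>".
_SCORE_RE = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")


def parse_judge_score(judge_output: str, max_score: float = 5.0) -> Optional[float]:
    """Extract the last 'Score: N' from a judge response; None if missing or out of range."""
    matches = _SCORE_RE.findall(judge_output)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 0.0 <= score <= max_score else None


def mean_score(judge_outputs: Iterable[str]) -> Optional[float]:
    """Average the parseable scores from several sampled judgments of one response."""
    scores = [s for s in map(parse_judge_score, judge_outputs) if s is not None]
    return sum(scores) / len(scores) if scores else None


samples = [
    "The answer is relevant and mostly complete.\nScore: 3",
    "Clear, well organized, fully addresses the question.\nScore: 5",
    "I cannot evaluate this.",  # unparseable output is ignored
]
print(mean_score(samples))
```

Tracking the fraction of unparseable judgments alongside the mean is a cheap early signal that the judge prompt is not being followed.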

Agent Features

Memory

  • short-term (in-iteration generated examples used for next training step)

Planning

  • iterative self-alignment (generate → evaluate → DPO train)

Tool Use

  • LLM-as-a-Judge prompting (model scores candidates)
  • Self-Instruct prompt generation

Frameworks

  • Instruction Fine-Tuning (IFT)
  • Evaluation Fine-Tuning (EFT)
  • Iterative DPO

Is Agentic

true

Architectures

  • Llama 2 70B (base model used)

Collaboration

  • single-model loop (no multi-agent coordination)

Optimization Features

Token Efficiency

  • SFT

Training Optimization

  • Iterative DPO preference tuning
  • resampling EFT to reduce score skew
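For orientation, the per-pair DPO objective used in preference tuning can be written in a few lines. This is a scalar sketch of the standard DPO loss, not the training code from the paper: inputs are summed sequence log-probabilities under the policy and a frozen reference model, and `beta=0.1` is an assumed illustrative value.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: illustrative value, tune per setup
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has shifted probability toward the chosen response: loss is lower.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A real training loop would compute these log-probabilities in batches with an autodiff framework; the arithmetic per pair is the same.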

Reproducibility

Data Urls

  • Open Assistant (Köpf et al., 2023) - used as IFT/EFT seed

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only three iterations and one model family (Llama 2 70B) were tried; longer-term scaling laws unknown.
  • Evaluation uses LLM evaluators (GPT-4) while training uses an LLM judge, so judge–evaluator bias is possible.
  • Seed data (Open Assistant) underrepresents math/reasoning, so gains are smaller there.
  • Generations grew longer across iterations, which may bias automatic quality judgments.

When Not To Use

  • In safety-critical deployments before thorough safety-specific evaluation and guardrails.
  • When you lack reliable seed EFT examples that teach the model how to score responses.
  • If you need immediate strong math/reasoning gains without relevant seed data.

Failure Modes

  • Reward-hacking: model might learn shortcuts that raise self-scores without real quality gains.
  • Judge–evaluator overfitting: model may learn to please its own judge format rather than humans.
  • Domain drift: self-generated data may not cover tasks outside the seed distribution (e.g., math).

Core Entities

Models

  • Llama 2 70B
  • ChatLlama 70B
  • GPT-4
  • GPT-4 Turbo
  • Claude 2
  • Gemini Pro

Metrics

  • Accuracy
  • win rate (head-to-head)
  • MT-Bench score (out of 10)
  • Spearman correlation
  • Kendall tau
  • exact match %
  • 5-best %

Datasets

  • Open Assistant (seed IFT/EFT)
  • AlpacaEval 2.0 (evaluation)
  • MT-Bench
  • ARC-Easy
  • ARC-Challenge
  • HellaSwag
  • SIQA
  • PIQA
  • GSM8K
  • MMLU
  • OBQA
  • Natural Questions (NQ)

Benchmarks

  • AlpacaEval 2.0
  • MT-Bench
  • ARC
  • MMLU
  • GSM8K
  • HellaSwag