Distill explicit reward models and use pessimism to stop DPO’s degenerate alignment

May 29, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant

Why It Matters For Business

If you fine-tune assistants from pairwise preferences, distilling explicit reward models (and using small ensembles) reduces brittle failures from biased or sparse preference labels while keeping offline training simple.

Summary TLDR

The paper shows that Direct Preference Optimization (DPO) can produce degenerate policies that assign near-zero probability to good training responses. It proposes reward model distillation: train an explicit reward model from preference data and train the policy so its implicit reward (log-likelihood ratio) matches the explicit reward. To handle uncertainty, optimize pessimistically over a small set or ensemble of reward models (implemented via a forward-KL regularizer). Theory proves distillation recovers the RLHF optimum given sufficient support, and pessimism prevents DPO’s infinite-reward degeneracies. Empirically on TL;DR summarization with simulated length bias, distillation—and an en

Problem Statement

Offline preference tuning methods such as DPO are simple but can overfit: with finitely many preference pairs, DPO’s implicit reward can grow without bound, producing policies that concentrate on outputs outside the training data or collapse the likelihood of preferred responses. We need a robust offline method that keeps DPO’s simplicity but avoids these degenerate optima.
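To make the blow-up concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written as a function of the implicit reward margin. The function name and the sample margins are illustrative assumptions, not taken from the paper.

```python
import math

def dpo_pair_loss(margin: float) -> float:
    """DPO loss for one preference pair as a function of the implicit reward margin.

    margin = beta * [log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x))]
    loss   = -log(sigmoid(margin)) = log(1 + exp(-margin))
    """
    # Numerically stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Illustrative margins: the loss keeps shrinking but never reaches zero.
margins = [0.0, 1.0, 5.0, 20.0, 50.0]
losses = [dpo_pair_loss(m) for m in margins]
for m, loss in zip(margins, losses):
    print(f"margin={m:5.1f}  loss={loss:.3e}")
```

Because the per-pair loss is strictly decreasing in the margin, a pair whose label is never contradicted in the data is best fit by an unbounded margin. One way to reach that is to drive the probability of the dispreferred response toward zero, which places no constraint on where the freed probability mass goes.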

Main Contribution

Theoretical analysis showing DPO can have degenerate global optima that assign near-zero probability to preferred training outputs.

A simple squared distillation loss that trains a policy so its implicit reward matches an explicit reward model.

A pessimistic extension that optimizes worst-case advantage over a set/ensemble of reward models via a forward-KL penalty.

Empirical evaluation on TL;DR (simulated length-bias) and Anthropic Helpfulness showing better robustness under dataset bias and modest gains in unbiased settings.
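In symbols, the squared distillation loss described above can be written roughly as follows. This is a reconstruction in standard DPO notation (β for the KL temperature, π_ref for the reference policy, r for an explicit reward model), offered as an illustration; the paper's exact notation and sampling distribution over response pairs may differ.

```latex
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

\mathcal{L}_{\mathrm{distill}}(r, \pi_\theta)
  = \mathbb{E}_{(x, y_1, y_2)}
    \Big[ \big( r(x, y_1) - r(x, y_2)
      - \big( \hat{r}_\theta(x, y_1) - \hat{r}_\theta(x, y_2) \big) \big)^2 \Big]
```

Working with the difference between two responses to the same prompt cancels any per-prompt constant, such as the log-partition term, which is why the loss is defined on pairs rather than on single responses.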

Key Findings

DPO can converge to degenerate optima that place mass off-training and drive preferred-response likelihoods near zero.

Matching the policy’s implicit reward to an explicit reward model recovers the RLHF optimum when the distillation data has sufficient support.

An ensemble-based pessimistic distillation (e-DPO) improves robustness when the training preference data is biased.

Numbers: e-DPO vs SFT win rate 65.8%; DPO vs SFT 64.2%; p < .01 for distillation methods when ρ ≤ 0.5

Distillation gives modest wins in unbiased settings.

Numbers: On Anthropic Helpfulness, e-DPO wins against SFT 65.8% of the time, compared with 64.2% for DPO against SFT; e-DPO wins against DPO 49.7% of the time (ties 3.4%)
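The finding that matching implicit to explicit rewards recovers the RLHF optimum can be checked on a toy problem. The sketch below assumes a single prompt with three candidate responses and made-up values for the reference policy, rewards, and β; it uses the standard closed form for the KL-regularized optimum, where π*(y) is proportional to π_ref(y)·exp(r(y)/β).

```python
import itertools
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy over 3 responses
reward = [1.0, 0.0, -1.0]  # hypothetical explicit rewards r(x, y)

# KL-regularized (RLHF) optimum: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta).
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

def implicit_reward(pi):
    """Implicit reward beta * log(pi(y) / pi_ref(y)) for each response."""
    return [beta * math.log(p / q) for p, q in zip(pi, pi_ref)]

def distill_loss(pi):
    """Mean squared gap between explicit and implicit reward differences over all pairs."""
    r_hat = implicit_reward(pi)
    pairs = list(itertools.combinations(range(len(pi)), 2))
    gaps = [((reward[i] - reward[j]) - (r_hat[i] - r_hat[j])) ** 2 for i, j in pairs]
    return sum(gaps) / len(pairs)

loss_at_optimum = distill_loss(pi_star)   # zero up to floating-point error
loss_at_reference = distill_loss(pi_ref)  # strictly positive
print(loss_at_optimum, loss_at_reference)
```

Here every pair of responses is covered, so the loss is zero only at the RLHF optimum. With pairs missing from the data, other policies could also reach zero loss, which is the role of the "sufficient support" condition.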

Results

Win rate vs SFT: 65.8% (baseline: SFT)

Win rate vs SFT: 65.6% (baseline: SFT)

Win rate vs SFT: 64.2% (baseline: SFT)

e-DPO vs DPO win rate: 49.7% (baseline: DPO)

Significance of distillation under bias: p < .01 (baseline: DPO and IPO)

Who Should Care

What To Try In 7 Days

Train a small explicit reward model on your preference data and compute pairwise reward differences.

Distill that reward into your policy with the squared pairwise loss (L2 on reward differences).

If labels might be biased, train 3–5 reward models with varied sampling and use a pessimistic ensemble via a forward-KL penalty during distillation.
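The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the dictionary keys, the example numbers, and the helper names are hypothetical, and the ensemble step uses simple worst-case (minimum) aggregation of member scores as a stand-in for the forward-KL formulation, which is not reproduced here.

```python
def implicit_reward_diff(logp_1, logp_2, ref_logp_1, ref_logp_2, beta):
    """beta * [log(pi(y1)/pi_ref(y1)) - log(pi(y2)/pi_ref(y2))] from sequence log-probs."""
    return beta * ((logp_1 - ref_logp_1) - (logp_2 - ref_logp_2))

def worst_case_reward(member_rewards):
    """Conservative aggregate: the lowest score any ensemble member assigns.

    Assumption: a simplified substitute for the paper's pessimistic objective.
    """
    return min(member_rewards)

def distill_loss(batch, beta):
    """Mean squared gap between explicit and implicit reward differences."""
    total = 0.0
    for ex in batch:
        target = worst_case_reward(ex["rewards_1"]) - worst_case_reward(ex["rewards_2"])
        implied = implicit_reward_diff(
            ex["logp_1"], ex["logp_2"], ex["ref_logp_1"], ex["ref_logp_2"], beta
        )
        total += (target - implied) ** 2
    return total / len(batch)

# Hypothetical example: scores from a 3-member reward ensemble for two responses,
# plus sequence log-probabilities under the policy and the reference model.
batch = [
    {
        "rewards_1": [1.2, 0.8, 1.0],
        "rewards_2": [0.1, 0.3, -0.2],
        "logp_1": -12.0, "logp_2": -14.5,
        "ref_logp_1": -13.0, "ref_logp_2": -14.0,
    },
]
loss = distill_loss(batch, beta=0.5)
print(loss)
```

In an actual training loop the same arithmetic would run on framework tensors so that gradients flow through the policy log-probabilities; the reward scores and reference log-probabilities are fixed targets and can be precomputed offline.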

Optimization Features

Model Optimization

  • Distillation

Training Optimization

  • Offline reward distillation
  • Pessimistic ensemble training

Reproducibility

Data Urls

  • TL;DR dataset (Stiennon et al., 2020)
  • Anthropic Helpfulness (Bai et al., 2022)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires training one or more explicit reward models; quality of policy depends on reward-model quality.
  • Ensemble must include a reasonable proxy for true preferences; bad ensembles can be conservative or harmful.
  • Experiments focus on summarization (TL;DR) and one helpfulness set; generalization to other tasks is not fully shown.
  • Results are sensitive to hyperparameters (β, γ / α), which need tuning.

When Not To Use

  • You have reliable online RLHF infrastructure and fresh human feedback available.
  • You cannot train any plausible reward model or lack compute to train ensembles.
  • You want maximal exploration away from the reference policy (pessimism enforces conservatism).

Failure Modes

  • Over-conservatism: pessimism or poor ensemble choice can keep the policy too close to the reference.
  • If reward models are biased in the same way, distillation can propagate the bias to the policy.
  • Mis-tuned hyperparameters may under- or over-regularize and hurt alignment.

Core Entities

Models

  • PaLM-2-XS
  • Gemini 1.0 Ultra (used as judge)

Metrics

  • win rate vs SFT
  • KL divergence to reference
  • bootstrap 95% CI

Datasets

  • TL;DR summarization (Stiennon et al., 2020)
  • Anthropic Helpfulness (Bai et al., 2022)

Context Entities

Models

  • reference policy π_ref
  • SFT

Metrics

  • statistical significance (Wald test p<.01)

Datasets

  • simulated-biased training splits D_ρ (varying length bias)