A simple, compute-efficient loop that generates model outputs, filters them by a learned reward, and fine-tunes the model offline to align L

August 17, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

18

Authors

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas

Links

Abstract / PDF

Why It Matters For Business

ReST boosts model alignment with human preferences using offline data reuse, cutting the compute cost of repeated online RL sampling while improving translation quality and human-rated outputs.

Summary TLDR

ReST is a practical recipe for aligning generative language models using growing batches of model-generated data. It alternates two steps: (1) Grow: sample many candidate outputs from the current policy, score them with a reward model, and add them to the dataset; (2) Improve: filter candidates by a reward threshold and fine-tune the model on the survivors (often with a standard supervised loss). Experiments on machine translation (IWSLT, WMT, internal Web Domain) show steady gains in learned reward and in human ratings. ReST is more compute-efficient than online RL and is most effective with multiple Improve steps and a BC (NLL) loss. Key risks: overfitting to the learned reward and reward-model generalization as
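A minimal sketch of the Grow/Improve alternation described above. The callables `sample`, `reward`, and `fine_tune` are hypothetical wrappers around your own policy, reward model, and trainer; none of these names come from the paper. Accumulating samples across Grow steps and comparing with `>=` are also assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# (input, output, reward); this layout is an assumption for the sketch.
Example = Tuple[str, str, float]


def rest_loop(
    inputs: List[str],
    sample: Callable[[str, int], List[str]],      # hypothetical: draws n outputs from the current policy
    reward: Callable[[str, str], float],          # hypothetical: learned reward model score
    fine_tune: Callable[[List[Example]], None],   # hypothetical: one offline fine-tuning pass
    grow_steps: int,
    thresholds: List[float],                      # one threshold per Improve step, ideally increasing
    samples_per_input: int,
) -> List[Example]:
    dataset: List[Example] = []
    for _ in range(grow_steps):
        # Grow: sample from the current policy and score every candidate once.
        for x in inputs:
            for y in sample(x, samples_per_input):
                dataset.append((x, y, reward(x, y)))
        # Improve: reuse the same scored data for several filtered fine-tuning passes.
        for tau in thresholds:
            kept = [ex for ex in dataset if ex[2] >= tau]
            if kept:
                fine_tune(kept)
    return dataset
```

Scoring happens only inside Grow, so each Improve pass reuses stored rewards instead of calling the reward model again.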

Problem Statement

Online RLHF is compute-heavy and prone to reward hacking. Offline RL can be efficient but is limited by the initial dataset quality. How can we iteratively improve a language model with human-preference rewards while reusing generated data and reducing compute?

Main Contribution

ReST algorithm: an iterative Grow (sample) and Improve (filter + offline fine-tune) loop for aligning language models with learned rewards.

Empirical study on machine translation showing ReST improves learned reward scores and human ratings across IWSLT, WMT, and an internal Web Domain dataset.

Practical findings: (a) multiple Improve steps with increasing reward thresholds boost performance, (b) standard BC (NLL) loss often outperforms more complex offline RL losses in this setting, (c) ReST is more compute-efficient than online PPO.
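One way to write the Improve step from the first contribution, as a sketch rather than the paper's exact notation. The symbols are assumptions: $\mathcal{D}_g$ is the dataset produced by Grow, $R$ the learned reward model, $\tau_i$ the threshold at Improve step $i$, and $\pi_\theta$ the policy.

```latex
% Filtered objective for Improve step i: only samples with reward >= tau_i contribute.
J_i(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_g}
  \left[ \mathbb{1}\!\left[ R(x, y) \ge \tau_i \right] \, \mathcal{L}(x, y; \theta) \right],
\qquad \tau_1 < \tau_2 < \dots < \tau_N

% With the BC (NLL) loss, \mathcal{L} is the token-level negative log-likelihood.
\mathcal{L}_{\mathrm{NLL}}(x, y; \theta) = - \sum_{t=1}^{T} \log \pi_\theta\!\left( y_t \mid y_{<t}, x \right)
```

Raising $\tau_i$ at each step shrinks the training set toward higher-reward samples, which corresponds to finding (a).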

Key Findings

Each additional Improve step raises the model's average reward on validation.

Numbers (Figure 3): steady increases across IWSLT, WMT, and Web Domain (rewards normalized 0–100)

A second Grow step can add substantial gain over the first.

Numbers: IWSLT De→En, +5.3 points between the first and second Grow steps (average reward scale)

ReST beats supervised learning and matches or outperforms online RL when the data from each Grow step is amortized across several Improve steps.

Numbers (Table 1): BC 70.9 → ReST (G=1,I=4) 77.8 → ReST (G=2,I=3) 83.1; Online RL 71.6

Simple BC (NLL) loss worked best among tested offline RL losses in these experiments.

Numbers (Figure 5 and Appendix A.7): BC outperformed GOLD, BVMPO, and OAC across settings

Human ratings improve but do not track reward-model gains exactly; reward models can misrank methods.

Numbers (Figure 7): all ReST variants beat BC in human scores, but their ranking differs from the ranking by reward score

Results

Average reward (validation/test)

Value: BC (G=0,I=0): 70.9; ReST (G=1,I=0): 71.9; ReST (G=1,I=4): 77.8; ReST (G=2,I=3): 83.1; Online RL: 71.6

Baseline: BC (G=0,I=0) = 70.9
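To make the gaps explicit, the reported average rewards can be turned into gains over the BC baseline. This only does arithmetic on the figures listed above; the dictionary name and layout are illustrative.

```python
# Average reward per method, copied from the values reported above.
rewards = {
    "BC (G=0,I=0)": 70.9,
    "ReST (G=1,I=0)": 71.9,
    "ReST (G=1,I=4)": 77.8,
    "ReST (G=2,I=3)": 83.1,
    "Online RL": 71.6,
}

baseline = rewards["BC (G=0,I=0)"]

# Absolute gain over BC, rounded to one decimal to match the reported precision.
gains = {name: round(score - baseline, 1) for name, score in rewards.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

On these numbers the best ReST setting gains 12.2 points over BC, against 0.7 for online RL.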

Best-of-N sampling impact (reward)

Value: ReST with I=3 and N=200 reaches reward = 1.0 (max)

Baseline: BC with N=200 (BC needed a larger N to match ReST at a lower N)

Human evaluation difference vs BC

Value: All tested ReST variants outperform BC in human ratings (average difference positive)

Baseline: BC (G=0,I=0)

Who Should Care

What To Try In 7 Days

Generate 10–100 candidates per input from your current model (Grow).

Score candidates with your best automatic reward or metric (e.g., Metric X or BLEURT).

Filter top candidates (e.g., top 5–10% or threshold ≥0.8) and fine-tune with standard supervised NLL for several small steps (Improve). Monitor the reward and have humans rate a subset of outputs.
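The filtering step above can be sketched as follows. This version keeps the top fraction of candidates per input, so every input stays represented in the fine-tuning set; that per-input grouping is an assumption of this sketch, and a single global threshold is the other option the recipe mentions. The function name and tuple layout are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# (input, candidate output, reward); layout assumed for this sketch.
Scored = Tuple[str, str, float]


def keep_top_fraction(scored: List[Scored], fraction: float = 0.1) -> List[Scored]:
    """Keep the highest-reward `fraction` of candidates for each input (at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Group candidates by their input so filtering is done per input.
    by_input: Dict[str, List[Scored]] = defaultdict(list)
    for example in scored:
        by_input[example[0]].append(example)

    kept: List[Scored] = []
    for group in by_input.values():
        ranked = sorted(group, key=lambda ex: ex[2], reverse=True)
        k = max(1, math.ceil(fraction * len(ranked)))
        kept.extend(ranked[:k])
    return kept
```

With 10–100 candidates per input, `fraction=0.1` leaves roughly 1–10 training targets per input.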

Optimization Features

System Optimization

  • Decoupling generation and fine-tuning reduces repeated scoring and compute

Training Optimization

  • Offline data reuse (amortized Grow across Improve steps)
  • Progressive filtering schedule (increasing reward thresholds)

Inference Optimization

  • Best-of-N sampling used at inference to boost output quality
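Best-of-N sampling can be sketched in a few lines. The `sample` and `reward` callables are hypothetical stand-ins for your policy and reward model; they are not an API from the paper.

```python
from typing import Callable, List


def best_of_n(
    x: str,
    sample: Callable[[str, int], List[str]],  # hypothetical: draws n candidates for input x
    reward: Callable[[str, str], float],      # hypothetical: reward model score for (x, y)
    n: int,
) -> str:
    """Draw n candidates for x and return the one the reward model scores highest."""
    candidates = sample(x, n)
    if not candidates:
        raise ValueError("sample() returned no candidates")
    return max(candidates, key=lambda y: reward(x, y))
```

Inference cost grows linearly with N, since every candidate must be generated and scored.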

Reproducibility

Data Urls

  • IWSLT 2014 De-En (public dataset)
  • WMT 2020 Zh-En (public dataset)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on a good learned reward model; reward-model errors cause overfitting and misalignment.
  • Repeated Grow steps can push the policy away from original data and increase reward-model overfitting.
  • Internal Web Domain dataset used in some experiments is not public, limiting exact replication.

When Not To Use

  • You lack a robust reward model or reliable automatic metric.
  • Sampling outputs at scale is too expensive or slow for multiple Grow steps.
  • Tasks with highly stochastic rewards where threshold filtering favors high-variance outcomes.

Failure Modes

  • Overfitting to the learned reward model (reward hacking).
  • Loss of output diversity when aggressive filtering or RL reduces exploration.
  • Model collapse or forgetting original distribution if original data is not mixed during Improve.
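One possible guard against the last failure mode is to mix a share of the original supervised data into each Improve set. The sketch below is illustrative only: the helper name, the 25% default, and the sampling scheme are assumptions, not settings from the paper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_with_original(
    filtered: Sequence[T],
    original: Sequence[T],
    original_fraction: float = 0.25,  # assumed default; tune for your task
    seed: int = 0,
) -> List[T]:
    """Return a shuffled training set where about `original_fraction` is original data."""
    if not 0.0 <= original_fraction < 1.0:
        raise ValueError("original_fraction must be in [0, 1)")

    # Number of original examples needed to reach the target share of the mixed set.
    n_original = round(len(filtered) * original_fraction / (1.0 - original_fraction))
    n_original = min(n_original, len(original))

    rng = random.Random(seed)
    mixed = list(filtered) + rng.sample(list(original), n_original)
    rng.shuffle(mixed)
    return mixed
```

Comparing metrics on the original validation set with and without mixing is a simple way to check whether forgetting is actually occurring.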

Core Entities

Models

  • Transformer encoder-decoder
  • Behavioral cloning (BC) policy
  • BVMPO (offline V-MPO variant)
  • GOLD
  • Offline Actor Critic (OAC)
  • PPO (online baseline)
  • Reference-free reward model (Metric X)

Metrics

  • Average reward model score (normalized 0–100 or 0–1)
  • Human evaluation score (0–6 scale)
  • BLEU (reported as secondary check)

Datasets

  • IWSLT 2014 De→En
  • WMT 2020 Zh→En
  • Web Domain En→Zh (internal)

Benchmarks

  • Metric X (reference-free reward)
  • BLEURT
  • COMET