ReST improves model alignment with human preferences by reusing offline data, which avoids the compute cost of repeated online RL sampling while improving translation quality, including on human-rated outputs.
Key finding
Each additional Improve step raises the model's average reward on the validation set.
Numbers: Figure 3 shows steady increases across IWSLT, WMT, and Web Domain (rewards normalized to 0–100).
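
The effect of repeated Improve steps can be illustrated with a toy sketch. It is not the paper's implementation: the policy is assumed to be a plain categorical distribution over ten candidate outputs, the reward function and thresholds are made up, and "fine-tuning" is replaced by refitting the distribution to the filtered samples. The names `grow`, `improve`, and `expected_reward` are hypothetical. What it does show is the reuse pattern: one sampled dataset is filtered several times with a rising reward threshold, and the average reward rises with each pass.

```python
import random


def grow(policy, n_samples, rng):
    """Grow: sample an offline dataset once from the current policy."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=n_samples)


def improve(dataset, reward, threshold):
    """Improve: keep samples scoring at least `threshold`, then refit.

    Refitting to empirical frequencies is a stand-in for fine-tuning.
    Returns None if the filter leaves no samples.
    """
    kept = [y for y in dataset if reward(y) >= threshold]
    if not kept:
        return None
    counts = {}
    for y in kept:
        counts[y] = counts.get(y, 0) + 1
    return {y: c / len(kept) for y, c in counts.items()}


def expected_reward(policy, reward):
    """Average reward of outputs drawn from `policy`."""
    return sum(p * reward(y) for y, p in policy.items())


def reward(y):
    """Hypothetical reward: output y in 0..9 scores 10 * y."""
    return 10.0 * y


rng = random.Random(0)
policy = {y: 0.1 for y in range(10)}  # assumed uniform starting policy

# One Grow step produces the dataset that every Improve step reuses.
dataset = grow(policy, 1000, rng)

# Baseline: average reward of the unfiltered samples.
rewards = [sum(reward(y) for y in dataset) / len(dataset)]

# Several Improve steps on the same dataset with a rising threshold.
for threshold in (30, 60, 80):
    policy = improve(dataset, reward, threshold) or policy
    rewards.append(expected_reward(policy, reward))

print(rewards)
```

No new samples are drawn inside the loop, which is where the compute saving over online RL sampling comes from. In this toy the increase is guaranteed by construction, since each pass drops only the lowest-reward samples; for a real model the gain is an empirical result.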

