Train a language model to follow feedback by conditioning on ranked model outputs and natural-language feedback

February 6, 2023 · 8 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

27

Authors

Hao Liu, Carmelo Sferrazza, Pieter Abbeel

Why It Matters For Business

CoH gives a cheaper path to human-aligned outputs: you can use existing human preference labels and a simple finetuning loop instead of training a reward model and running RL, speeding iteration and reducing engineering risk.

Summary TLDR

Chain of Hindsight (CoH) is a simple finetuning method that turns human preference labels into short natural-language feedback chains. During training the model is conditioned on one or more model outputs plus their feedback (e.g., “Bad: … Good: …”) and learns to generate improved outputs. On summarization and dialogue datasets CoH beats supervised finetuning variants and RLHF in automatic metrics and pairwise human judgments, scales better with model size, and can incorporate free-form language feedback or simple binary tokens.
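
To make the conditioning format concrete, here is a minimal sketch of how one preference pair could be turned into a single training string. The `PreferencePair` class, both function names, and the exact layout (prompt, then the rejected output, then the preferred one) are assumptions for illustration; the paper's actual template may differ. Prompting with only the positive tag at generation time is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human comparison: a prompt plus a preferred and a rejected output."""
    prompt: str
    preferred: str
    rejected: str


def build_coh_sequence(pair: PreferencePair,
                       bad_tag: str = "Bad:",
                       good_tag: str = "Good:") -> str:
    """Concatenate both outputs, each placed behind its feedback tag.

    Assumed layout for illustration: prompt, "Bad: <rejected>",
    then "Good: <preferred>".
    """
    return (
        f"{pair.prompt}\n"
        f"{bad_tag} {pair.rejected}\n"
        f"{good_tag} {pair.preferred}"
    )


def build_inference_prompt(prompt: str, good_tag: str = "Good:") -> str:
    """Assumed usage at generation time: condition on the positive tag only."""
    return f"{prompt}\n{good_tag}"


# Made-up example data, not taken from any dataset.
example = PreferencePair(
    prompt="Summarize: The meeting moved from Monday to Wednesday at 10am.",
    preferred="The meeting is now Wednesday at 10am.",
    rejected="There is a meeting.",
)
print(build_coh_sequence(example))
```

Placing the rejected output first means the preferred output is always generated with the weaker attempt in context, which is one simple way to expose the model to both sides of a comparison.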

Problem Statement

Existing ways to learn from human preferences either use only positively rated examples (supervised finetuning, SFT) or require RL optimization against a learned reward (RLHF). Both have limits: SFT discards the negative examples, while RLHF adds two difficult steps, reward-model learning and RL optimization. The paper asks: can we use all preference data, including natural-language feedback, with a simple finetuning objective?

Main Contribution

Chain of Hindsight (CoH): a training recipe that concatenates model outputs with human feedback as an autoregressive input and finetunes with standard next-token loss.

Evidence that CoH learns from both positive and negative examples (using templated or free-form language feedback) and outperforms SFT variants and PPO-based RLHF in human evaluations of summarization and dialogue.

Empirical analyses: human pairwise comparisons, automatic ROUGE scores, ablation on language feedback, and scaling trends across model sizes.
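
The "standard next-token loss" in the first contribution can be sketched in plain Python as a mean cross-entropy over shifted targets. The function names are hypothetical. The optional `target_mask`, which restricts the loss to chosen positions (for example, only model-output tokens), is an assumption added for illustration; with the default mask this is ordinary next-token cross-entropy over the whole concatenated sequence.

```python
import math
from typing import List, Optional, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def next_token_loss(logits: Sequence[Sequence[float]],
                    token_ids: Sequence[int],
                    target_mask: Optional[Sequence[int]] = None) -> float:
    """Mean cross-entropy where logits[t] predicts token_ids[t + 1].

    target_mask[i] == 1 means token i counts as a prediction target.
    The mask is an illustrative assumption; by default every token counts.
    """
    n = len(token_ids)
    if target_mask is None:
        target_mask = [1] * n
    total, count = 0.0, 0
    for t in range(n - 1):
        if target_mask[t + 1]:
            total -= log_softmax(logits[t])[token_ids[t + 1]]
            count += 1
    return total / max(count, 1)


# Toy vocabulary of 3 tokens and a 3-token sequence (made-up numbers).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_ids = [0, 0, 2]
print(next_token_loss(toy_logits, toy_ids))
print(next_token_loss(toy_logits, toy_ids, target_mask=[0, 0, 1]))
```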

Key Findings

CoH wins the majority of pairwise human comparisons on summarization versus the pretrained base model.

Numbers: CoH chosen 57.5% vs Base 19.9% (∆ +37.6 pp) on summarization human eval

CoH surpasses RLHF in human preference on summarization and dialogue.

Numbers: Summarization: CoH 45.3% vs RLHF 30.8% (∆ +14.5 pp); Dialogue average: CoH 36.9% vs RLHF 23.4% (∆ +13.5 pp)

Natural-language feedback helps modestly.

Numbers: CoH with language feedback 45.3% vs CoH w/o LF 42.4% (summarization human avg)

CoH scales better with model size: small models may not benefit, larger models benefit more.

Numbers: The scaling trend in Figure 5 shows CoH underperforming at small model sizes but outperforming SFT and RLHF at larger sizes

CoH reduces alignment tax on few-shot tasks compared to SFT.

Numbers: Average few-shot benchmark: CoH 40.95–43.14 vs SFT 40.54–42.53 (small improvements in several settings)

Results

Summarization human avg win rate vs pretrained base

Value: CoH 57.5% | Base 19.9% | Tie 22.6%

Baseline: Pretrained base

Summarization human avg win rate vs RLHF

Value: CoH 45.3% | RLHF 30.8% | Tie 24.0%

Baseline: RLHF (PPO against a learned reward model)

Dialogue human avg win rate vs pretrained base

Value: CoH 49.5% | Base 15.2% | Tie 35.3%

Baseline: Pretrained base

Ablation: language feedback effect on summarization

Value: CoH w/ LF 45.3% | CoH w/o LF 42.4% | RLHF 30.8%

Baseline: CoH w/o LF

Automatic summarization (ROUGE)

Value: CoH scores higher than both RLHF and SFT across ROUGE metrics

Baseline: SFT and RLHF
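
As a quick arithmetic check on the win rates reported in this section, the sketch below recomputes each margin in percentage points and, as one extra view, CoH's share of comparisons that were not ties. The helper names are hypothetical, and the share-of-decided figure is a derived quantity, not one reported by the paper.

```python
# Win rates (percent) copied from the results listed here: (CoH, baseline, tie).
reported = {
    "summarization vs base": (57.5, 19.9, 22.6),
    "summarization vs RLHF": (45.3, 30.8, 24.0),
    "dialogue vs base": (49.5, 15.2, 35.3),
}


def margin_pp(coh: float, baseline: float) -> float:
    """Difference between the two win rates, in percentage points."""
    return round(coh - baseline, 1)


def share_of_decided(coh: float, baseline: float) -> float:
    """CoH's share (percent) of comparisons where a judge picked a side."""
    return round(100 * coh / (coh + baseline), 1)


for name, (coh, base, tie) in reported.items():
    print(f"{name}: margin {margin_pp(coh, base):+.1f} pp, "
          f"share of decided comparisons {share_of_decided(coh, base):.1f}%")
```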

Who Should Care

What To Try In 7 Days

Run a small CoH finetuning experiment: condition a 1–6B-parameter decoder-only model on pairs of outputs plus templated feedback, and finetune with standard cross-entropy loss.

Use existing preference datasets (WebGPT, HH, summarize_from_feedback) and a small held-out human pairwise test to compare with SFT.

Start with templated feedback ('Bad:', 'Good:') before designing richer language feedback, to keep engineering simple.
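
For the small held-out pairwise test suggested above, a tally like the following may be enough to start. It is a minimal sketch: the labels `"coh"`, `"sft"`, and `"tie"` and the function name are assumptions, and the judgments in the example are made up.

```python
from collections import Counter
from typing import Dict, Iterable

LABELS = ("coh", "sft", "tie")  # hypothetical labels for one judge decision


def pairwise_win_rates(judgments: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of judgments that fell on each label."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to tally")
    return {label: 100 * counts[label] / total for label in LABELS}


# Made-up judgments for 20 held-out prompts.
sample = ["coh"] * 11 + ["sft"] * 6 + ["tie"] * 3
print(pairwise_win_rates(sample))
```

Reporting ties separately, as the paper's tables do, avoids overstating either model's win rate when judges often see no difference.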

Agent Features

Memory

  • short-term conditioning on previous model outputs

Frameworks

  • conditional finetuning with feedback prefixes

Architectures

  • decoder-only transformer

Optimization Features

Training Optimization

  • mask 0–5% of past tokens to avoid copying
  • mix human feedback data with pretraining data as regularizer
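
The two training tricks above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: masking is done here by replacing tokens with a placeholder id (`MASK_ID` is hypothetical), the per-sequence rate is drawn uniformly from 0 to 5%, and the `pretrain_fraction` mixing ratio is an invented parameter since the summary gives no ratio.

```python
import random
from typing import List, Optional, Sequence

MASK_ID = 0  # hypothetical placeholder token id; real tokenizers differ


def mask_past_tokens(token_ids: Sequence[int],
                     max_rate: float = 0.05,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Replace a random 0..max_rate fraction of tokens with MASK_ID.

    A rate is drawn once per sequence, then each token is masked
    independently with that probability (an assumed scheme).
    """
    if rng is None:
        rng = random.Random()
    rate = rng.uniform(0.0, max_rate)
    return [MASK_ID if rng.random() < rate else tok for tok in token_ids]


def mix_batch(feedback_examples: Sequence[str],
              pretrain_examples: Sequence[str],
              batch_size: int = 8,
              pretrain_fraction: float = 0.5,
              rng: Optional[random.Random] = None) -> List[str]:
    """Sample one batch mixing feedback data with pretraining data.

    pretrain_fraction is an illustrative knob, not a value from the paper.
    Sampling is with replacement so small pools do not raise errors.
    """
    if rng is None:
        rng = random.Random()
    n_pretrain = round(batch_size * pretrain_fraction)
    batch = rng.choices(pretrain_examples, k=n_pretrain)
    batch += rng.choices(feedback_examples, k=batch_size - n_pretrain)
    rng.shuffle(batch)
    return batch


tokens = list(range(1, 1001))
masked = mask_past_tokens(tokens, rng=random.Random(0))
print(sum(m != t for m, t in zip(masked, tokens)), "of", len(tokens), "tokens masked")
```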

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • CoH creates longer training sequences, which increase GPU/TPU memory use and compute.
  • Experiments rely on existing preference datasets and hired labelers; generalization to noisy or online feedback is untested.
  • Templated feedback was used in most experiments; gains from richer human-written feedback are modest in reported ablations.
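
To get a rough feel for the sequence-length cost named in the first limitation, the sketch below compares token counts for an SFT example (prompt plus one output) and a CoH example (prompt plus two outputs and a few template tokens). All lengths are hypothetical. The pair count uses the fact that causal self-attention over n tokens scores n(n+1)/2 query-key pairs; real memory and compute also depend on many other factors.

```python
from typing import Dict, Sequence


def sequence_cost(prompt_len: int,
                  output_lens: Sequence[int],
                  template_tokens: int = 0) -> Dict[str, int]:
    """Token count and causal-attention pair count for one training sequence."""
    n = prompt_len + sum(output_lens) + template_tokens
    return {"tokens": n, "causal_attention_pairs": n * (n + 1) // 2}


# Hypothetical lengths: a 400-token post with 50-token summaries.
sft = sequence_cost(prompt_len=400, output_lens=[50])
coh = sequence_cost(prompt_len=400, output_lens=[50, 50], template_tokens=4)

print("SFT:", sft)
print("CoH:", coh)
print("token ratio:", round(coh["tokens"] / sft["tokens"], 2))
```

When prompts are long relative to outputs, as in this made-up case, the overhead is modest; it grows when outputs dominate the sequence or when more than two outputs are chained.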

When Not To Use

  • You have no preference data or only single positive examples.
  • You need the smallest possible model; CoH showed limited benefit at small model sizes.
  • You lack compute budget for longer input sequences during finetuning.

Failure Modes

  • The model can 'copy' conditioned examples instead of learning the task; the authors mitigate this with random past-token masking.
  • Overfitting to the specific feedback templates or dataset biases, producing brittle behavior outside training distribution.
  • Human judge bias in pairwise comparisons can inflate perceived gains if the labeler pool is not representative.

Core Entities

Models

  • GPT-J-6B
  • OPT
  • Koala

Metrics

  • Pairwise human preference win rate (%)
  • ROUGE (summarization)
  • Accuracy

Datasets

  • WebGPT (webgpt_comparisons)
  • Anthropic HH (hh-rlhf)
  • Summarize_from_feedback (TL;DR filtered)

Benchmarks

  • TL;DR summarization
  • Anthropic HH dialogue
  • LM Evaluation Harness few-shot tasks