Train a language model to follow feedback by conditioning on ranked model outputs and natural-language feedback

February 6, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

CoH is easy to implement (a standard autoregressive finetune) and shows consistent human-eval gains on open preference datasets. Results are strongest at medium-to-large model sizes and are supported by both human judgments and automatic metrics.

Citations: 27

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Hao Liu, Carmelo Sferrazza, Pieter Abbeel

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CoH gives a cheaper path to human-aligned outputs: you can use existing human preference labels and a simple finetuning loop instead of training a reward model and running RL, speeding iteration and reducing engineering risk.

Summary TLDR

Chain of Hindsight (CoH) is a simple finetuning method that turns human preference labels into short natural-language feedback chains. During training the model is conditioned on one or more model outputs plus their feedback (e.g., “Bad: … Good: …”) and learns to generate improved outputs. On summarization and dialogue datasets CoH beats supervised finetuning variants and RLHF in automatic metrics and pairwise human judgments, scales better with model size, and can incorporate free-form language feedback or simple binary tokens.

Problem Statement

Existing ways to learn from human preferences either use only positively rated examples (supervised finetuning, SFT) or require RL optimization over a learned reward (RLHF). Both have limits: SFT wastes negative data, and RLHF adds a difficult reward-learning stage followed by RL optimization. The paper asks: can we leverage all preference data and natural-language feedback with a simple finetuning objective?

Main Contribution

Chain of Hindsight (CoH): a training recipe that concatenates model outputs with human feedback as an autoregressive input and finetunes with standard next-token loss.

Evidence that CoH learns from both positive and negative examples (using templated or free-form language feedback) and outperforms SFT variants and PPO-based RLHF in human evaluations on summarization and dialogue.
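To make the concatenation step concrete, here is a minimal sketch of turning one preference pair into CoH-style training text. It assumes the templated "Bad:" / "Good:" tags mentioned in the summary; the `PreferencePair` class, the `build_coh_sequences` helper, the newline separators, and the choice to emit both orderings are illustrative assumptions, not the authors' exact format.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human comparison: a prompt and two model outputs (hypothetical schema)."""
    prompt: str
    preferred: str
    rejected: str


def build_coh_sequences(pair, good_tag="Good:", bad_tag="Bad:"):
    """Concatenate both outputs with templated feedback into plain training text.

    Assumption: each pair yields two sequences, one per ordering, so the
    model sees the rejected output as context as well as the preferred one.
    """
    worse_first = (
        f"{pair.prompt}\n{bad_tag} {pair.rejected}\n{good_tag} {pair.preferred}"
    )
    better_first = (
        f"{pair.prompt}\n{good_tag} {pair.preferred}\n{bad_tag} {pair.rejected}"
    )
    return [worse_first, better_first]


if __name__ == "__main__":
    example = PreferencePair(
        prompt="Summarize: the cat sat on the mat all day.",
        preferred="A cat sat on a mat all day.",
        rejected="Dogs bark.",
    )
    for text in build_coh_sequences(example):
        print(text, end="\n\n")
```

The resulting strings would then be tokenized and trained on with the usual next-token loss, which is what keeps the recipe close to ordinary finetuning.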

Key Findings

CoH wins the majority of pairwise human comparisons on summarization versus the pretrained base model.

Numbers: CoH chosen 57.5% vs Base 19.9% (∆ +37.6 pp) on summarization human eval

Practical Use: If you replace a pretrained model with a CoH-finetuned one, human raters preferred CoH outputs far more often in summarization tests.

Evidence Ref: Table 1 (Summarization human eval average win rate)

CoH surpasses RLHF in human preference on summarization and dialogue.

Numbers: Summarization: CoH 45.3% vs RLHF 30.8% (∆ +14.5 pp); Dialogue average: CoH 36.9% vs RLHF 23.4% (∆ +13.5 pp)

Practical Use: You can often get better human-aligned outputs without training a reward model and running RL by using CoH finetuning.

Evidence Ref: Table 1 and Table 2 (average win rates vs RLHF)

Results

Metric: Summarization human avg win rate vs pretrained base
Value: CoH 57.5% | Base 19.9% | Tie 22.6%
Baseline: Pretrained base
Delta: +37.6 pp (CoH vs Base)
Split / Dataset: TL;DR summarization validation
Evidence: Pairwise human evaluation (accuracy, coherence, coverage averaged)
Evidence Ref: Table 1

Metric: Summarization human avg win rate vs RLHF
Value: CoH 45.3% | RLHF 30.8% | Tie 24.0%
Baseline: RLHF (PPO reward learning)
Delta: +14.5 pp (CoH vs RLHF)
Split / Dataset: TL;DR summarization validation
Evidence: Pairwise human evaluation average
Evidence Ref: Table 1
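The reported deltas are percentage-point differences between two win rates, with ties counted separately. The short sketch below reproduces that arithmetic from the figures quoted in this summary. The helper names are hypothetical, and `win_share_excluding_ties` is an additional view added here for illustration, not a number reported in the source.

```python
def pp_delta(treatment_pct, baseline_pct):
    """Percentage-point difference between two win rates."""
    return round(treatment_pct - baseline_pct, 1)


def win_share_excluding_ties(win_pct, loss_pct):
    """Share of decided (non-tie) comparisons won by the treatment."""
    return win_pct / (win_pct + loss_pct)


if __name__ == "__main__":
    # Figures quoted in this summary (Table 1 and Table 2 of the paper).
    comparisons = {
        "summarization, CoH vs base": (57.5, 19.9),
        "summarization, CoH vs RLHF": (45.3, 30.8),
        "dialogue average, CoH vs RLHF": (36.9, 23.4),
    }
    for name, (coh, other) in comparisons.items():
        share = win_share_excluding_ties(coh, other)
        print(f"{name}: delta {pp_delta(coh, other):+.1f} pp, "
              f"non-tie win share {share:.1%}")
```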

What To Try In 7 Days

Run a small CoH finetuning job: condition a 1–6B-parameter decoder-only model on pairs of outputs plus templated feedback, and finetune with the standard cross-entropy (next-token) loss.

Use existing preference datasets (WebGPT, HH, summarize_from_feedback) and a small held-out human pairwise test to compare with SFT.

Start with templated feedback ('Bad:', 'Good:') before designing richer language feedback, to keep the engineering simple.
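For the small held-out pairwise test, one way to summarize rater judgments is a win/loss/tie tally plus a two-sided sign test on the decided comparisons. This is a sketch under stated assumptions: the labels "a", "b", and "tie" and both helper names are hypothetical, and the sign test is a generic statistical check chosen here, not the evaluation protocol used in the paper.

```python
from collections import Counter
from math import comb


def tally(judgments):
    """Return win/loss/tie percentages for labels 'a', 'b', or 'tie'."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to tally")
    return {label: 100.0 * counts.get(label, 0) / total
            for label in ("a", "b", "tie")}


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test; ties are dropped before calling this.

    Null hypothesis: each decided comparison is a fair coin flip.
    """
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical ratings: 'a' = CoH model preferred, 'b' = SFT model preferred.
    ratings = ["a"] * 6 + ["b"] * 2 + ["tie"] * 2
    print(tally(ratings))
    print(sign_test_p_value(wins=6, losses=2))
```

With only a handful of comparisons the p-value stays large even when one model wins most of them, which is a useful reminder to collect enough pairs before drawing conclusions.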

Agent Features

Memory: short-term conditioning on previous model outputs
Frameworks: conditional finetuning with feedback prefixes
Architectures: decoder-only transformer

Optimization Features

Training Optimization
mask 0–5% of past tokens to avoid copying
mix human feedback data with pretraining data as a regularizer
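Both training tricks can be sketched in a few lines. The version below assumes that "0–5%" means a masking rate drawn uniformly from that range for each sequence, and that mixing means sampling each batch item from the pretraining corpus with some fixed probability. Both readings, along with the function names, `mask_id`, and `pretrain_fraction`, are assumptions for illustration rather than details confirmed by the source.

```python
import random


def mask_past_tokens(token_ids, mask_id, max_rate=0.05, rng=None):
    """Replace a small random fraction of context tokens with mask_id.

    Assumption: the per-sequence masking rate is drawn uniformly
    from [0, max_rate].
    """
    rng = rng or random.Random()
    rate = rng.uniform(0.0, max_rate)
    return [mask_id if rng.random() < rate else tok for tok in token_ids]


def sample_mixed_batch(feedback_examples, pretrain_examples, batch_size,
                       pretrain_fraction, rng=None):
    """Draw a batch that mixes feedback data with pretraining data.

    Assumption: each slot independently comes from the pretraining set
    with probability pretrain_fraction.
    """
    rng = rng or random.Random()
    return [
        rng.choice(pretrain_examples)
        if rng.random() < pretrain_fraction
        else rng.choice(feedback_examples)
        for _ in range(batch_size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 140))
    print(mask_past_tokens(tokens, mask_id=0, rng=rng))
    print(sample_mixed_batch(["coh_1", "coh_2"], ["web_1", "web_2"],
                             batch_size=8, pretrain_fraction=0.25, rng=rng))
```

Masking the context rather than the target is meant to make verbatim copying from the conditioned outputs less reliable, so the model has to rely on the feedback signal instead.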

Risks & Boundaries

Limitations

CoH creates longer training sequences, which increases GPU/TPU memory use and compute.

Experiments rely on existing preference datasets and hired labelers; generalization to noisy or online feedback is untested.

When Not To Use

You have no preference data or only single positive examples.

You need the smallest possible model; CoH showed limited benefit at small model sizes.

Failure Modes

The model can copy the conditioned examples instead of learning the task; the authors mitigate this with random past-token masking.

Overfitting to the specific feedback templates or dataset biases, producing brittle behavior outside training distribution.

Core Entities

Models

GPT-J-6B, OPT, Koala

Metrics

Pairwise human preference win rate (%), ROUGE (summarization), Accuracy

Datasets

WebGPT (webgpt_comparisons), Anthropic HH (hh-rlhf), Summarize_from_feedback (TL;DR filtered)

Benchmarks

TL;DR summarization, Anthropic HH dialogue, LM Evaluation Harness few-shot tasks