Train a language model to follow feedback by conditioning on ranked model outputs and natural-language feedback

Overview

Decision Snapshot: Needs Validation

CoH is easy to implement (a standard autoregressive finetune) and shows consistent human-evaluation gains on open preference datasets; results are strongest at medium-to-large model sizes and are supported by both human judgments and automatic metrics.

Citations: 27

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Hao Liu, Carmelo Sferrazza, Pieter Abbeel

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CoH gives a cheaper path to human-aligned outputs: you can use existing human preference labels and a simple finetuning loop instead of training a reward model and running RL, speeding iteration and reducing engineering risk.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO, Founder

Summary TLDR

Chain of Hindsight (CoH) is a simple finetuning method that turns human preference labels into short natural-language feedback chains. During training the model is conditioned on one or more model outputs plus their feedback (e.g., “Bad: … Good: …”) and learns to generate improved outputs. On summarization and dialogue datasets CoH beats supervised finetuning variants and RLHF in automatic metrics and pairwise human judgments, scales better with model size, and can incorporate free-form language feedback or simple binary tokens.

Problem Statement

Existing ways to learn from human preferences either use only positively rated examples (supervised finetuning, SFT) or require RL optimization against a learned reward model (RLHF). Both have limits: SFT wastes negative data, and RLHF adds difficult reward-learning and RL optimization steps. The paper asks: can we leverage all preference data and natural-language feedback with a simple finetuning objective?

Main Contribution

Chain of Hindsight (CoH): a training recipe that concatenates model outputs with human feedback as an autoregressive input and finetunes with standard next-token loss.

Evidence that CoH learns from both positive and negative examples (using templated or free-form language feedback) and outperforms SFT variants and PPO-based RLHF in human evaluations on summarization and dialogue.
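The concatenation step can be sketched as follows. This is a minimal illustration, not the authors' code: `build_coh_sequence` is a hypothetical helper, whitespace splitting stands in for a real tokenizer, and restricting the loss to output tokens is an assumption of this sketch rather than something stated in this summary.

```python
def build_coh_sequence(prompt, chain):
    """Concatenate a prompt with (feedback, output) pairs into one sequence.

    Returns (tokens, loss_mask). loss_mask is 1 for tokens that belong to a
    model output and 0 for prompt and feedback tokens (an assumption of this
    sketch: only output tokens contribute to the next-token loss).
    """
    tokens, loss_mask = [], []

    def extend(text, train):
        words = text.split()  # stand-in for a real tokenizer
        tokens.extend(words)
        loss_mask.extend([1 if train else 0] * len(words))

    extend(prompt, train=False)
    for feedback, output in chain:
        extend(feedback, train=False)  # e.g. "Bad:" or "Good:"
        extend(output, train=True)
    return tokens, loss_mask


tokens, loss_mask = build_coh_sequence(
    "Summarize: cats sleep a lot.",
    [("Bad:", "Cats."), ("Good:", "Cats sleep a lot.")],
)
```

In a real training loop the mask would multiply the per-token cross-entropy, so the prompt and feedback markers act only as conditioning.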

Key Findings

CoH wins the majority of pairwise human comparisons on summarization versus the pretrained base model.

Numbers: CoH chosen 57.5% vs Base 19.9% (∆ +37.6 pp) on summarization human eval

Practical Use: If you replace a pretrained model with a CoH-finetuned one, human raters preferred CoH outputs far more often in summarization tests.

Evidence Ref: Table 1 (Summarization human eval average win rate)

CoH surpasses RLHF in human preference on summarization and dialogue.

Numbers: Summarization: CoH 45.3% vs RLHF 30.8% (∆ +14.5 pp); Dialogue average: CoH 36.9% vs RLHF 23.4% (∆ +13.5 pp)

Practical Use: You can often get better human-aligned outputs without training a reward model and running RL by using CoH finetuning.

Evidence Ref: Table 1 and Table 2 (average win rates vs RLHF)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Summarization human avg win rate vs pretrained base	CoH 57.5% | Base 19.9% | Tie 22.6%	Pretrained base	+37.6 pp (CoH vs Base)	TL;DR summarization validation	Pairwise human evaluation (accuracy, coherence, coverage averaged)	Table 1
Summarization human avg win rate vs RLHF	CoH 45.3% | RLHF 30.8% | Tie 24.0%	RLHF (PPO reward learning)	+14.5 pp (CoH vs RLHF)	TL;DR summarization validation	Pairwise human evaluation average	Table 1
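The Value and Delta columns are win, loss, and tie percentages from pairwise judgments, with the delta being the gap between the two win rates in percentage points. A minimal sketch of that arithmetic, using made-up toy judgments rather than the paper's rater data (`win_rates` and `delta_pp` are hypothetical helper names):

```python
from collections import Counter


def win_rates(judgments):
    """Percentages of 'a' wins, 'b' wins, and ties in pairwise judgments."""
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    total = len(judgments)
    return {k: 100.0 * counts.get(k, 0) / total for k in ("a", "b", "tie")}


def delta_pp(rates):
    """Win-rate gap of system 'a' over system 'b', in percentage points."""
    return round(rates["a"] - rates["b"], 1)


# Toy data only: 6 wins for system a, 2 for system b, 2 ties.
toy = ["a"] * 6 + ["b"] * 2 + ["tie"] * 2
rates = win_rates(toy)
gap = delta_pp(rates)
```

Reporting ties separately, as the table does, keeps the two win rates from being read as complementary.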

What To Try In 7 Days

Run a small CoH finetuning: condition a 1–6B decoder model on pairs of outputs + templated feedback and finetune with standard cross-entropy.

Use existing preference datasets (WebGPT, HH, summarize_from_feedback) and a small held-out human pairwise test to compare with SFT.

Start with templated feedback ('Bad:','Good:') before designing richer language feedback to keep engineering simple.
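For the data-preparation step, one possible starting point is to split each preference record into a shared context, a preferred response, and a rejected response before applying the 'Bad:'/'Good:' template. The sketch below assumes records shaped like those in hh-rlhf, with `chosen` and `rejected` strings that each end in a final "Assistant:" turn; `split_last_turn` and `record_to_triplet` are hypothetical helpers, and the final formatting line is only one way to lay out the template.

```python
def split_last_turn(dialogue, marker="\n\nAssistant:"):
    """Split a dialogue string into (context, final assistant response)."""
    idx = dialogue.rfind(marker)
    if idx < 0:
        raise ValueError("no assistant turn found")
    cut = idx + len(marker)
    return dialogue[:cut], dialogue[cut:].strip()


def record_to_triplet(record):
    """Return (context, good, bad) from a record with 'chosen'/'rejected'."""
    context, good = split_last_turn(record["chosen"])
    _, bad = split_last_turn(record["rejected"])
    return context, good, bad


# Hand-written toy record, not taken from any dataset.
record = {
    "chosen": "\n\nHuman: Hi\n\nAssistant: Hello! How can I help?",
    "rejected": "\n\nHuman: Hi\n\nAssistant: Go away.",
}
context, good, bad = record_to_triplet(record)
templated = f"{context} Bad: {bad} Good: {good}"
```

This sketch takes the context from the chosen side only; a stricter version would check that both sides share the same context and skip records where they differ.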

Agent Features

Memory

short-term conditioning on previous model outputs

Frameworks

conditional finetuning with feedback prefixes

Architectures

decoder-only transformer

Optimization Features

Training Optimization

mask 0–5% of past tokens to avoid copying

mix human feedback data with pretraining data as regularizer
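The random past-token masking could look roughly like this. It is one possible reading, not the authors' implementation: the sketch draws a masking rate uniformly between 0 and 5% for each sequence and then hides each past position independently at that rate. The function name, the per-sequence rate draw, and the boolean visibility convention are all assumptions.

```python
import random


def sample_past_token_mask(num_past_tokens, max_rate=0.05, rng=None):
    """Return a list of booleans, True where a past token stays visible.

    A masking rate is drawn uniformly from [0, max_rate] for the sequence,
    then each position is hidden independently with that probability.
    """
    rng = rng or random.Random()
    rate = rng.uniform(0.0, max_rate)
    return [rng.random() >= rate for _ in range(num_past_tokens)]


visible = sample_past_token_mask(1000, rng=random.Random(0))
hidden_count = visible.count(False)
```

Hidden positions would then be excluded from what the model can attend to (or replaced by a placeholder token), which makes verbatim copying from the conditioned examples less reliable.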

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/lhao499/chain-of-hindsight

Data URLs

https://huggingface.co/datasets/openai/webgpt_comparisons

https://huggingface.co/datasets/Anthropic/hh-rlhf

https://huggingface.co/datasets/openai/summarize_from_feedback

Risks & Boundaries

Limitations

CoH creates longer training sequences, which increases GPU/TPU memory use and compute cost.

Experiments rely on existing preference datasets and hired labelers; generalization to noisy or online feedback is untested.

When Not To Use

You have no preference data or only single positive examples.

You must use the smallest possible model; CoH showed limited benefit at small model sizes.

Failure Modes

The model can 'copy' conditioned examples instead of learning the task; the authors mitigate this with random past-token masking.

Overfitting to the specific feedback templates or dataset biases, producing brittle behavior outside training distribution.

Core Entities

Models

GPT-J-6B, OPT, Koala

Metrics

Pairwise human preference win rate (%), ROUGE (summarization), Accuracy

Datasets

WebGPT (webgpt_comparisons), Anthropic HH (hh-rlhf), Summarize_from_feedback (TL;DR filtered)

Benchmarks

TL;DR summarization, Anthropic HH dialogue, LM Evaluation Harness few-shot tasks
