Fine-tune on the model's own correct answers to avoid forgetting and keep generality

Overview

Decision Snapshot: Needs Validation

SSR is straightforward to implement (LoRA, precomputed base-model outputs, and an LLM judge). It requires an extra inference pass over the training data plus a judge LLM, which raises cost, but experiments across several datasets show consistent gains in preserving general skills.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Sonam Gupta, Yatin Nandwani, Asaf Yehudai, Mayank Mishra, Gaurav Pandey, Dinesh Raghu, Sachindra Joshi

Why It Matters For Business

SSR lets you specialize a model for a task without erasing its existing skills, reducing risk when deploying fine-tuned LLMs across multiple use cases.

Who Should Care

ML Engineer, Product Manager, Founder

Summary TLDR

Selective Self-Rehearsal (SSR) is a simple fine-tuning recipe: run the base model on each training example, use an LLM judge to mark which model outputs are already acceptable, then fine-tune on the model's own outputs for those examples and on the gold labels for the rest. On content-grounded QA tasks, SSR matches or beats standard supervised fine-tuning (SFT) on the task while preserving the base model's general skills. For example, SFT caused average drops of up to 16.7% on standard benchmarks; SSR cut that drop to 2.3% on the same tests.

Problem Statement

Fine-tuning on gold labels often overfits and erases useful skills learned earlier. Many inputs admit multiple valid outputs, yet standard SFT forces the gold label and shifts the model away from its prior output distribution. This hurts generalization on other datasets and benchmarks.

Main Contribution

Introduce Selective Self-Rehearsal (SSR): fine-tune on model-generated outputs when they are judged acceptable, otherwise use gold labels.

Operationalize correctness with an LLM-as-a-judge to pick which training examples use model outputs.
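The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: `Example`, `TrainingPair`, `build_ssr_dataset`, `generate`, and `judge` are hypothetical names, and the toy stand-ins at the bottom exist only so the sketch runs without a model. In practice `generate` would call the base model and `judge` would call the judge LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    gold: str


@dataclass
class TrainingPair:
    prompt: str
    target: str
    source: str  # "model" if the base model's own output is reused, else "gold"


def build_ssr_dataset(
    examples: List[Example],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> List[TrainingPair]:
    """Pick the target per example: model output if judged acceptable, else gold."""
    pairs = []
    for ex in examples:
        # Base-model inference (hypothetical callable supplied by the caller).
        model_output = generate(ex.prompt)
        # LLM-as-a-judge verdict (hypothetical callable supplied by the caller).
        if judge(ex.prompt, ex.gold, model_output):
            pairs.append(TrainingPair(ex.prompt, model_output, "model"))
        else:
            pairs.append(TrainingPair(ex.prompt, ex.gold, "gold"))
    return pairs


# Toy stand-ins so the sketch runs without any model; replace with real calls.
def toy_generate(prompt: str) -> str:
    return {"Capital of France?": "It is Paris."}.get(prompt, "I don't know.")


def toy_judge(prompt: str, gold: str, output: str) -> bool:
    return gold.lower().rstrip(".") in output.lower()


data = [Example("Capital of France?", "Paris"), Example("Capital of Peru?", "Lima")]
ssr_pairs = build_ssr_dataset(data, toy_generate, toy_judge)
for pair in ssr_pairs:
    print(pair.source, "->", pair.target)
```

Keeping the `source` field on each pair makes it easy to report what fraction of the training set ended up using model outputs, which is a useful sanity check on how strict the judge is.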

Key Findings

SSR sharply reduces catastrophic forgetting on broad benchmarks compared to SFT.

Numbers: SFT avg drop -16.7% vs SSR -2.3% (trained on MD2D) on MMLU/TruthfulQA/GSM8k/Hellaswag

Practical Use: When fine-tuning for a specific task, use SSR to keep the model's general skills; you avoid large drops on downstream benchmarks.

Evidence Ref: Table 6

SSR preserves answer quality (token-level recall) better than SFT both in-domain and out-of-domain.

Numbers: NQ: Mod. Recall SFT=71.2 vs SSR=74.7; MD2D: SSR Mod. Recall 65.6 (Table 3)

Practical Use: If you care about maintaining generation quality on answerable queries, prefer SSR over standard SFT.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Avg performance change vs base on standard benchmarks (MMLU, TruthfulQA, GSM8k, Hellaswag)	SFT -16.7%; SSR -2.3% (trained on MD2D)	base model prompt	-14.4 percentage points (SFT vs SSR)	Table 6 aggregated	SFT causes large avg drops across benchmarks; SSR keeps drops near 2%	Table 6
Modified recall (quality + answerability) on NQ in-domain	SFT 71.2; SSR 74.7	base model prompt (49.3 mod. recall)	SSR +3.5 over SFT	NQ test set	SSR improves or matches SFT on overall recall while maintaining token recall	Table 3
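The deltas in the table follow directly from the reported values. The short check below only reuses figures already shown in the table (no new data) and makes the direction of each difference explicit; the variable names are illustrative.

```python
# Figures copied from the Results table; nothing here is new data.
sft_avg_change = -16.7  # SFT, average % change vs base (trained on MD2D)
ssr_avg_change = -2.3   # SSR, same setting

sft_mod_recall = 71.2   # NQ in-domain, modified recall
ssr_mod_recall = 74.7

# Positive values mean SSR is ahead of SFT.
benchmark_gap = round(ssr_avg_change - sft_avg_change, 1)
recall_gap = round(ssr_mod_recall - sft_mod_recall, 1)

print(f"General benchmarks: SSR is {benchmark_gap} points ahead of SFT")
print(f"NQ modified recall: SSR is {recall_gap} points ahead of SFT")
```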

What To Try In 7 Days

Run your base model on your fine-tuning data and sample 1k outputs.

Use a strong LLM (or small rule set) to mark which outputs are acceptable.

Fine-tune with LoRA: use model outputs for accepted cases and gold labels otherwise; validate on an unrelated benchmark to check forgetting.
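For the final validation step, one simple way to quantify forgetting is the average percent change of the fine-tuned model's scores relative to the base model across unrelated benchmarks. The sketch below assumes a relative (not absolute) change, which may differ from how the paper aggregates its numbers; `avg_relative_change` is a hypothetical helper, and the scores are made-up placeholders rather than results from the paper.

```python
from typing import Dict


def avg_relative_change(base: Dict[str, float], tuned: Dict[str, float]) -> float:
    """Mean percent change of tuned scores relative to base, over shared benchmarks."""
    names = sorted(set(base) & set(tuned))
    if not names:
        raise ValueError("no benchmarks in common")
    changes = [100.0 * (tuned[n] - base[n]) / base[n] for n in names]
    return sum(changes) / len(changes)


# Made-up placeholder scores purely to exercise the function; not paper results.
base_scores = {"bench_a": 50.0, "bench_b": 80.0}
tuned_scores = {"bench_a": 45.0, "bench_b": 76.0}

change = avg_relative_change(base_scores, tuned_scores)
print(f"average change vs base: {change:.1f}%")
```

Running the same check for an SFT checkpoint and an SSR checkpoint trained on the same data gives a like-for-like comparison of how much general skill each one lost.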

Optimization Features

Token Efficiency

Reduces need to replay instruction-tuning data

Infra Optimization

None; SSR adds pre-finetune inference and judge costs

System Optimization

Avoids auxiliary generative models for rehearsal

Training Optimization

Selective Self-Rehearsal (data selection to reuse model outputs), LoRA

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Requires running the base model on the entire training dataset, which adds significant inference cost.

Accuracy depends on the LLM-as-a-judge; judge errors can mislabel correct/incorrect outputs.

When Not To Use

When gold answers are unique and unambiguous (no valid alternative outputs).

When you cannot afford the extra inference and judge LLM costs.

Failure Modes

Lenient judge marks wrong model outputs as acceptable, causing the model to learn errors.

Strict judge marks valid model outputs as incorrect, reducing SSR to near-standard SFT and losing benefits.
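Both failure modes can be estimated before training by auditing the judge on a small hand-labeled sample. The sketch below is one possible check, not part of the paper: `judge_error_rates` is a hypothetical helper, and the audit sample is invented for illustration.

```python
from typing import List, Tuple


def judge_error_rates(labels: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """labels holds (human_says_acceptable, judge_says_acceptable) pairs.

    Returns (false_accept_rate, false_reject_rate):
      false accept = judge accepts an output the human rejected (lenient failure)
      false reject = judge rejects an output the human accepted (strict failure)
    """
    judged_on_bad = [j for h, j in labels if not h]
    judged_on_good = [j for h, j in labels if h]
    false_accept = sum(judged_on_bad) / len(judged_on_bad) if judged_on_bad else 0.0
    false_reject = (
        sum(not j for j in judged_on_good) / len(judged_on_good)
        if judged_on_good
        else 0.0
    )
    return false_accept, false_reject


# Hypothetical audit sample: (human verdict, judge verdict) per model output.
audit = [(True, True), (True, False), (True, True), (False, True), (False, False)]
far, frr = judge_error_rates(audit)
print(f"false accept rate: {far:.2f}, false reject rate: {frr:.2f}")
```

A high false accept rate points to the lenient-judge risk (the model is trained on its own errors), while a high false reject rate means most examples fall back to gold labels and the run behaves like standard SFT.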

Core Entities

Models

Mistral-instruct-v2-7B, Mistral-7B-Instruct-v0.2, Mixtral-8x7B (used as judge)

Metrics

token-level recall, modified recall, Accuracy, human relevance (Likert 0-4)
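As a rough illustration of the first metric, token-level recall is commonly computed as the fraction of reference tokens that also appear in the model's answer. The sketch below assumes lowercasing and whitespace tokenization; the paper's exact tokenization and normalization are not given here, so treat `token_recall` as a hypothetical stand-in.

```python
from collections import Counter


def token_recall(reference: str, prediction: str) -> float:
    """Fraction of reference tokens (with multiplicity) also present in the prediction."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    pred_counts = Counter(prediction.lower().split())
    # Count each reference token at most as often as it occurs in the prediction.
    overlap = sum(min(count, pred_counts[tok]) for tok, count in ref_counts.items())
    return overlap / len(ref_tokens)


score = token_recall("the cat sat on the mat", "a cat sat on a mat")
print(f"token recall: {score:.3f}")
```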

Datasets

MultiDoc2Dial (MD2D), Natural Questions (NQ) augmented, MuSiQue (augmented)

Benchmarks

MMLU, TruthfulQA, GSM8k, Hellaswag
