Fine-tune on the model's own correct answers to avoid forgetting and keep generality

September 7, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

SSR is straightforward to implement (LoRA + precompute base outputs + LLM judge). It needs extra inference and a judge LLM, so cost rises, but experiments across several datasets show consistent gains in preserving general skills.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Sonam Gupta, Yatin Nandwani, Asaf Yehudai, Mayank Mishra, Gaurav Pandey, Dinesh Raghu, Sachindra Joshi

Links

Abstract / PDF

Why It Matters For Business

SSR lets you specialize a model for a task without erasing its existing skills, reducing risk when deploying fine-tuned LLMs across multiple use cases.

Who Should Care

Summary TLDR

Selective Self-Rehearsal (SSR) is a simple fine-tuning recipe: run the base model on each training example, use an LLM judge to mark which model outputs are already acceptable, then fine-tune on the model's own outputs for those examples and on the gold labels for the rest. On content-grounded QA tasks, SSR matches or beats standard supervised fine-tuning (SFT) on the target task while preserving the base model's general skills. For example, SFT caused average drops of up to 16.7% on standard benchmarks; SSR trimmed that drop to about 2% on the same tests.
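The recipe above can be written compactly. The notation below is an assumption introduced for this summary, not the paper's own: $x_i$ is an input, $y_i$ its gold label, $\hat{y}_i$ the base model's output, and $J(x_i, y_i, \hat{y}_i) \in \{0, 1\}$ the judge's verdict (1 means acceptable).

```latex
\tilde{y}_i =
\begin{cases}
\hat{y}_i & \text{if } J(x_i, y_i, \hat{y}_i) = 1 \\
y_i       & \text{otherwise}
\end{cases}
\qquad
\mathcal{L}_{\mathrm{SSR}}(\theta) = -\sum_{i=1}^{N} \log p_\theta\left(\tilde{y}_i \mid x_i\right)
```

Under this reading, standard SFT is the special case where $J = 0$ for every example, so $\tilde{y}_i = y_i$ throughout.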

Problem Statement

Fine-tuning on gold labels often overfits and erases useful skills learned earlier. Many inputs admit multiple valid outputs, yet standard SFT forces the gold label and shifts the model away from its prior output distribution. This hurts generalization on other datasets and benchmarks.

Main Contribution

Introduce Selective Self-Rehearsal (SSR): fine-tune on model-generated outputs when they are judged acceptable, otherwise use gold labels.

Operationalize correctness with an LLM-as-a-judge to pick which training examples use model outputs.
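A minimal sketch of the selection step described above, assuming the generator and the judge are passed in as plain callables. The names `Example`, `TrainPair`, `build_ssr_dataset`, `toy_generate`, and `toy_judge` are hypothetical and exist only for illustration; in practice `generate` would call the base model and `judge` would call a judge LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    prompt: str
    gold: str


@dataclass
class TrainPair:
    prompt: str
    target: str
    source: str  # "model" if the base output was kept, "gold" otherwise


def build_ssr_dataset(
    examples: Iterable[Example],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> List[TrainPair]:
    """Pick the training target per example: model output if accepted, else gold."""
    pairs: List[TrainPair] = []
    for ex in examples:
        candidate = generate(ex.prompt)
        if judge(ex.prompt, ex.gold, candidate):
            pairs.append(TrainPair(ex.prompt, candidate, "model"))
        else:
            pairs.append(TrainPair(ex.prompt, ex.gold, "gold"))
    return pairs


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_generate(prompt: str) -> str:
    canned = {"Capital of France?": "It is Paris.", "2 + 2?": "5"}
    return canned.get(prompt, "")


def toy_judge(prompt: str, gold: str, candidate: str) -> bool:
    # Crude rule: accept when the gold string appears in the candidate.
    return gold.lower() in candidate.lower()


data = [Example("Capital of France?", "Paris"), Example("2 + 2?", "4")]
pairs = build_ssr_dataset(data, toy_generate, toy_judge)
```

Keeping the `source` field makes it easy to report what share of the training set ended up using model outputs, which is a useful sanity check on the judge.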

Key Findings

SSR sharply reduces catastrophic forgetting on broad benchmarks compared to SFT.

Numbers: SFT avg drop -16.7% vs SSR -2.3% (trained on MD2D) on MMLU/TruthfulQA/GSM8k/Hellaswag

Practical Use: When fine-tuning for a specific task, use SSR to keep the model's general skills; you avoid large drops on downstream benchmarks.

Evidence Ref: Table 6

SSR preserves answer quality (token-level recall) better than SFT both in-domain and out-of-domain.

Numbers: on NQ, modified recall is 71.2 for SFT vs 74.7 for SSR; on MD2D, SSR modified recall is 65.6 (Table 3)

Practical Use: If you care about maintaining generation quality on answerable queries, prefer SSR over standard SFT.

Evidence Ref: Table 3
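For readers unfamiliar with the metric, token-level recall can be sketched as the share of reference tokens that also appear in the prediction. The tokenizer here (lowercased word characters) is an assumption, and this summary does not define the "modified" variant, so only plain token recall is shown.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumed tokenization: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens (with multiplicity) found in the prediction."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, pred_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total
```

Because recall ignores extra tokens in the prediction, a longer but correct model answer is not penalized, which suits tasks where several phrasings are valid.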

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Avg performance change vs base on standard benchmarks (MMLU, TruthfulQA, GSM8k, Hellaswag) | SFT -16.7%; SSR -2.3% (trained on MD2D) | base model prompt | -14.4 percentage points (SFT vs SSR) | Table 6 aggregated | SFT causes large avg drops across benchmarks; SSR keeps drops near 2% | Table 6 |
| Modified recall (quality + answerability) on NQ in-domain | SFT 71.2; SSR 74.7 | base model prompt (49.3 mod. recall) | SSR +3.5 over SFT | NQ test set | SSR improves or matches SFT on overall recall while maintaining token recall | Table 3 |
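The deltas in the results follow directly from the reported values. The short check below recomputes them; the variable names are illustrative, and the gains over the base prompt are derived here from the reported numbers rather than quoted from the paper.

```python
# Reported average change vs base on the standard benchmarks (percent).
sft_avg_change = -16.7
ssr_avg_change = -2.3

# Reported modified recall on NQ.
base_recall, sft_recall, ssr_recall = 49.3, 71.2, 74.7

# Gap between SFT and SSR in percentage points.
forgetting_gap = round(sft_avg_change - ssr_avg_change, 1)

# SSR's margin over SFT, and each method's gain over the base prompt.
ssr_over_sft = round(ssr_recall - sft_recall, 1)
sft_gain_over_base = round(sft_recall - base_recall, 1)
ssr_gain_over_base = round(ssr_recall - base_recall, 1)
```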

What To Try In 7 Days

Run your base model on your fine-tuning data and sample 1k outputs.

Use a strong LLM (or small rule set) to mark which outputs are acceptable.

Fine-tune with LoRA: use model outputs for accepted cases and gold labels otherwise; validate on an unrelated benchmark to check forgetting.
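The last step, checking for forgetting on an unrelated benchmark, could look like the sketch below. It assumes you already have one score per benchmark for the base and the fine-tuned model; `forgetting_report` and the 5% threshold are hypothetical choices, and the summary does not say whether the paper's drops are relative or absolute, so relative change is an assumption here.

```python
from typing import Dict


def relative_change(base: float, tuned: float) -> float:
    """Percent change of the tuned score relative to the base score."""
    return 100.0 * (tuned - base) / base


def forgetting_report(
    base_scores: Dict[str, float],
    tuned_scores: Dict[str, float],
    max_avg_drop_pct: float = 5.0,  # assumed tolerance, tune to your needs
) -> Dict[str, object]:
    """Compare per-benchmark scores and flag an average drop beyond the tolerance."""
    changes = {
        name: relative_change(base_scores[name], tuned_scores[name])
        for name in base_scores
    }
    average = sum(changes.values()) / len(changes)
    return {
        "per_benchmark": changes,
        "average": average,
        "ok": average >= -max_avg_drop_pct,
    }
```

Running the same report for an SFT checkpoint and an SSR checkpoint trained on the same data gives a direct, like-for-like view of how much general skill each recipe costs.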

Optimization Features

Token Efficiency
Reduces need to replay instruction-tuning data
Infra Optimization
None; SSR adds pre-finetune inference and judge costs
System Optimization
Avoids auxiliary generative models for rehearsal
Training Optimization
Selective Self-Rehearsal (data selection to reuse model outputs); LoRA
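As a reminder of what LoRA changes during training, the standard formulation freezes the pretrained weight $W_0$ and learns a low-rank update. The rank $r$ and scaling $\alpha$ used in the paper are not stated in this summary, so they are left symbolic.

```latex
W = W_0 + \frac{\alpha}{r} B A,
\qquad
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $A$ and $B$ receive gradients, which keeps the number of trainable parameters small regardless of whether the targets come from gold labels or from the model's own outputs.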

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires running the base model on the entire training dataset, which adds significant inference cost.

Accuracy depends on the LLM-as-a-judge; judge errors can mislabel correct/incorrect outputs.
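To get a rough sense of the extra inference cost, a back-of-the-envelope token count may help. This is a sketch under stated assumptions: the judge is assumed to read the prompt, the gold answer, and the model output plus some fixed instruction overhead, and every default below is a placeholder rather than a measured value.

```python
from typing import Dict


def ssr_extra_tokens(
    num_examples: int,
    avg_prompt_tokens: int,
    avg_output_tokens: int,
    avg_gold_tokens: int,
    judge_overhead_tokens: int = 200,  # assumed judge instruction length
    judge_verdict_tokens: int = 5,     # assumed length of the judge's answer
) -> Dict[str, int]:
    """Estimate tokens processed before fine-tuning even starts."""
    # Base model pass: read the prompt, write an output.
    generation = num_examples * (avg_prompt_tokens + avg_output_tokens)
    # Judge pass: read instructions, prompt, gold, and output, then write a verdict.
    judging = num_examples * (
        judge_overhead_tokens
        + avg_prompt_tokens
        + avg_gold_tokens
        + avg_output_tokens
        + judge_verdict_tokens
    )
    return {
        "generation_tokens": generation,
        "judge_tokens": judging,
        "total_tokens": generation + judging,
    }
```

Multiplying the totals by your per-token serving cost for each model gives a first estimate to weigh against the forgetting SSR is meant to prevent.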

When Not To Use

When gold answers are unique and unambiguous (no valid alternative outputs).

When you cannot afford the extra inference and judge LLM costs.

Failure Modes

Lenient judge marks wrong model outputs as acceptable, causing the model to learn errors.

Strict judge marks valid model outputs as incorrect, reducing SSR to near-standard SFT and losing benefits.
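Both failure modes can be measured before training by comparing the judge with a small human-labeled sample. The sketch below assumes two parallel lists of accept/reject decisions; `judge_error_rates` is a hypothetical helper, not part of the paper.

```python
from typing import Dict, Sequence


def judge_error_rates(
    judge_accepts: Sequence[bool],
    human_accepts: Sequence[bool],
) -> Dict[str, float]:
    """Estimate how lenient or strict the judge is relative to human labels."""
    if len(judge_accepts) != len(human_accepts):
        raise ValueError("judge and human label lists must have the same length")
    if not judge_accepts:
        raise ValueError("need at least one labeled example")

    pairs = list(zip(judge_accepts, human_accepts))
    human_bad = sum(1 for _, human in pairs if not human)
    human_good = sum(1 for _, human in pairs if human)
    false_accepts = sum(1 for judge, human in pairs if judge and not human)
    false_rejects = sum(1 for judge, human in pairs if not judge and human)

    return {
        # Lenient judge: wrong outputs that would be used as training targets.
        "false_accept_rate": false_accepts / human_bad if human_bad else 0.0,
        # Strict judge: valid outputs replaced by gold, drifting back toward SFT.
        "false_reject_rate": false_rejects / human_good if human_good else 0.0,
        "judge_accept_rate": sum(1 for judge, _ in pairs if judge) / len(pairs),
    }
```

A high false-accept rate argues for a stricter judge prompt or a stronger judge model, while a high false-reject rate suggests the run will behave much like standard SFT.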

Core Entities

Models

Mistral-instruct-v2-7B, Mistral-7B-Instruct-v0.2, Mixtral-8x7B (used as judge)

Metrics

token-level recall, modified recall, accuracy, human relevance (Likert 0-4)

Datasets

MultiDoc2Dial (MD2D), Natural Questions (NQ) augmented, MuSiQue (augmented)

Benchmarks

MMLU, TruthfulQA, GSM8k, Hellaswag