Overview
SSR is straightforward to implement: LoRA fine-tuning, precomputed base-model outputs, and an LLM judge. The extra inference pass and the judge calls raise cost, but experiments across several datasets show consistent gains in preserving general skills.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
SSR lets you specialize a model for a task without erasing its existing skills, reducing risk when deploying fine-tuned LLMs across multiple use cases.
Summary TLDR
Selective Self-Rehearsal (SSR) is a simple fine-tuning recipe: run the base model on each training example, use an LLM judge to mark which model outputs are already acceptable, then fine-tune on the model's own outputs for those examples and on the gold labels for the rest. On content-grounded QA tasks, SSR matches or beats standard supervised fine-tuning (SFT) on the target task while preserving the base model's general skills. For example, SFT caused average drops of up to 16.7% on standard benchmarks; SSR held the drop to about 2% on the same tests (2.3% in the MD2D-trained setting reported in Table 6).
Problem Statement
Fine-tuning on gold labels often overfits and erases useful skills learned earlier. Many inputs admit multiple valid outputs, yet standard SFT forces the gold label and shifts the model away from its prior output distribution. This hurts generalization on other datasets and benchmarks.
Main Contribution
Introduce Selective Self-Rehearsal (SSR): fine-tune on model-generated outputs when they are judged acceptable, otherwise use gold labels.
Operationalize correctness with an LLM-as-a-judge to pick which training examples use model outputs.
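The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `Example`, `build_ssr_targets`, `toy_generate`, and `toy_judge` are hypothetical names, and the toy judge is a plain substring check standing in for a real LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    prompt: str
    gold: str


def build_ssr_targets(
    examples: List[Example],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> List[Tuple[str, str, str]]:
    """Return (prompt, target, source) rows for fine-tuning.

    The target is the base model's own output when the judge accepts it,
    and the gold label otherwise.
    """
    rows = []
    for ex in examples:
        candidate = generate(ex.prompt)  # base-model output, precomputed in practice
        if judge(ex.prompt, candidate, ex.gold):
            rows.append((ex.prompt, candidate, "model"))
        else:
            rows.append((ex.prompt, ex.gold, "gold"))
    return rows


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str) -> str:
    canned = {"Capital of France?": "It is Paris.", "2+2?": "5"}
    return canned.get(prompt, "")


def toy_judge(prompt: str, candidate: str, gold: str) -> bool:
    # Accept when the gold answer appears in the candidate text.
    return gold.lower() in candidate.lower()


data = [Example("Capital of France?", "Paris"), Example("2+2?", "4")]
targets = build_ssr_targets(data, toy_generate, toy_judge)
```

In this toy run the first example keeps the model's phrasing ("It is Paris.") and the second falls back to the gold label ("4"). Passing `generate` and `judge` as callables keeps the selection logic independent of whichever model and judge are actually used.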
Key Findings
SSR sharply reduces catastrophic forgetting on broad benchmarks compared to SFT.
SSR preserves answer quality (token-level recall) better than SFT both in-domain and out-of-domain.
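For readers unfamiliar with token-level recall, the sketch below shows one common way to compute it: the fraction of reference tokens that also appear in the prediction. The tokenization (lowercased word characters) and the function name `token_recall` are assumptions; the paper's exact metric, including its "modified recall" variant, may differ.

```python
import re
from collections import Counter


def token_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens recovered by the prediction (with multiplicity)."""
    pred = Counter(re.findall(r"\w+", prediction.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    total = sum(ref.values())
    if total == 0:
        return 0.0  # an empty reference has nothing to recall
    overlap = sum(min(count, pred[tok]) for tok, count in ref.items())
    return overlap / total


full = token_recall("Paris is the capital of France", "the capital is Paris")
partial = token_recall("Paris", "the capital is Paris")
```

Here `full` is 1.0 because every reference token appears in the prediction, while `partial` is 0.25 because only one of the four reference tokens is recovered.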
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Avg performance change vs base on standard benchmarks (MMLU, TruthfulQA, GSM8K, HellaSwag) | SFT -16.7%; SSR -2.3% (trained on MD2D) | base model prompt | -14.4 percentage points (SFT vs SSR) | Table 6 aggregated | SFT causes large avg drops across benchmarks; SSR keeps drops near 2% | Table 6 |
| Modified recall (quality + answerability) on NQ in-domain | SFT 71.2; SSR 74.7 | base model prompt (49.3 modified recall) | SSR +3.5 over SFT | NQ test set | SSR improves or matches SFT on overall recall while maintaining token recall | Table 3 |
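The Delta column can be reproduced directly from the Value and Baseline columns. The short check below uses only the numbers in the table; the variable names are illustrative.

```python
# Average change vs. the base model on standard benchmarks (Table 6, MD2D-trained).
sft_change = -16.7
ssr_change = -2.3
benchmark_gap = round(sft_change - ssr_change, 1)  # SFT relative to SSR

# Modified recall on the NQ test set (Table 3).
base_recall, sft_recall, ssr_recall = 49.3, 71.2, 74.7
ssr_over_sft = round(ssr_recall - sft_recall, 1)
ssr_over_base = round(ssr_recall - base_recall, 1)
sft_over_base = round(sft_recall - base_recall, 1)

print(benchmark_gap, ssr_over_sft, ssr_over_base, sft_over_base)
```

The first two values match the table's deltas (-14.4 and +3.5). The last two show that both fine-tuned variants sit more than 20 points above the prompted base model on NQ, so the SSR advantage over SFT is small next to the gain from fine-tuning at all.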
What To Try In 7 Days
Run your base model on your fine-tuning data and sample 1k outputs.
Use a strong LLM (or a small rule set) to mark which outputs are acceptable.
Fine-tune with LoRA: use model outputs for accepted cases and gold labels otherwise; validate on an unrelated benchmark to check forgetting.
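For the final validation step, a small helper can flag forgetting by comparing benchmark scores before and after fine-tuning. This is a sketch under stated assumptions: the names `relative_change` and `forgetting_report`, the 5% threshold, and the scores are all illustrative, and the choice of relative (rather than absolute) change is an assumption, since the summary does not say which one its percentages use.

```python
from typing import Dict


def relative_change(base_score: float, tuned_score: float) -> float:
    """Percent change of the tuned score relative to the base score."""
    return 100.0 * (tuned_score - base_score) / base_score


def forgetting_report(
    base_scores: Dict[str, float],
    tuned_scores: Dict[str, float],
    max_drop_pct: float = 5.0,  # illustrative threshold, not from the paper
) -> dict:
    """Per-benchmark change, the average change, and benchmarks that dropped too far."""
    changes = {
        name: relative_change(base_scores[name], tuned_scores[name])
        for name in base_scores
    }
    average = sum(changes.values()) / len(changes)
    flagged = sorted(name for name, change in changes.items() if change < -max_drop_pct)
    return {"per_benchmark": changes, "average": average, "flagged": flagged}


# Made-up scores for two hypothetical held-out benchmarks.
report = forgetting_report(
    base_scores={"bench_a": 50.0, "bench_b": 80.0},
    tuned_scores={"bench_a": 40.0, "bench_b": 80.0},
)
```

In this made-up example `bench_a` drops 20% and is flagged, `bench_b` is unchanged, and the average change is -10%. Running the same report for an SFT checkpoint and an SSR checkpoint gives a direct comparison of how much each forgets.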
Risks & Boundaries
Limitations
Requires running the base model on the entire training dataset, which adds significant inference cost.
Accuracy depends on the LLM-as-a-judge; judge errors can mislabel correct/incorrect outputs.
When Not To Use
When gold answers are unique and unambiguous (no valid alternative outputs).
When you cannot afford the extra inference and judge LLM costs.
Failure Modes
Lenient judge marks wrong model outputs as acceptable, causing the model to learn errors.
Strict judge marks valid model outputs as incorrect, reducing SSR to near-standard SFT and losing benefits.
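Both failure modes can be checked before training by auditing the judge on a small hand-labeled sample. The sketch below is one possible audit, not part of the paper's method: `audit_judge` is a hypothetical helper that takes the judge's accept/reject decisions and human decisions for the same outputs.

```python
from typing import Dict, List


def audit_judge(judge_accepts: List[bool], human_accepts: List[bool]) -> Dict[str, float]:
    """Compare judge decisions against human labels on the same model outputs."""
    if len(judge_accepts) != len(human_accepts):
        raise ValueError("judge and human label lists must have the same length")
    if not judge_accepts:
        raise ValueError("need at least one labeled output")

    pairs = list(zip(judge_accepts, human_accepts))
    n_good = sum(1 for _, h in pairs if h)       # outputs humans consider acceptable
    n_bad = len(pairs) - n_good                  # outputs humans consider wrong
    false_accepts = sum(1 for j, h in pairs if j and not h)  # lenient-judge errors
    false_rejects = sum(1 for j, h in pairs if h and not j)  # strict-judge errors

    return {
        "acceptance_rate": sum(judge_accepts) / len(pairs),
        "false_accept_rate": false_accepts / n_bad if n_bad else 0.0,
        "false_reject_rate": false_rejects / n_good if n_good else 0.0,
    }


stats = audit_judge(
    judge_accepts=[True, True, False, False],
    human_accepts=[True, False, True, False],
)
```

A high false-accept rate points to a lenient judge that would teach the model its own errors. A high false-reject rate, or an acceptance rate near zero, points to a strict judge under which SSR collapses toward plain SFT.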

