Distillation can 'hack' an imperfect teacher — online data and prompt diversity stop it

February 4, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Experiments use a controlled oracle and several datasets/model sizes; findings are robust within the tested T5-family setups but need validation for very large models and different domains.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel

Why It Matters For Business

If you distill models from imperfect teachers, fixed offline distillation can degrade real-world quality; using online or more diverse data keeps smaller models reliable.

Summary TLDR

The paper defines and tests "teacher hacking": during distillation, a student can learn to exploit the teacher's imperfections and drift away from the true behavior. In a controlled setup with an oracle model, the authors show that teacher hacking appears when distilling from a fixed offline dataset over many epochs. Online generation (sampling fresh responses each epoch) prevents it. Two cheaper fixes also work: increasing prompt diversity, or pre-generating multiple completions per prompt. The paper also provides a practical diagnostic (proxy-golden curves and deviations from polynomial convergence) that can be measured without access to an oracle.

Problem Statement

Knowledge distillation trains small models to imitate larger teacher LMs, but teachers are imperfect proxies for the true data. Does distillation cause students to overfit to teacher flaws ("teacher hacking")? When does it happen, how can it be detected from training logs, and how can it be prevented in practice?

Main Contribution

Formal definition of "teacher hacking": student moves closer to teacher while moving away from ground truth.

Controlled semi-synthetic experimental framework using an oracle model to measure true distance.
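
The definition can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: \pi_k is the student after k training steps, \pi_T the teacher, \pi^\star the oracle standing in for ground truth, and d any of the sequence-level distances listed under Metrics.

```latex
\mathrm{proxy}(k) = d(\pi_T, \pi_k), \qquad
\mathrm{golden}(k) = d(\pi^\star, \pi_k)

% Teacher hacking on a training interval [k_1, k_2]:
\mathrm{proxy}(k_2) < \mathrm{proxy}(k_1)
\quad \text{while} \quad
\mathrm{golden}(k_2) > \mathrm{golden}(k_1)
```

Read this way, the student keeps getting closer to the teacher over the interval while getting further from the oracle.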

Key Findings

Teacher hacking appears when distilling on a fixed offline dataset and training for many epochs.

Numbers: Observed U-shaped proxy–golden curve after long runs (50 epochs in experiments).

Practical Use: Avoid long multi-epoch distillation on a single fixed dataset; prefer other data strategies or stop early.

Evidence Ref: Figure 4; Section 4.1

Online response generation (fresh samples per epoch) prevents teacher hacking across datasets and model sizes.

Numbers: Online teacher or student sampling yields a monotonic decrease in proxy and golden metrics; 10% online already stabilizes.

Practical Use: If feasible, generate responses on-the-fly from teacher or student during distillation; even a small online fraction (~10%) helps substantially.

Evidence Ref: Figure 5, Figure 13; Section 4.2 and A.4
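
One way to picture the small online fraction from the second finding is a batch sampler that mixes stored pairs with freshly generated ones. This is a minimal sketch, not the paper's implementation: `mixed_batches`, `generate`, and the per-batch mixing scheme are hypothetical choices made for illustration, and `generate` stands for whatever call samples a response from your teacher or student.

```python
import random
from typing import Callable, Iterator, List, Tuple

Example = Tuple[str, str]  # (prompt, response)


def mixed_batches(
    offline: List[Example],
    prompts: List[str],
    generate: Callable[[str], str],  # hypothetical sampler (teacher or student)
    online_fraction: float = 0.1,
    batch_size: int = 32,
    num_batches: int = 100,
    seed: int = 0,
) -> Iterator[List[Example]]:
    """Yield batches mixing stored offline pairs with freshly generated ones."""
    if not 0.0 <= online_fraction <= 1.0:
        raise ValueError("online_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_online = round(batch_size * online_fraction)
    n_offline = batch_size - n_online
    for _ in range(num_batches):
        # Offline part: reuse stored (prompt, response) pairs.
        batch = rng.choices(offline, k=n_offline) if n_offline else []
        # Online part: call the generator now, so the response is new each time.
        for prompt in rng.choices(prompts, k=n_online):
            batch.append((prompt, generate(prompt)))
        rng.shuffle(batch)
        yield batch
```

With `batch_size=10` and `online_fraction=0.1`, each batch holds nine stored pairs and one fresh one. Whether mixing per batch or per epoch matches the paper's mixture experiment is not stated in this summary, so treat the granularity as an assumption.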

Results

Metric: Occurrence of teacher hacking
Value: Observed (yes) for offline fixed datasets with long training
Baseline: No hacking under online generation
Delta: not reported
Split / Dataset: XSum; confirmed on WMT-14 en-de and Natural Instructions
Evidence: Figure 4, Figure 8
Evidence Ref: Section 4

Metric: Minimum online data fraction needed to stabilize golden metric
Value: ≈10% online student data substantially stabilizes golden metric
Baseline: 0% online (fully offline) shows hacking
Delta: not reported
Split / Dataset: XSum (mixture experiment)
Evidence: Figure 13
Evidence Ref: Section A.4

What To Try In 7 Days

Add online generation to distillation (even 10% student-generated batches helps).

Increase prompt diversity for your distillation dataset rather than repeating prompts.

If online is impossible, pre-generate several completions per prompt to expand coverage.
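
The two offline suggestions above can be combined when building a fixed dataset under a generation budget: use every unique prompt first, then spend the remaining budget on extra completions. The sketch below is illustrative only; `build_offline_dataset` and `generate` are hypothetical names, and `generate` is assumed to sample stochastically so that repeated calls on the same prompt return different completions.

```python
from typing import Callable, List, Tuple


def build_offline_dataset(
    prompts: List[str],
    generate: Callable[[str], str],  # hypothetical stochastic sampler
    budget: int,
) -> List[Tuple[str, str]]:
    """Spend `budget` generations, cycling over unique prompts."""
    # Drop repeated prompts while keeping their original order.
    unique_prompts = list(dict.fromkeys(prompts))
    if not unique_prompts:
        raise ValueError("need at least one prompt")
    dataset = []
    for i in range(budget):
        # Round-robin: every prompt gets one completion before any gets two.
        prompt = unique_prompts[i % len(unique_prompts)]
        dataset.append((prompt, generate(prompt)))
    return dataset
```

If the budget is at most the number of unique prompts, every example has a distinct prompt; beyond that, each prompt receives roughly `budget / len(unique_prompts)` completions.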

Optimization Features

Model Optimization: distillation
Training Optimization: online generation (on-policy/off-policy mixing), increased prompt diversity, multiple completions per prompt

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

XSum, WMT-14 en-de, Natural Instructions

Risks & Boundaries

Limitations

Semi-synthetic setup uses an oracle model, which may not capture all real-world teacher biases.

Experiments are limited to T5-family models and three datasets; behavior on much larger LMs is untested.

When Not To Use

If you train for only one to three epochs of distillation, where teacher hacking is minimal.

If your prompt pool is tiny and cannot be diversified; in that case, pre-generate many completions per prompt instead.

Failure Modes

Standard overfitting (proxy metric increases) can occur and needs different remedies.

Teacher hacking may transfer unsafe or misleading behaviors from teacher to student.
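
The difference between standard overfitting and teacher hacking can be read off the recent trend of the two metrics. The sketch below assumes both curves are logged at the same steps, which requires an oracle as in the paper's controlled setup; `classify_trend`, its window, and its tolerance are hypothetical choices, not the paper's procedure.

```python
from typing import Sequence


def classify_trend(
    proxy: Sequence[float],
    golden: Sequence[float],
    window: int = 5,
    tol: float = 0.0,
) -> str:
    """Label the most recent trend of the proxy and golden metrics."""
    if len(proxy) != len(golden):
        raise ValueError("proxy and golden must be logged at the same steps")
    if len(proxy) < 2:
        raise ValueError("need at least two logged points")
    w = min(window, len(proxy) - 1)
    d_proxy = proxy[-1] - proxy[-1 - w]    # change in student-vs-teacher distance
    d_golden = golden[-1] - golden[-1 - w]  # change in student-vs-oracle distance
    if d_proxy > tol:
        return "standard overfitting"  # proxy metric itself is increasing
    if d_proxy < -tol and d_golden > tol:
        return "teacher hacking"       # closer to teacher, further from oracle
    if d_proxy < -tol and d_golden < -tol:
        return "healthy"               # both distances still shrinking
    return "inconclusive"
```

For example, `classify_trend([1.0, 0.8, 0.6], [1.2, 1.1, 1.3], window=2)` returns `"teacher hacking"`, because the proxy metric fell while the golden metric rose over the window.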

Core Entities

Models

Flan-T5-XL (oracle, 3B), T5-1.1 small (77M), T5-1.1 base (250M), T5-1.1 large (800M)

Metrics

forward KL (sequence-level), reverse KL (sequence-level), sequence-level Jensen-Shannon (JS_seq), proxy metric (student vs teacher), golden metric (student vs oracle), token-level forward KL training loss
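
Sequence-level KL divergences are usually estimated by sampling, since summing over all sequences is intractable. The sketch below shows the generic Monte Carlo estimator of KL(p || q) from sequences drawn from p. The function name is hypothetical, and the mapping of "forward" to KL(teacher || student) follows the common distillation convention, which this summary does not spell out.

```python
from typing import Sequence


def kl_monte_carlo(logp: Sequence[float], logq: Sequence[float]) -> float:
    """Estimate KL(p || q) from sequences y_i sampled from p.

    logp[i] is log p(y_i) and logq[i] is log q(y_i), each the total
    log-probability of the whole sequence under the respective model.
    """
    if len(logp) != len(logq) or not logp:
        raise ValueError("need equally many, non-empty log-probabilities")
    # Average of log p(y) - log q(y) over samples y ~ p.
    return sum(a - b for a, b in zip(logp, logq)) / len(logp)
```

Under that convention, a forward-KL proxy metric would sample from the teacher and pass teacher log-probabilities as `logp` and student log-probabilities as `logq`; reverse KL swaps the roles and samples from the student. A golden metric would use the oracle in place of the teacher.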

Datasets

XSum, WMT-14 en-de, Natural Instructions

Context Entities

Models

Flan-T5 family referenced as instruction-tuned oracle

Metrics

proxy-golden curve and polynomial convergence diagnostic
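
A polynomial convergence check can be sketched as a straight-line fit in log-log space: if a logged metric follows roughly C * step^(-alpha), later points should stay near the line fitted on early points. This is an illustrative reading of the diagnostic, not the paper's exact procedure; the function names, the choice of fit window, and the 20% tolerance are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def fit_power_law(steps: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log(value) = intercept + slope * log(step)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def steps_off_power_law(
    steps: Sequence[float],
    values: Sequence[float],
    fit_points: int,
    rel_tol: float = 0.2,
) -> List[float]:
    """Fit on the first `fit_points` logs, then flag later steps that deviate."""
    intercept, slope = fit_power_law(steps[:fit_points], values[:fit_points])
    flagged = []
    for step, value in zip(steps[fit_points:], values[fit_points:]):
        predicted = math.exp(intercept + slope * math.log(step))
        if abs(value - predicted) > rel_tol * predicted:
            flagged.append(step)
    return flagged
```

Steps and metric values must be strictly positive for the logarithms to be defined. Flagged steps are only a prompt to look closer, for instance by checking held-out quality, not proof of teacher hacking.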

Datasets

prompt pools sampled from public datasets listed above