Distillation can 'hack' an imperfect teacher — online data and prompt diversity stop it

February 4, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel

Why It Matters For Business

If you distill models from imperfect teachers, fixed offline distillation can degrade real-world quality; using online or more diverse data keeps smaller models reliable.

Summary TLDR

The paper defines and tests "teacher hacking": during distillation, a student can learn to exploit the teacher's imperfections and drift away from the true behavior. In a controlled setup with an oracle model, the authors show that teacher hacking appears when distilling from a fixed offline dataset over multiple epochs. Online generation (sampling fresh responses each epoch) prevents it. Two cheaper fixes also work: increasing prompt diversity or pre-generating multiple completions per prompt. The paper also provides a practical diagnostic: alongside the proxy-golden curves used in the controlled setup, deviations of the proxy metric from polynomial convergence signal hacking and can be measured without access to an oracle.

Problem Statement

Knowledge distillation trains small models to imitate larger teacher LMs, but teachers are imperfect proxies for the true data distribution. Does distillation cause students to overfit to teacher flaws ("teacher hacking")? When does it happen, how can it be detected from training logs, and how can it be prevented in practice?

Main Contribution

Formal definition of "teacher hacking": student moves closer to teacher while moving away from ground truth.

Controlled semi-synthetic experimental framework using an oracle model to measure true distance.

Empirical finding that teacher hacking appears with fixed offline datasets and long multi-epoch training.

Practical mitigations: online response generation, increasing prompt diversity, or multiple offline completions.

A measurable diagnostic: detect hacking by deviations from polynomial (power-law) convergence in the proxy metric.
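The definition above (the student moves closer to the teacher while moving away from ground truth) can be checked directly when both curves are logged, as in the controlled oracle setup. The following is a minimal sketch, not the paper's code: the function name `first_hacking_step` and the `patience` parameter are hypothetical, and it assumes lower values mean "closer" for both metrics.

```python
from typing import Optional, Sequence


def first_hacking_step(
    proxy: Sequence[float],
    golden: Sequence[float],
    patience: int = 2,
) -> Optional[int]:
    """Return the first checkpoint index that starts a run of `patience`
    consecutive checkpoints where the proxy metric (student vs teacher)
    decreases while the golden metric (student vs oracle) increases.
    Returns None if no such run exists.
    """
    if len(proxy) != len(golden):
        raise ValueError("proxy and golden must have the same length")
    run = 0
    for i in range(1, len(proxy)):
        closer_to_teacher = proxy[i] < proxy[i - 1]
        farther_from_truth = golden[i] > golden[i - 1]
        run = run + 1 if (closer_to_teacher and farther_from_truth) else 0
        if run >= patience:
            return i - patience + 1
    return None


# Illustrative, made-up curves (not results from the paper).
proxy_curve = [1.0, 0.8, 0.6, 0.5, 0.45, 0.4]
golden_curve = [1.2, 1.0, 0.9, 0.95, 1.0, 1.1]
hacking_start = first_hacking_step(proxy_curve, golden_curve)
```

The `patience` window is only there to avoid flagging a single noisy checkpoint; any real threshold would need tuning on actual logs.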

Key Findings

Teacher hacking appears when distilling on a fixed offline dataset and training for many epochs.

Numbers: Observed a U-shaped proxy–golden curve after long runs (50 epochs in the experiments).

Online response generation (fresh samples per epoch) prevents teacher hacking across datasets and model sizes.

Numbers: Online teacher or student sampling yields a monotonic decrease in both proxy and golden metrics; mixing in as little as 10% online data already stabilizes the golden metric.

Dataset diversity and generation budget control hacking: low prompt diversity increases hacking; multiple completions reduce it.

Numbers: Subsampling prompts with k=2 or k=5 worsened the golden metric; increasing generations per prompt to m=2 or m=3 improved both proxy and golden metrics.

Teacher hacking can be detected without an oracle by monitoring proxy convergence behavior.

Numbers: The proxy metric follows a power law (linear on a log-log plot) with online data; offline cases deviate from it when hacking begins.
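The oracle-free diagnostic in the last finding can be sketched as a log-log line fit: fit a power law to the early part of the proxy curve, then check whether later points drift away from the fitted line. This is an illustrative sketch under assumptions, not the paper's procedure: the function names, `fit_fraction`, and the tolerance `tol` are hypothetical choices, and the curves in the usage lines are synthetic.

```python
import numpy as np


def fit_power_law(steps, proxy):
    """Least-squares line in log-log space:
    log(proxy) ~ slope * log(step) + intercept.
    Requires strictly positive steps and proxy values."""
    slope, intercept = np.polyfit(np.log(steps), np.log(proxy), 1)
    return slope, intercept


def deviates_from_power_law(steps, proxy, fit_fraction=0.5, tol=0.1):
    """Fit on the first `fit_fraction` of checkpoints, then report whether
    any later checkpoint differs from the fitted line by more than `tol`
    in log space. Both parameters are arbitrary illustrative defaults."""
    steps = np.asarray(steps, dtype=float)
    proxy = np.asarray(proxy, dtype=float)
    n_fit = max(2, int(len(steps) * fit_fraction))
    slope, intercept = fit_power_law(steps[:n_fit], proxy[:n_fit])
    predicted_log = slope * np.log(steps) + intercept
    late_residuals = np.log(proxy[n_fit:]) - predicted_log[n_fit:]
    return bool(late_residuals.size > 0 and np.max(np.abs(late_residuals)) > tol)


# Synthetic curves for illustration only.
steps = np.arange(1, 101, dtype=float)
clean_curve = 2.0 * steps ** -0.5   # exact power law
bent_curve = clean_curve.copy()
bent_curve[60:] *= 0.5              # late departure from the trend
```

On real training logs the fit window and tolerance would need to account for noise; the sketch only shows the shape of the check.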

Results

Occurrence of teacher hacking

Value: Observed (yes) for fixed offline datasets with long training

Baseline: No hacking under online generation

Minimum online data fraction needed to stabilize golden metric

Value: ≈10% online student data substantially stabilizes the golden metric

Baseline: 0% online (fully offline) shows hacking

Effect of prompt diversity (fixed generation budget)

Value: Lower diversity (k=2, 5) worsens the golden metric

Baseline: High diversity (one generation per prompt)

Generation budget effect

Value: Generating multiple completions (m=2, 3) improves proxy and golden metrics

Baseline: Single completion per prompt
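The two dataset knobs in the results above can be sketched with one helper. This is a hedged illustration, not the paper's data pipeline: it assumes k means "keep 1/k of the prompts and generate k completions for each" so the total budget stays fixed, and m means "generate m completions for every prompt" so the budget grows. The function `build_offline_dataset` and the `generate` callable (a stand-in for teacher sampling) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_offline_dataset(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    completions_per_prompt: int = 1,
    prompt_subsample_factor: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Build a fixed (prompt, completion) dataset.

    prompt_subsample_factor=k keeps len(prompts) // k distinct prompts;
    completions_per_prompt sets how many completions are drawn for each
    kept prompt.
    """
    rng = random.Random(seed)
    n_keep = max(1, len(prompts) // prompt_subsample_factor)
    kept = rng.sample(list(prompts), n_keep)
    return [(p, generate(p)) for p in kept for _ in range(completions_per_prompt)]


# Toy usage with a fake generator standing in for teacher sampling.
toy_prompts = [f"p{i}" for i in range(10)]
fake_generate = lambda prompt: prompt + "-out"

baseline = build_offline_dataset(toy_prompts, fake_generate)
low_diversity = build_offline_dataset(
    toy_prompts, fake_generate, completions_per_prompt=5, prompt_subsample_factor=5
)
larger_budget = build_offline_dataset(
    toy_prompts, fake_generate, completions_per_prompt=3
)
```

Under these assumptions, `baseline` and `low_diversity` have the same size but differ in the number of distinct prompts, while `larger_budget` keeps every prompt and triples the number of pairs.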

Who Should Care

What To Try In 7 Days

Add online generation to distillation (even 10% student-generated batches helps).

Increase prompt diversity for your distillation dataset rather than repeating prompts.

If online is impossible, pre-generate several completions per prompt to expand coverage.
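The first suggestion above, mixing a small share of online batches into an otherwise offline run, could look roughly like the sketch below. It is an assumption-laden illustration: `training_batches` and `sample_response` are hypothetical names, `sample_response` stands in for whatever student or teacher sampling call a real pipeline uses, and mixing at the batch level is one possible choice rather than the paper's stated recipe.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def training_batches(
    offline_pairs: Sequence[Pair],
    sample_response: Callable[[str], str],
    num_steps: int,
    batch_size: int,
    online_fraction: float = 0.1,
    seed: int = 0,
) -> Iterator[Tuple[str, List[Pair]]]:
    """Yield (source, batch) for each training step.

    With probability `online_fraction` the stored responses are replaced
    by freshly sampled ones for the same prompts ("online"); otherwise the
    fixed offline responses are reused ("offline").
    """
    rng = random.Random(seed)
    for _ in range(num_steps):
        batch = rng.sample(list(offline_pairs), batch_size)
        if rng.random() < online_fraction:
            yield "online", [(prompt, sample_response(prompt)) for prompt, _ in batch]
        else:
            yield "offline", batch


# Toy usage: the lambda fakes a generation call.
toy_pairs = [(f"p{i}", f"old{i}") for i in range(20)]
fake_sampler = lambda prompt: "fresh:" + prompt
mixed = list(training_batches(toy_pairs, fake_sampler, num_steps=200, batch_size=4))
```

Setting `online_fraction=0.0` recovers the fully offline baseline, and `1.0` gives fully online generation.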

Optimization Features

Model Optimization

  • distillation

Training Optimization

  • online generation (on-policy/off-policy mixing)
  • increase prompt diversity
  • multiple completions per prompt

Reproducibility

Data Urls

  • XSum
  • WMT-14 en-de
  • Natural Instructions

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Semi-synthetic setup uses an oracle model, which may not capture all real-world teacher biases.
  • Experiments are limited to T5-family models and three datasets; behavior on much larger LMs is untested.
  • The paper does not point to a public code release, so exact runs cannot be reproduced immediately.

When Not To Use

  • If you can run only one epoch of distillation (teacher hacking is minimal for 1–3 epochs).
  • If your prompt pool is tiny and cannot be diversified, pre-generate many completions per prompt instead.

Failure Modes

  • Standard overfitting (proxy metric increases) can occur and needs different remedies.
  • Teacher hacking may transfer unsafe or misleading behaviors from teacher to student.
  • Diagnostics relying only on proxy metrics can miss subtle shifts without convergence analysis.

Core Entities

Models

  • Flan-T5-XL (oracle, 3B)
  • T5-1.1 small (77M)
  • T5-1.1 base (250M)
  • T5-1.1 large (800M)

Metrics

  • forward KL (sequence-level)
  • reverse KL (sequence-level)
  • sequence-level Jensen-Shannon (JS_seq)
  • proxy metric (student vs teacher)
  • golden metric (student vs oracle)
  • token-level forward KL training loss
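For intuition about the divergences listed above, here is a toy sketch that computes them exactly over a three-outcome space standing in for whole sequences. It uses the standard textbook definitions of KL and Jensen-Shannon divergence; the three distributions are made up, and treating "forward" as KL(teacher || student) and "reverse" as KL(student || teacher) is the usual convention, assumed here rather than taken from the paper. Real sequence-level metrics would be estimated from sampled sequences and model log-probabilities instead.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up toy distributions over three "sequences".
oracle = [0.5, 0.3, 0.2]
teacher = [0.6, 0.3, 0.1]
student = [0.7, 0.25, 0.05]

forward_kl = kl(teacher, student)   # expectation under the teacher
reverse_kl = kl(student, teacher)   # expectation under the student
proxy_js = js(student, teacher)     # proxy metric: student vs teacher
golden_js = js(student, oracle)     # golden metric: student vs oracle
```

Unlike KL, the Jensen-Shannon divergence is symmetric and bounded by log 2, which makes proxy and golden values easier to compare on one plot.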

Datasets

  • XSum
  • WMT-14 en-de
  • Natural Instructions

Context Entities

Models

  • Flan-T5 family referenced as instruction-tuned oracle

Metrics

  • proxy-golden curve and polynomial convergence diagnostic

Datasets

  • prompt pools sampled from public datasets listed above