Initialize the student from the teacher and prune it slowly while distilling to keep predictions close and improve small models

February 19, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

9

Authors

Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bin Yin, Tuo Zhao

Why It Matters For Business

HomoDistil produces smaller BERT derivatives that outperform comparable distilled models by pruning from the teacher while distilling. The smaller models save storage and lower fine-tuning costs while preserving quality, which is useful when you need a compact model with higher accuracy than typical distilled alternatives.

Summary TLDR

HomoDistil is a task-agnostic distillation method that starts the student from the full teacher, then repeatedly prunes the least-important neurons while continuing distillation. This keeps the teacher–student prediction gap small and yields stronger small BERTs. On GLUE and SQuAD, HomoBERT models (14–65M params) beat several task-agnostic baselines, with the largest gains at the smallest scales (e.g., +3.3 task-average vs TinyBERT at ~14M). Iterative, sensitivity-based pruning and per-matrix sparsity control are central to the recipe.

Problem Statement

Task-agnostic distillation often underperforms because a small student and a large teacher make very different predictions on large pre-training corpora. This prediction gap makes it hard for the student to learn general representations and limits the benefit of distillation.
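To make the "prediction gap" concrete, here is a minimal sketch that measures it as the KL divergence between teacher and student output distributions. The logits are made-up illustrative values, and the function names and the `temperature` parameter are assumptions for this example, not code from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single prediction."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Illustrative logits only (not taken from any real model).
teacher = [2.0, 0.5, -1.0]
close_student = [1.9, 0.6, -1.1]   # e.g. a student that started as a copy of the teacher
far_student = [-1.0, 0.5, 2.0]     # e.g. a small student trained from scratch

gap_close = kl_divergence(teacher, close_student)
gap_far = kl_divergence(teacher, far_student)
print(f"close student KL: {gap_close:.4f}")
print(f"far student KL:   {gap_far:.4f}")
```

The student whose logits stay near the teacher's has a much smaller KL value, which is the quantity HomoDistil tries to keep small throughout training.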

Main Contribution

Propose HomoDistil: initialize the student from the full teacher and iteratively prune neurons during distillation to keep the prediction gap small.

Use sensitivity-based column/row importance and a cubically scheduled sparsity per weight matrix to produce structured, hardware-friendly sparsity.

Show consistent gains on GLUE and SQuAD across multiple target sizes, with the largest margins at the 10–20M parameter scale.
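A cubically scheduled sparsity can be sketched as follows. This uses the common cubic form from gradual-pruning work and assumes it approximates the schedule in the paper. The function name, parameter names, and default values are illustrative assumptions.

```python
def cubic_sparsity(step, total_steps, final_sparsity,
                   t_final_frac=0.7, t_start_frac=0.0, initial_sparsity=0.0):
    """Target sparsity at `step`, rising cubically from initial to final.

    Pruning starts at t_start_frac * total_steps and finishes at
    t_final_frac * total_steps (the "tf" knob); the sparsity then stays fixed.
    """
    t_i = t_start_frac * total_steps
    t_f = t_final_frac * total_steps
    if step >= t_f:                 # also covers tf = 0, i.e. one-shot pruning
        return final_sparsity
    if step <= t_i:
        return initial_sparsity
    progress = (step - t_i) / (t_f - t_i)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Example: 1000 steps, 80% final sparsity, pruning finishes at 0.7T.
for step in (0, 175, 350, 525, 700, 1000):
    print(step, round(cubic_sparsity(step, 1000, 0.8), 3))
```

The cubic shape prunes quickly at first, when the model is most redundant, and slows as it approaches the target. Setting `t_final_frac=0.0` reproduces the one-shot case that the summary reports as harmful.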

Key Findings

HomoBERT-base (65M) improves GLUE average over DistilBERT-style baselines.

Numbers: GLUE avg score 83.8 vs DistilBERT 82.1 on dev

Iterative prune-while-distill yields large gains at small sizes.

Numbers: HomoBERT-tiny avg 79.0 vs TinyBERT 14.5M avg 75.7 (+3.3)

HomoDistil improves question answering F1 over compact baselines.

Numbers: SQuAD v2.0 F1: HomoBERT-xsmall 70.0 vs MiniLM3 66.2 (+3.8)

Maintaining small prediction discrepancy during distillation is key.

Choice of importance metric matters for pruning quality.

Numbers: Sensitivity/PLATON avg ~81.8–81.9 vs Magnitude 80.1 on GLUE for 17M model
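The difference between magnitude and sensitivity scoring can be illustrated with a small sketch. It assumes the first-order sensitivity of a weight is |w · g| and that a column's score is the sum over its entries. The aggregation rule, function names, and numbers are assumptions for illustration. PLATON additionally smooths scores and accounts for their uncertainty, which is not shown here.

```python
def column_scores_magnitude(weights):
    """Score each column by the sum of absolute weights."""
    n_cols = len(weights[0])
    return [sum(abs(row[j]) for row in weights) for j in range(n_cols)]

def column_scores_sensitivity(weights, grads):
    """Score each column by the sum of |w * g| (first-order sensitivity)."""
    n_cols = len(weights[0])
    return [
        sum(abs(w_row[j] * g_row[j]) for w_row, g_row in zip(weights, grads))
        for j in range(n_cols)
    ]

def columns_to_prune(scores, n_prune):
    """Indices of the n_prune lowest-scoring columns."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    return sorted(order[:n_prune])

# Made-up 2x3 weight matrix and loss gradients.
weights = [[0.9, 0.1, -0.5],
           [-0.8, 0.2, 0.4]]
grads = [[0.01, 0.9, 0.3],
         [0.02, -0.7, 0.3]]

by_magnitude = columns_to_prune(column_scores_magnitude(weights), 1)
by_sensitivity = columns_to_prune(column_scores_sensitivity(weights, grads), 1)
print("magnitude prunes column", by_magnitude)
print("sensitivity prunes column", by_sensitivity)
```

In this toy case the two metrics disagree: magnitude removes the column with small weights even though the loss is sensitive to it, while sensitivity removes the large-weight column that barely affects the loss.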

Results

GLUE average score (dev)

Value: 83.8 (HomoBERT-base, 65M)

Baseline: 82.1 (DistilBERT, 66M)

GLUE average score (dev)

Value: 79.0 (HomoBERT-tiny, 14.1M)

Baseline: 75.7 (TinyBERT 4×312, 14.5M)

SQuAD v2.0 F1 (validation)

Value: 70.0 (HomoBERT-xsmall, 15.6M)

Baseline: 66.2 (MiniLM 3, 17.3M)

Prediction discrepancy (KL)

Value: Remains small when pruning schedule finishes at tf ∈ {0.5T, 0.7T, 0.9T}

Baseline: Large initial gap when tf = 0 (one-shot pruned init)

Inference speedup (vs BERT-base)

Value: 2.40× (HomoBERT-small, ~17M)

Baseline: 1.00× (BERT-base)

Who Should Care

What To Try In 7 Days

Reproduce: start the student from your pre-trained BERT, then apply iterative column/row pruning with a gradual schedule (tf between 0.5T and 0.9T).

Use sensitivity-based importance or PLATON scores rather than raw magnitude to choose neurons to prune.

Evaluate on one NLU and one QA task (GLUE and SQuAD) to measure fidelity vs baseline distilled models.
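The control flow of the steps above can be sketched on a toy linear model. This is not the paper's training code. The linear "teacher", the synthetic `truth` targets standing in for the pre-training objective, the squared-error losses, the single-sample sensitivity scores, and all names and hyperparameters are assumptions chosen to keep the example small.

```python
import random

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def cubic_sparsity(step, t_final, final_sparsity):
    """Target sparsity: rises cubically from 0 to final_sparsity by t_final."""
    if step >= t_final:
        return final_sparsity
    return final_sparsity * (1.0 - (1.0 - step / t_final) ** 3)

def prune_while_distill(teacher, truth, total_steps=200, t_final_frac=0.7,
                        final_sparsity=0.5, lr=0.05, seed=0):
    rng = random.Random(seed)
    n_out, n_in = len(teacher), len(teacher[0])
    student = [row[:] for row in teacher]   # student starts as a copy of the teacher
    alive = [True] * n_in                   # input columns that are still kept
    t_final = t_final_frac * total_steps
    for step in range(1, total_steps + 1):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]
        s, t, y = matvec(student, x), matvec(teacher, x), matvec(truth, x)
        # Gradient of 0.5*||s - y||^2 (task) + 0.5*||s - t||^2 (distillation).
        err = [(si - yi) + (si - ti) for si, ti, yi in zip(s, t, y)]
        grads = [[e * xj for xj in x] for e in err]
        # First-order sensitivity per column: sum_i |w_ij * g_ij|.
        scores = [sum(abs(student[i][j] * grads[i][j]) for i in range(n_out))
                  for j in range(n_in)]
        # Gradient step on surviving columns only.
        for i in range(n_out):
            for j in range(n_in):
                if alive[j]:
                    student[i][j] -= lr * grads[i][j]
        # Prune just enough columns to reach the scheduled sparsity.
        n_target = int(round(cubic_sparsity(step, t_final, final_sparsity) * n_in))
        n_pruned = alive.count(False)
        candidates = sorted((j for j in range(n_in) if alive[j]),
                            key=lambda j: scores[j])
        for j in candidates[:max(0, n_target - n_pruned)]:
            alive[j] = False
            for i in range(n_out):
                student[i][j] = 0.0
    return student, alive

# Synthetic 3x8 "ground truth" and a slightly imperfect teacher.
rng = random.Random(1)
truth = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
teacher = [[w + rng.uniform(-0.05, 0.05) for w in row] for row in truth]
student, alive = prune_while_distill(teacher, truth)
print("kept columns:", [j for j, keep in enumerate(alive) if keep])
```

Because columns are removed a few at a time between gradient steps, the student has a chance to recover after each pruning event. A real run would replace the toy model with BERT, the squared-error terms with the actual pre-training and distillation losses, and the single-sample scores with smoothed ones.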

Optimization Features

Model Optimization

  • structured neuron/column/row pruning
  • per-weight-matrix local sparsity control
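One way to see why per-matrix (local) sparsity control matters is to compare it with ranking all columns globally. The sketch below uses hypothetical matrix names and made-up importance scores. The global variant is shown only as a contrast and is an assumption of this example, not a method described in the summary.

```python
def local_prune(scores_by_matrix, sparsity):
    """Prune the same fraction of columns from every matrix."""
    kept = {}
    for name, scores in scores_by_matrix.items():
        n_keep = len(scores) - int(round(sparsity * len(scores)))
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        kept[name] = sorted(order[:n_keep])
    return kept

def global_prune(scores_by_matrix, sparsity):
    """Rank all columns together and prune the lowest-scoring fraction overall."""
    flat = sorted(
        (s, name, j)
        for name, scores in scores_by_matrix.items()
        for j, s in enumerate(scores)
    )
    n_prune = int(round(sparsity * len(flat)))
    pruned = {(name, j) for _, name, j in flat[:n_prune]}
    return {
        name: [j for j in range(len(scores)) if (name, j) not in pruned]
        for name, scores in scores_by_matrix.items()
    }

# Hypothetical per-column importance scores for two matrices.
scores = {
    "layer0.ffn": [0.9, 0.8, 0.7, 0.6],
    "layer1.ffn": [0.4, 0.3, 0.2, 0.1],
}

local_kept = local_prune(scores, 0.5)
global_kept = global_prune(scores, 0.5)
print("local widths: ", {k: len(v) for k, v in local_kept.items()})
print("global widths:", {k: len(v) for k, v in global_kept.items()})
```

With local control both matrices end up with the same width, whereas global ranking empties one matrix and leaves the other untouched. Uniform widths are what keep the resulting model regular and hardware-friendly.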

System Optimization

  • controls layer widths to be hardware-friendly (avoids very wide matrices)

Training Optimization

  • distill-while-prune loop
  • cubically scheduled sparsity increase
  • sensitivity-based importance scoring (first-order)

Inference Optimization

  • reduced FLOPs and faster inference than BERT-base, but less speedup than some tiny models
  • consistent layer widths to avoid per-layer memory bottlenecks

Reproducibility

Data Urls

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Less raw inference speedup than some tiny distilled models because HomoDistil preserves larger backbone capacity.
  • Requires continual pre-training on open-domain data (Wikipedia/BookCorpus) and non-trivial compute (reported ~13 hours on 8 A100 GPUs).
  • Depth (layer) pruning is not addressed here; the authors leave depth pruning in the task-agnostic setting as an open problem.

When Not To Use

  • If you need the absolute fastest inference and smallest FLOPs at any accuracy cost.
  • If you cannot afford the continual pre-training compute budget to run prune-while-distill.
  • When a task-specific distillation pipeline is already optimized and you prefer per-task teachers.

Failure Modes

  • Single-shot (one-step) pruning or starting from a pruned student causes a large initial prediction gap and poor downstream accuracy.
  • Using movement pruning (task-specific metric) in this task-agnostic pipeline caused divergence in experiments.
  • Pruning certain output projections too late increases distillation loss sharply; schedule tuning is needed.

Core Entities

Models

  • BERT-base
  • HomoBERT-base
  • HomoBERT-small
  • HomoBERT-xsmall
  • HomoBERT-tiny
  • DistilBERT
  • TinyBERT
  • MiniLM
  • MiniLMv2
  • CoFi
  • SparseBERT
  • DynaBERT

Metrics

  • Accuracy
  • F1
  • Exact Match (EM)
  • KL divergence (prediction gap)
  • Inference speedup
  • FLOPs

Datasets

  • Wikipedia (English)
  • BookCorpus
  • GLUE
  • SQuAD v1.1
  • SQuAD v2.0

Benchmarks

  • GLUE
  • SQuAD v1.1
  • SQuAD v2.0