Initialize the student from the teacher and prune it slowly while distilling to keep predictions close and improve small models

February 19, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

Method shows consistent gains on standard benchmarks and reports multiple seeds; compute and implementation details are provided, but code was not released at time of writing.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bin Yin, Tuo Zhao

Why It Matters For Business

HomoDistil produces smaller, better-performing BERT derivatives by pruning from the teacher while distilling. This saves storage and lowers fine-tuning costs while preserving quality, which is useful when you need compact models with higher accuracy than typical distilled alternatives.

Who Should Care

Summary TLDR

HomoDistil is a task-agnostic distillation method that starts the student from the full teacher, then repeatedly prunes the least-important neurons while continuing distillation. This keeps the teacher–student prediction gap small and yields stronger small BERTs. On GLUE and SQuAD, HomoBERT models (14–65M params) beat several task-agnostic baselines, with the largest gains at the smallest scales (e.g., +3.3 task-average vs TinyBERT at ~14M). Iterative, sensitivity-based pruning and per-matrix sparsity control are central to the recipe.

Problem Statement

Task-agnostic distillation often fails because a small student and large teacher make very different predictions over huge pretraining data. That prediction gap makes it hard for the student to learn general representations and reduces distillation benefits.
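
To make the "prediction gap" concrete, here is a minimal sketch that measures it as the KL divergence between teacher and student output distributions. The direction KL(teacher || student), the toy logits, and the function names are assumptions made for illustration; the source only lists KL divergence as the gap metric.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy logits (hypothetical): one student tracks the teacher, the other does not.
teacher_logits = [2.0, 0.5, -1.0]
close_student_logits = [1.9, 0.6, -0.9]
far_student_logits = [-1.0, 0.5, 2.0]

teacher_probs = softmax(teacher_logits)
gap_close = kl_divergence(teacher_probs, softmax(close_student_logits))
gap_far = kl_divergence(teacher_probs, softmax(far_student_logits))

print(f"gap (close student): {gap_close:.4f}")
print(f"gap (far student):   {gap_far:.4f}")
```

A student initialized from the teacher starts with a gap of zero, which is the situation HomoDistil tries to stay close to as it prunes.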

Main Contribution

Propose HomoDistil: initialize the student from the full teacher and iteratively prune neurons during distillation to keep the prediction gap small.

Use sensitivity-based column/row importance and a cubically scheduled sparsity per weight matrix to produce structured, hardware-friendly sparsity.
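
The cubic sparsity schedule can be sketched as follows. This uses the commonly seen cubic form, where sparsity rises quickly at first and flattens as it nears the target; the exact formula, the parameter names (`t_i`, `t_f`, `s_init`, `s_final`), and the numbers in the example are assumptions, not taken from the paper.

```python
def cubic_sparsity(step, t_i, t_f, s_init=0.0, s_final=0.8):
    """Target sparsity at a training step under a cubic schedule.

    Before t_i nothing beyond s_init is pruned; after t_f the sparsity
    stays at s_final; in between it follows a cubic interpolation.
    """
    if step <= t_i:
        return s_init
    if step >= t_f:
        return s_final
    progress = (step - t_i) / (t_f - t_i)
    return s_final + (s_init - s_final) * (1.0 - progress) ** 3

# Illustrative values only: T = 1000 total steps, pruning ends at 0.7T.
T = 1000
schedule = [cubic_sparsity(t, t_i=0, t_f=int(0.7 * T)) for t in range(0, T + 1, 100)]
for t, s in zip(range(0, T + 1, 100), schedule):
    print(f"step {t:4d}: sparsity {s:.3f}")
```

Because each step removes only a small slice of neurons, the student's predictions should move only slightly between consecutive pruning steps.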

Key Findings

HomoBERT-base (65M) improves GLUE average over DistilBERT-style baselines.

Numbers: GLUE avg score 83.8 vs DistilBERT 82.1 on dev

Practical Use: If you compress BERT to ~65M, HomoDistil can yield better NLU accuracy than standard task-agnostic baselines; prefer it when accuracy matters over maximum speed.

Evidence Ref: Table 2

Iterative prune-while-distill yields large gains at small sizes.

Numbers: HomoBERT-tiny avg 79.0 vs TinyBERT 14.5M avg 75.7 (+3.3)

Practical Use: For edge/embedded models (~14–17M), use iterative pruning from the teacher rather than single-shot pruning or direct tiny initialization.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GLUE average score (dev) | 83.8 (HomoBERT-base, 65M) | 82.1 (DistilBERT, 66M) | +1.7 | GLUE development set | Table 2 reports median of 5 seeds | Table 2
GLUE average score (dev) | 79.0 (HomoBERT-tiny, 14.1M) | 75.7 (TinyBERT 4×312, 14.5M) | +3.3 | GLUE development set | Table 2 shows task-average scores | Table 2

What To Try In 7 Days

Reproduce: start the student from your pre-trained BERT and apply iterative column/row pruning with a gradual schedule (final pruning step t_f between 0.5T and 0.9T, where T is the total number of training steps).

Use sensitivity-based importance or PLATON scores rather than raw magnitude to choose neurons to prune.

Evaluate on one NLU and one QA task (GLUE and SQuAD) to measure fidelity vs baseline distilled models.
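
The second step above (sensitivity instead of raw magnitude) can be sketched in a few lines. This is a minimal illustration that assumes a first-order score of |weight × gradient| summed over each column; the function names, the toy matrices, and the aggregation by summation are hypothetical choices, and PLATON-style scores are not shown.

```python
def column_importance(weights, grads):
    """First-order sensitivity per column: sum over rows of |w * g|."""
    n_cols = len(weights[0])
    scores = [0.0] * n_cols
    for w_row, g_row in zip(weights, grads):
        for j, (w, g) in enumerate(zip(w_row, g_row)):
            scores[j] += abs(w * g)
    return scores

def prune_columns(weights, scores, sparsity):
    """Drop the lowest-scoring columns; return the smaller dense matrix and kept indices."""
    n_cols = len(scores)
    n_keep = max(1, n_cols - int(round(sparsity * n_cols)))
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:n_keep])
    pruned = [[row[j] for j in keep] for row in weights]
    return pruned, keep

# Toy 2x4 weight matrix and gradients (hypothetical values).
W = [[0.5, -2.0, 0.1, 1.0],
     [1.5, 0.3, -0.2, -1.0]]
G = [[0.2, 0.01, 0.5, -0.3],
     [-0.1, 0.02, 0.4, 0.3]]

scores = column_importance(W, G)
W_pruned, kept = prune_columns(W, scores, sparsity=0.5)
print("scores:", [round(s, 3) for s in scores])
print("kept columns:", kept)
```

In this toy case column 1 holds the largest-magnitude weight (-2.0) but is pruned anyway, because its gradients are tiny; a magnitude-only criterion would have kept it.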

Optimization Features

Model Optimization
structured neuron/column/row pruning; per-weight-matrix local sparsity control
System Optimization
controls layer widths to be hardware-friendly (avoids very wide matrices)
Training Optimization
distill-while-prune loop; cubically scheduled sparsity increase; sensitivity-based importance scoring (first-order)
Inference Optimization
reduced FLOPs and faster inference vs BERT-base but less speedup than some tiny models; consistent layer widths to avoid per-layer memory bottlenecks
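
To illustrate why per-weight-matrix (local) sparsity control leads to consistent layer widths, the sketch below contrasts it with a single global ranking across all matrices. The global variant is included only as a contrast and is an assumption of this example; the matrix names and scores are hypothetical.

```python
def local_keep(scores_by_matrix, sparsity):
    """Keep the top (1 - sparsity) fraction of columns within each matrix."""
    kept = {}
    for name, scores in scores_by_matrix.items():
        n_keep = max(1, len(scores) - int(round(sparsity * len(scores))))
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        kept[name] = sorted(ranked[:n_keep])
    return kept

def global_keep(scores_by_matrix, sparsity):
    """Rank all columns of all matrices together and keep the top fraction overall."""
    flat = [(s, name, j)
            for name, scores in scores_by_matrix.items()
            for j, s in enumerate(scores)]
    n_keep = max(1, len(flat) - int(round(sparsity * len(flat))))
    flat.sort(key=lambda item: item[0], reverse=True)
    kept = {name: [] for name in scores_by_matrix}
    for _, name, j in flat[:n_keep]:
        kept[name].append(j)
    return {name: sorted(cols) for name, cols in kept.items()}

# Hypothetical importance scores: one matrix scores uniformly higher than the other.
scores = {
    "layer0.ffn": [0.9, 0.8, 0.7, 0.6],
    "layer1.ffn": [0.3, 0.2, 0.1, 0.05],
}

local_widths = {k: len(v) for k, v in local_keep(scores, 0.5).items()}
global_widths = {k: len(v) for k, v in global_keep(scores, 0.5).items()}
print("local :", local_widths)
print("global:", global_widths)
```

With local control every matrix shrinks by the same fraction, while the global ranking in this toy case empties one matrix and leaves the other at full width.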

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://dumps.wikimedia.org/enwiki/
BookCorpus (Zhu et al., 2015)

Risks & Boundaries

Limitations

Less raw inference speedup than some tiny distilled models because HomoDistil preserves larger backbone capacity.

Requires continual pre-training on open-domain data (Wikipedia/BookCorpus) and non-trivial compute (reported ~13 hours on 8 A100 GPUs, roughly 104 GPU-hours).

When Not To Use

If you need the absolute fastest inference and smallest FLOPs at any accuracy cost.

If you cannot afford the continual pre-training compute budget to run prune-while-distill.

Failure Modes

Single-shot (one-step) pruning or starting from a pruned student causes a large initial prediction gap and poor downstream accuracy.

Using movement pruning (task-specific metric) in this task-agnostic pipeline caused divergence in experiments.

Core Entities

Models

BERT-base, HomoBERT-base, HomoBERT-small, HomoBERT-xsmall, HomoBERT-tiny, DistilBERT, TinyBERT, MiniLM, MiniLMv2, CoFi, SparseBERT, DynaBERT

Metrics

Accuracy, F1, Exact Match (EM), KL divergence (prediction gap), Inference speedup, FLOPs

Datasets

Wikipedia (English), BookCorpus, GLUE, SQuAD v1.1, SQuAD v2.0

Benchmarks

GLUE, SQuAD v1.1, SQuAD v2.0