Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
9
Why It Matters For Business
HomoDistil produces smaller, better-performing BERT derivatives by pruning the teacher while distilling. The resulting models save storage and lower fine-tuning costs while preserving quality, which is useful when you need compact models with higher accuracy than typical distilled alternatives.
Summary TLDR
HomoDistil is a task-agnostic distillation method that starts the student from the full teacher, then repeatedly prunes the least-important neurons while continuing distillation. This keeps the teacher–student prediction gap small and yields stronger small BERTs. On GLUE and SQuAD, HomoBERT models (14–65M params) beat several task-agnostic baselines, with the largest gains at the smallest scales (e.g., +3.3 task-average vs TinyBERT at ~14M). Iterative, sensitivity-based pruning and per-matrix sparsity control are central to the recipe.
Problem Statement
Task-agnostic distillation often fails because a small student and large teacher make very different predictions over huge pretraining data. That prediction gap makes it hard for the student to learn general representations and reduces distillation benefits.
Main Contribution
Propose HomoDistil: initialize the student from the full teacher and iteratively prune neurons during distillation to keep the prediction gap small.
Use sensitivity-based column/row importance and a cubically scheduled sparsity per weight matrix to produce structured, hardware-friendly sparsity.
Show consistent gains on GLUE and SQuAD across multiple target sizes, with the largest margins at the 10–20M parameter scale.
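The cubically scheduled sparsity mentioned above can be sketched as follows. This is a minimal illustration that assumes the commonly used cubic ramp from an initial to a final sparsity; the function name, the parameter names, and the default `end_frac=0.7` are hypothetical choices for this example, not values taken from the paper.

```python
def cubic_sparsity(step, total_steps, final_sparsity,
                   initial_sparsity=0.0, start_frac=0.0, end_frac=0.7):
    """Target sparsity at `step` under an assumed cubic ramp.

    Sparsity rises quickly at first and flattens as it approaches
    `final_sparsity`, so most pruning happens while the student is
    still close to the teacher.
    """
    t_i = start_frac * total_steps   # step at which pruning starts
    t_f = end_frac * total_steps     # step at which pruning stops
    if step <= t_i:
        return initial_sparsity
    if step >= t_f:
        return final_sparsity
    progress = (step - t_i) / (t_f - t_i)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Print the target sparsity at a few points of a 1000-step run.
for s in (0, 175, 350, 700, 1000):
    print(s, round(cubic_sparsity(s, 1000, final_sparsity=0.8), 4))
```

Because the ramp is applied to each weight matrix separately, every matrix reaches its own target sparsity gradually instead of being cut in a single step.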
Key Findings
HomoBERT-base (65M) improves GLUE average over DistilBERT-style baselines.
Iterative prune-while-distill yields large gains at small sizes.
HomoDistil improves question answering F1 over compact baselines.
Maintaining small prediction discrepancy during distillation is key.
Choice of importance metric matters for pruning quality.
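The prediction discrepancy referred to in these findings is measured with a KL divergence between teacher and student output distributions. The sketch below shows one way to compute it for a single example from raw logits; the direction KL(teacher || student) and the helper names are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one example (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# A student that matches the teacher has zero gap; a mismatched one does not.
teacher = [2.0, 0.5, -1.0]
print(kl_divergence(teacher, teacher))
print(kl_divergence(teacher, [0.0, 0.0, 0.0]))
```

Tracking this value over training steps is a simple way to check whether a pruning step has opened a large gap between the student and the teacher.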
Results
GLUE average score (dev)
SQuAD v2.0 F1 (validation)
Prediction discrepancy (KL)
Inference speedup (vs BERT-base)
Who Should Care
What To Try In 7 Days
Reproduce: start the student from your pre-trained BERT and apply iterative column/row pruning on a gradual schedule (final pruning step t_f between 0.5T and 0.9T, where T is the total number of training steps).
Use sensitivity-based importance or PLATON scores rather than raw magnitude to choose neurons to prune.
Evaluate on one NLU task and one QA task (for example, from GLUE and SQuAD) to measure fidelity against baseline distilled models.
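The pruning step in the list above can be sketched in a few lines. This toy version scores each column of one weight matrix with a first-order sensitivity, |weight × gradient|, and keeps the highest-scoring columns. Summing the per-parameter scores within a column, the nested-list matrix representation, and all function names are assumptions for illustration; a real run would operate on framework tensors inside the distillation loop.

```python
def column_importance(weights, grads):
    """Sum of |w * g| over each column (assumed aggregation)."""
    n_cols = len(weights[0])
    scores = [0.0] * n_cols
    for w_row, g_row in zip(weights, grads):
        for j in range(n_cols):
            scores[j] += abs(w_row[j] * g_row[j])
    return scores


def columns_to_keep(scores, sparsity):
    """Indices of the columns that survive at the given sparsity."""
    n_cols = len(scores)
    n_prune = int(sparsity * n_cols)
    n_keep = max(1, n_cols - n_prune)   # never remove every column
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def prune_columns(weights, keep):
    """Return a narrower matrix containing only the kept columns."""
    return [[row[j] for j in keep] for row in weights]


# Toy 2x4 weight matrix and its gradient of the distillation loss.
weights = [[1.0, -2.0, 0.5, 3.0],
           [0.5, 1.0, -0.5, -1.0]]
grads = [[0.1, 0.0, 0.2, 0.3],
         [0.2, 0.1, 0.0, 0.1]]

scores = column_importance(weights, grads)
keep = columns_to_keep(scores, sparsity=0.5)
pruned = prune_columns(weights, keep)
print(keep, pruned)
```

In this toy example, columns 0 and 3 are kept. A magnitude-only criterion would have kept column 1 instead of column 0, because column 1 has large weights, even though its gradients are close to zero. That difference is the motivation for preferring sensitivity-based scores over raw magnitude.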
Optimization Features
Model Optimization
- structured neuron/column/row pruning
- per-weight-matrix local sparsity control
System Optimization
- controls layer widths to be hardware-friendly (avoids very wide matrices)
Training Optimization
- distill-while-prune loop
- cubically scheduled sparsity increase
- sensitivity-based importance scoring (first-order)
Inference Optimization
- reduced FLOPs and faster inference than BERT-base, but less speedup than some tiny models
- consistent layer widths to avoid per-layer memory bottlenecks
Reproducibility
Data Urls
- https://dumps.wikimedia.org/enwiki/
- BookCorpus (Zhu et al., 2015)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Less raw inference speedup than some tiny distilled models because HomoDistil preserves larger backbone capacity.
- Requires continual pre-training on open-domain data (Wikipedia/BookCorpus) and non-trivial compute (reported ~13 hours on 8 A100 GPUs).
- Depth (layer) pruning is not addressed; the authors leave depth pruning in the task-agnostic setting as an open problem.
When Not To Use
- If you need the absolute fastest inference and smallest FLOPs at any accuracy cost.
- If you cannot afford the continual pre-training compute budget to run prune-while-distill.
- When a task-specific distillation pipeline is already optimized and you prefer per-task teachers.
Failure Modes
- Single-shot (one-step) pruning or starting from a pruned student causes a large initial prediction gap and poor downstream accuracy.
- Using movement pruning (task-specific metric) in this task-agnostic pipeline caused divergence in experiments.
- Pruning certain output projections too late increases distillation loss sharply; schedule tuning is needed.
Core Entities
Models
- BERT-base
- HomoBERT-base
- HomoBERT-small
- HomoBERT-xsmall
- HomoBERT-tiny
- DistilBERT
- TinyBERT
- MiniLM
- MiniLMv2
- CoFi
- SparseBERT
- DynaBERT
Metrics
- Accuracy
- F1
- Exact Match (EM)
- KL divergence (prediction gap)
- Inference speedup
- FLOPs
Datasets
- Wikipedia (English)
- BookCorpus
- GLUE
- SQuAD v1.1
- SQuAD v2.0
Benchmarks
- GLUE
- SQuAD v1.1
- SQuAD v2.0

