Overview
The survey compiles many scalable KD recipes and tests their robustness. The methods are practically useful, but they require per‑model tuning and are constrained by the level of access to the teacher (white‑box vs black‑box).
Citations: 2
Evidence Strength: 0.70
Confidence: 0.88
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
Distillation cuts LLM inference cost and memory while keeping most capabilities; picking the right KD style (white‑box, black‑box, CoT, retrieval‑augmented) matters for accuracy, robustness, and deployment budget.
Who Should Care
Summary TLDR
This is a 28‑page survey that organizes knowledge distillation (KD) for large language models (LLMs) into white‑box (logits and feature/hint) and black‑box (in‑context, chain‑of‑thought, instruction) methods. It reviews algorithm families, compares evaluation practices, runs unified robustness tests (adversarial and out‑of‑distribution) across teacher/student families, and surveys real applications in healthcare, education and law. The paper highlights method trade‑offs, gaps in unified benchmarks, and practical recommendations for choosing KD strategies.
Problem Statement
LLMs are powerful but expensive to run. We need reliable ways to shrink them (speed, memory) without losing key capabilities. Existing KD work started on smaller models and lacks unified evaluations and practical guidance for LLMs that are often closed‑source or multi‑capability.
Main Contribution
Taxonomy of KD for LLMs: white‑box (logits, hint) vs black‑box (ICL, CoT, instruction)
Survey and comparison of representative distillation methods and empirical tasks
Key Findings
Hint‑based (feature) distillation often transfers richer information and yields higher task accuracy than logits‑only KD.
Sequence‑level and reverse‑KL‑based KD (MiniLLM) improves generation robustness and the quality of long responses compared with standard KD.
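
The difference between standard KD and a reverse‑KL objective can be sketched in a few lines. This is a minimal illustration in plain Python, not MiniLLM's implementation: it assumes standard KD minimizes the forward KL, KL(teacher || student), and the reverse‑KL variant minimizes KL(student || teacher). The function names and toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0 (true for softmax outputs here)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher_logits, student_logits):
    """Assumed standard-KD objective: KL(teacher || student)."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def reverse_kl(teacher_logits, student_logits):
    """Reverse-KL objective: KL(student || teacher)."""
    return kl_divergence(softmax(student_logits), softmax(teacher_logits))

# Toy example (hypothetical values): the teacher spreads its mass over two
# tokens, while the student commits to only one of them.
teacher_logits = [2.0, 2.0, -2.0]
student_logits = [4.0, -2.0, -2.0]

print("forward KL:", forward_kl(teacher_logits, student_logits))
print("reverse KL:", reverse_kl(teacher_logits, student_logits))
```

In this toy case the student that commits to one of the teacher's two modes is penalized less by reverse KL than by forward KL, which matches the usual mode‑seeking intuition offered for reverse KL in generation settings.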
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GLUE performance retention | DistilBERT retains ~97% of BERT | BERTbase | −3% relative | GLUE | DistilBERT reported ~97% performance vs BERT on GLUE | Table 1 |
| SQuAD/GLUE retention | MiniLM retains >99% accuracy | BERTbase | <1% relative drop | SQuAD 2.0 / GLUE | MiniLM retained >99% on SQuAD 2.0 and GLUE with 50% of the teacher's Transformer parameters | Sec. 3.1.2 |
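
The Value and Delta columns are linked by simple arithmetic. The sketch below assumes retention means the student's score divided by the teacher's score; the source does not state the exact definition. The two scores are hypothetical placeholders chosen only to illustrate a 97% retention, not figures from the survey.

```python
def retention(student_score, teacher_score):
    """Fraction of the teacher's score that the student keeps."""
    return student_score / teacher_score

def relative_delta(student_score, teacher_score):
    """Signed relative change of the student with respect to the teacher."""
    return (student_score - teacher_score) / teacher_score

# Hypothetical benchmark scores, for illustration only.
teacher_score = 80.0
student_score = 77.6

print(f"retention: {retention(student_score, teacher_score):.1%}")
print(f"relative delta: {relative_delta(student_score, teacher_score):+.1%}")
```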
What To Try In 7 Days
Run a quick distillation baseline: fine‑tune a small student on task data with logits KD, then measure its accuracy drop relative to the teacher
If teacher internals exist, try hint/feature KD; otherwise collect LLM prompts or CoT outputs for black‑box distillation
Benchmark student on adversarial and OOD checks (AdvGLUE/ANLI or a small domain holdout) to catch robustness regressions
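
The baseline and robustness check above can be sketched as follows. This is a minimal, framework‑free illustration under stated assumptions: the loss is the commonly used temperature‑scaled logits KD objective (a weighted sum of cross‑entropy on the hard label and a T²‑scaled KL term against the teacher), and `kd_loss`, `accuracy`, `robustness_report`, and all numeric values are hypothetical. A real run would use a tensor library and actual model outputs.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * CE(label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    cross_entropy = -log_softmax(student_logits)[label]
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    return alpha * cross_entropy + (1.0 - alpha) * temperature ** 2 * kl

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)

def robustness_report(teacher_acc, student_acc):
    """Accuracy drop (teacher minus student) for each evaluation split."""
    return {split: teacher_acc[split] - student_acc[split]
            for split in teacher_acc}

# Hypothetical logits for one 3-class training example.
loss = kd_loss([1.0, 0.2, -0.5], [2.0, 0.1, -1.0], label=0)
print("KD loss:", loss)

# Hypothetical accuracies on an in-domain split and an OOD holdout.
report = robustness_report(
    {"in_domain": 0.90, "ood": 0.80},
    {"in_domain": 0.88, "ood": 0.70},
)
print("accuracy drop per split:", report)
```

A drop that is noticeably larger on the adversarial or OOD split than on the in‑domain split is the kind of robustness regression this check is meant to surface.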
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Survey relies on published papers and some closed‑source LLM results; direct reproducibility of closed models is limited
No single unified benchmark for KD—results depend on model family, task, and data used
When Not To Use
If you need layer‑wise feature transfer but have no access to teacher internals (weights, hidden states), white‑box KD is not possible
If you cannot afford LLM API costs, large-scale black‑box data generation is impractical
Failure Modes
Student overfits teacher quirks or low‑quality generated labels when teacher filtering is weak
KD recipe selected for one model family may underperform or worsen robustness on another family

