Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
Distillation cuts LLM inference cost and memory while keeping most capabilities; picking the right KD style (white‑box, black‑box, CoT, retrieval‑augmented) matters for accuracy, robustness, and deployment budget.
Summary TLDR
This is a 28‑page survey that organizes knowledge distillation (KD) for large language models (LLMs) into white‑box (logits and feature/hint) and black‑box (in‑context, chain‑of‑thought, instruction) methods. It reviews algorithm families, compares evaluation practices, runs unified robustness tests (adversarial and out‑of‑distribution) across teacher/student families, and surveys real applications in healthcare, education and law. The paper highlights method trade‑offs, gaps in unified benchmarks, and practical recommendations for choosing KD strategies.
Problem Statement
LLMs are powerful but expensive to run. We need reliable ways to shrink them, improving speed and memory use, without losing key capabilities. Existing KD work started on smaller models and lacks unified evaluations and practical guidance for LLMs, which are often closed‑source or multi‑capability.
Main Contribution
Taxonomy of KD for LLMs: white‑box (logits, hint) vs black‑box (ICL, CoT, instruction)
Survey and comparison of representative distillation methods and empirical tasks
Unified robustness evaluation (adversarial and OOD) of several KD algorithms on the GPT‑2, OPT, and LLaMA families
Discussion of practical applications (healthcare, education, law) and open challenges (benchmarks, interpretability, multimodal KD)
Key Findings
Hint‑based (feature) distillation often transfers richer information and yields higher task accuracy than logits‑only KD.
Sequence‑level and reverse‑KL based KD (MINILLM) improve generation robustness and the quality of long responses compared with standard KD.
Black‑box CoT (chain‑of‑thought) distillation boosts multi‑step reasoning in small models, sometimes surpassing much larger models.
Distillation effectiveness depends on model family and dataset: the same KD method can perform very differently across LLaMA, OPT, and GPT‑2.
Retrieval‑augmented distillation can improve generalization with small latency cost.
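The contrast between standard (forward‑KL) KD and reverse‑KL KD can be illustrated on a toy distribution. The sketch below is not MINILLM's training procedure; the three‑way distributions and the names `teacher`, `covering`, and `committed` are made up for illustration. It only shows the commonly described behavior: forward KL favors a student that spreads mass over all teacher modes, while reverse KL favors a student that commits to one mode and avoids regions the teacher considers unlikely.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bimodal teacher and two candidate students.
teacher = [0.495, 0.01, 0.495]
covering = [0.34, 0.32, 0.34]    # spreads mass, including the unlikely middle
committed = [0.98, 0.01, 0.01]   # commits to a single teacher mode

# Forward KL, KL(teacher || student): the usual logits-KD direction.
forward_covering = kl(teacher, covering)
forward_committed = kl(teacher, committed)

# Reverse KL, KL(student || teacher): penalizes student mass where the teacher has little.
reverse_covering = kl(covering, teacher)
reverse_committed = kl(committed, teacher)

print(f"forward KL: covering={forward_covering:.3f} committed={forward_committed:.3f}")
print(f"reverse KL: covering={reverse_covering:.3f} committed={reverse_committed:.3f}")
```

On these numbers, forward KL ranks the covering student as closer to the teacher, and reverse KL ranks the committed student as closer.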
Results
GLUE performance retention
SQuAD/GLUE retention
ReAugKD latency overhead
CoT data efficiency
Who Should Care
What To Try In 7 Days
Run a quick distillation baseline: fine‑tune a small student with logits KD on task data and measure the accuracy drop relative to the teacher
If teacher internals exist, try hint/feature KD; otherwise collect LLM prompts or CoT outputs for black‑box distillation
Benchmark student on adversarial and OOD checks (AdvGLUE/ANLI or a small domain holdout) to catch robustness regressions
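For the logits‑KD baseline in the first step, a minimal sketch of one common formulation of the loss follows: a temperature‑softened KL term against the teacher, mixed with cross‑entropy on the gold label. The function names and the `temperature` and `alpha` defaults are assumptions chosen for illustration, not values taken from the survey, and a real run would use a tensor library rather than Python lists.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy.

    Intended for small illustrative inputs; it does not guard against
    probabilities underflowing to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0)
    cross_entropy = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy

# Hypothetical 3-class example where the gold label is class 2.
teacher_logits = [1.0, 2.0, 3.0]
print(kd_loss([1.0, 2.0, 3.0], teacher_logits, label=2))  # student matches teacher
print(kd_loss([3.0, 2.0, 1.0], teacher_logits, label=2))  # student disagrees
```

The second call returns a larger loss than the first, since both the KL term and the cross‑entropy term grow when the student disagrees with the teacher and the label.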
Optimization Features
Token Efficiency
- CoT distillation reduces labelled‑data requirements by more than 50%
- MiniMA reports its best results with a student at roughly 40% of the teacher's size
Model Optimization
- logits-based KD
- hint/feature distillation
- sequence-level (SeqKD) and f‑divergence
- reverse‑KL generation KD
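Hint/feature distillation, listed above, is often implemented as a regression from student hidden states to teacher hidden states. The sketch below assumes a mean‑squared‑error objective and a linear projection to bridge the difference in hidden sizes; `project`, `hint_loss`, and the tiny example matrix are hypothetical names and values used only for illustration.

```python
def project(vec, weight):
    """Map a student hidden vector to the teacher's width.

    `weight` is a list of rows with shape (teacher_dim, student_dim).
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def hint_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_hidden)) / len(teacher_hidden)

# Hypothetical 2-dim student state mapped into a 3-dim teacher space.
weight = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]
student_hidden = [1.0, 2.0]

print(hint_loss(student_hidden, [1.0, 2.0, 3.0], weight))  # projection matches the teacher
print(hint_loss(student_hidden, [1.0, 2.0, 5.0], weight))  # last dimension is off by 2
```

In training, the projection weights would be learned jointly with the student; here they are fixed so the arithmetic is easy to follow.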
System Optimization
- avoid backprop through teacher; reduce activation memory via data/mapping choices
Training Optimization
- teacher mixed sampling
- single‑step decomposition
- length normalization
- meta‑distillation (teacher fixed, pilot updates)
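Length normalization, listed above, can be illustrated with a general sketch; how any particular method in the survey applies it is not specified here. The idea shown is that a summed sequence log‑probability penalizes every extra token, so dividing by length removes the bias toward short responses. The function name and the per‑token values are made up for illustration.

```python
def sequence_score(token_logprobs, length_normalize=True):
    """Score a sequence from per-token log-probabilities.

    Without normalization the score is the sum, which shrinks with every
    added token; with normalization it is the per-token average.
    """
    total = sum(token_logprobs)
    if length_normalize and token_logprobs:
        return total / len(token_logprobs)
    return total

# Hypothetical responses: a short, mediocre one and a longer, better one.
short_response = [-0.9, -0.9]
long_response = [-0.5] * 6

print(sequence_score(short_response, False), sequence_score(long_response, False))
print(sequence_score(short_response, True), sequence_score(long_response, True))
```

The unnormalized score prefers the short response, while the normalized score prefers the longer one with higher per‑token likelihood.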
Inference Optimization
- smaller student model for lower latency
- retrieval‑augmented distillation for OOD gains
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Survey relies on published papers and some closed‑source LLM results; direct reproducibility of closed models is limited
- No single unified benchmark exists for KD; results depend on the model family, task, and data used
- Many black‑box distillation pipelines depend on costly API calls to closed LLMs
When Not To Use
- If you need layer‑wise feature transfer but have no access to teacher internals, white‑box KD is not possible
- If you cannot afford LLM API costs, large-scale black‑box data generation is impractical
- Avoid naive logits KD for complex generative or reasoning tasks without sequence/objective adjustments
Failure Modes
- Student overfits teacher quirks or low‑quality generated labels when teacher filtering is weak
- KD recipe selected for one model family may underperform or worsen robustness on another family
- Feature/hint KD can blow up GPU memory due to activations from large teachers
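A back‑of‑the‑envelope estimate shows why the activation‑memory failure mode matters. The sketch below counts only one hidden‑state tensor per teacher layer, and every number in it (batch size, sequence length, hidden size, layer count, fp16 storage) is an assumed value for illustration, not a figure from the survey.

```python
def hidden_state_bytes(batch, seq_len, hidden, layers, bytes_per_value=2):
    """Memory to hold one hidden-state tensor per layer for a single batch."""
    return batch * seq_len * hidden * layers * bytes_per_value

# Assumed configuration: 8 sequences of 2048 tokens, hidden size 4096, 32 layers, fp16.
total_bytes = hidden_state_bytes(batch=8, seq_len=2048, hidden=4096, layers=32)
print(total_bytes / 1024 ** 3, "GiB")
```

Under these assumptions the teacher hidden states alone take 4 GiB per batch, before weights, student activations, or optimizer state. Matching only a few selected layers reduces this proportionally.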
Core Entities
Models
- GPT‑3
- GPT‑3.5
- GPT‑4
- GPT‑2
- LLaMA
- LLaMA2
- OPT
- T5
- BERT
- DistilBERT
- MiniLM
- TinyBERT
- MINILLM
- MiniMA
Metrics
- ASR (Attack Success Rate)
- F1
- ROUGE‑L
- human judgment / GPT‑4 feedback
- relative % vs teacher (performance retention)
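Two of the metrics above reduce to simple ratios. The sketch below assumes one common definition of attack success rate (among examples the model gets right on clean input, the fraction it gets wrong after the attack); other definitions exist, and the survey's exact variant is not stated here. Function names and the example values are hypothetical.

```python
def performance_retention(student_score, teacher_score):
    """Student score as a percentage of the teacher score."""
    return 100.0 * student_score / teacher_score

def attack_success_rate(clean_correct, adversarial_correct):
    """Fraction of originally correct examples that become wrong under attack."""
    attacked = [adv for clean, adv in zip(clean_correct, adversarial_correct) if clean]
    if not attacked:
        return 0.0
    return sum(1 for adv in attacked if not adv) / len(attacked)

# Hypothetical per-example correctness flags before and after an attack.
clean = [True, True, False, True]
adversarial = [True, False, False, False]

print(performance_retention(45.0, 50.0))        # student keeps 90% of teacher score
print(attack_success_rate(clean, adversarial))  # 2 of 3 correct examples were flipped
```

Lower attack success rate means a more robust student; comparing it between teacher and student shows whether distillation introduced a robustness regression.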
Datasets
- GLUE
- SQuAD2
- MMLU
- BIGBench
- HELM
- GSM8K
- BBH
- Dolly
- AdvGLUE
- ANLI
- Flipkart reviews
- DDXPlus
- Dolly 1
Benchmarks
- Adversarial GLUE (AdvGLUE)
- ANLI
- MMLU
- GLUE
- BBH
- GSM8K

