Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.
Summary TLDR
DistilDP trains a teacher LLM once with DP-SGD, uses that private teacher to generate synthetic text, and then trains a smaller student on the synthetic data while distilling the teacher's output distribution (soft labels). This avoids running DP-SGD on the student, so the privacy budget is not split across two training runs. On three datasets (Yelp, Big Patent, DBpedia) under a strong privacy setting (ε=2), DistilDP lowers student perplexity relative to baselines (e.g., by about 9 points on Big Patent). Important knobs: mix the supervised and distillation losses (λ≈0.4), use temperature t=1, and generate large synthetic datasets (growing from 50K to 400K samples lowered perplexity by about 5 points). Hidden-state alignment helps further when teacher and student share compatible sizes.
Problem Statement
Compress large language models for deployment while preserving strong differential privacy. Training with DP and then compressing with standard knowledge distillation compounds utility loss. Running DP-SGD twice (teacher and student) wastes privacy budget and hurts utility. The need: a practical method that gets a small private student with less utility loss and manageable compute.
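To illustrate why a second DP-SGD run is costly, the sketch below assumes basic sequential composition, under which the epsilons of DP mechanisms run on the same data add up. The function name per_run_budget and the even split are illustrative assumptions, not taken from the paper; practical DP-SGD accounting typically uses tighter accountants, so real numbers will differ.

```python
def per_run_budget(total_epsilon: float, num_dp_runs: int) -> float:
    """Even split of a total epsilon across sequential DP runs on the same data.

    Assumes basic sequential composition (epsilons simply add up), which is a
    simplification of how DP-SGD budgets are accounted in practice.
    """
    if num_dp_runs < 1:
        raise ValueError("need at least one DP run")
    return total_epsilon / num_dp_runs


# Total budget epsilon = 2, as in the setting described above.
teacher_and_student = per_run_budget(2.0, 2)  # DP-SGD on both models
teacher_only = per_run_budget(2.0, 1)         # DP-SGD on the teacher only
print(teacher_and_student, teacher_only)
```

Under this simplified accounting, training only the teacher with DP-SGD lets it use the full budget. Sampling synthetic text from the DP teacher and training the student on it is post-processing of a DP output, so it consumes no further budget.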
Main Contribution
DistilDP: a simple pipeline that DP-fine-tunes one teacher, generates DP synthetic text from that teacher, then trains a student on the synthetic data with distillation (soft labels) — avoiding a second DP-SGD run.
Empirical evidence that combining synthetic hard labels and teacher output distributions substantially improves student utility under strong privacy (ε=2), including a perplexity reduction of about 9 points on Big Patent.
A practical ablation showing best practices (λ≈0.4, temperature t=1, more synthetic data helps) and an optional hidden-representation alignment step when architectures match.
Key Findings
DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.
DistilDP improves student perplexity on Yelp compared to private fine-tuning.
Combining soft-label distillation and supervised loss is crucial; λ≈0.4 performs best and temperature t=1 is preferred.
More synthetic data lowers perplexity; scaling synthetic samples from 50K to 400K improved PPL by about 5 points.
Aligning hidden representations narrows the student-teacher gap when architectures match.
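The mixed objective described in the findings above can be sketched for a single token position as follows. This is a minimal illustration, not the authors' implementation: it assumes that λ weights the soft-label (KL) term and (1 - λ) weights the hard-label cross-entropy, which the summary does not specify, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Hard-label loss: negative log-probability of the synthetic token."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """Soft-label loss: KL(teacher || student) at the given temperature."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      lam=0.4, temperature=1.0):
    """Assumed mixing: (1 - lam) * hard-label CE + lam * soft-label KL."""
    hard = cross_entropy(student_logits, target_index)
    soft = kl_divergence(teacher_logits, student_logits, temperature)
    return (1.0 - lam) * hard + lam * soft


# Toy three-token vocabulary; the synthetic text says the next token is index 0.
print(distillation_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0], target_index=0))
```

With temperature 1 the teacher distribution is used unsoftened, matching the reported preference for t=1. When student and teacher logits agree, the KL term vanishes and only the weighted cross-entropy remains.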
Results
Perplexity (Yelp test)
Perplexity (Big Patent test)
Perplexity (DBpedia test)
Who Should Care
What To Try In 7 Days
Fine-tune a pretrained teacher with DP-SGD (allocate privacy budget to teacher only) and generate ~100k+ synthetic examples using control codes.
Train a small student on the synthetic data and distill teacher soft labels with λ≈0.4 and temperature t=1.
If teacher and student share hidden dimensions, add a small MSE loss on last hidden states to squeeze extra performance.
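The optional alignment term from the last step might look like the sketch below. It assumes hidden states are given as nested lists of shape [tokens][hidden_dim] and that the alignment term is simply added to the distillation loss with a small weight; the name align_weight and its default of 0.1 are hypothetical and not taken from the paper.

```python
def hidden_state_mse(student_hidden, teacher_hidden):
    """Mean squared error between two [tokens][hidden_dim] nested lists."""
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("sequence lengths differ")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_hidden, teacher_hidden):
        if len(s_vec) != len(t_vec):
            # Alignment is only defined when the hidden dimensions match.
            raise ValueError("hidden dimensions differ")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty hidden states")
    return total / count


def total_loss(distill_loss, student_hidden, teacher_hidden, align_weight=0.1):
    """Distillation loss plus a small weighted alignment term (assumed form)."""
    return distill_loss + align_weight * hidden_state_mse(student_hidden,
                                                          teacher_hidden)


# One token, hidden size 2: only the second coordinate differs.
print(hidden_state_mse([[1.0, 2.0]], [[1.0, 4.0]]))
```

The dimension check reflects the condition stated above: without matching hidden dimensions the element-wise comparison is not defined, so the term has to be skipped or replaced by something else.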
Optimization Features
Infra Optimization
- Training the teacher with DP may require multi-GPU, A100-class hardware
Model Optimization
- Knowledge distillation to smaller model
System Optimization
- Eliminate the second DP-SGD run (on the student) to lower memory and compute
Training Optimization
- Apply DP-SGD once to teacher; avoid DP-SGD on student to save privacy budget
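For readers unfamiliar with the DP-SGD step applied to the teacher, here is a minimal sketch of one update in its standard form: clip each per-example gradient, sum, add Gaussian noise scaled to the clipping norm, then average. It uses plain Python lists rather than a real training framework, and the function and parameter names are illustrative assumptions.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier,
                rng=random):
    """One DP-SGD update: clip per example, sum, add noise, average, step."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm  # noise std scales with the clip norm
    batch_size = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Two examples, two parameters; the first gradient (norm 5) gets clipped to norm 1.
new_params = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.0, 1.0]],
                         lr=1.0, max_norm=1.0, noise_multiplier=1.0)
print(new_params)
```

Per-example gradients are the main reason DP-SGD is memory-hungry for large models, which is why avoiding a second run on the student is worth the effort.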
Reproducibility
Data Urls
- https://www.yelp.com/dataset
- Big Patent (Sharma et al., 2019)
- DBpedia (Zhang et al., 2015)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Teacher must be trained with DP-SGD, which is computationally and memory intensive for large models.
- Small classes (rare control codes) suffer under tight DP guarantees and synthetic generation may poorly represent them.
- The approach assumes pretrained public models and control codes; public pretraining itself can leak private info.
When Not To Use
- When you cannot afford the GPU/time to run DP-SGD on a large teacher.
- When the dataset has many tiny classes that DP will distort heavily.
- When you require formal privacy budget accounting that must include control-code distribution handling (authors ignored small control-code privacy loss).
Failure Modes
- Student overfits to synthetic data if the synthetic distribution deviates from true private data.
- Low-quality DP synthetic generation leads to worse student utility than private fine-tuning alone.
- Control codes or their distribution can leak private categorical signals if not handled carefully.
Core Entities
Models
- GPT2-Large
- DistilGPT2
- GPT2
Metrics
- Perplexity
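As a reminder of what the metric measures, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The helper below is a small sketch that assumes natural-log token probabilities are already available from the model; the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token has perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```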
Datasets
- Yelp Open Dataset
- Big Patent
- DBpedia

