Overview
The method is practical and improves private student utility on three datasets. However, the experiments are single-run and the approach requires costly DP training of a large teacher, so expect engineering effort and further validation before wide production use.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.
Summary TLDR
DistilDP trains a teacher LLM once with DP-SGD, uses that private teacher to generate synthetic text, and then trains a smaller student on the synthetic data while distilling the teacher's output distribution (soft labels). This avoids running DP-SGD on the student, so the privacy budget does not have to be split between two training runs. On three datasets (Yelp, Big Patent, DBpedia) under a strong privacy setting (ε=2), DistilDP lowers student perplexity (PPL) versus baselines, e.g., by ~9 PPL on Big Patent. Important knobs: mix the supervised and distillation losses (λ≈0.4), use temperature t=1, and generate a large synthetic dataset (scaling from 50K to 400K examples lowered perplexity by ~5 PPL). Hidden-state alignment helps further when teacher and student share hidden dimensions.
Problem Statement
Compress large language models for deployment while preserving strong differential privacy. Training with DP and then compressing with standard knowledge distillation compounds utility loss. Running DP-SGD twice (teacher and student) wastes privacy budget and hurts utility. The need: a practical method that gets a small private student with less utility loss and manageable compute.
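The budget-splitting cost can be made concrete with two standard differential-privacy facts. This is a simplified sketch: it uses basic sequential composition, whereas DP-SGD implementations typically rely on tighter privacy accountants, and the symbols M_T, M_S, ε_T, ε_S are notation introduced here for illustration, not taken from the paper.

```latex
% Basic sequential composition: two DP training runs on the same private data D
M_T \text{ is } (\varepsilon_T, \delta_T)\text{-DP},\quad
M_S \text{ is } (\varepsilon_S, \delta_S)\text{-DP}
\;\Longrightarrow\;
(M_T, M_S) \text{ is } (\varepsilon_T + \varepsilon_S,\ \delta_T + \delta_S)\text{-DP}.

% Post-processing: any function f of a DP output that never touches D again
M_T \text{ is } (\varepsilon, \delta)\text{-DP}
\;\Longrightarrow\;
f \circ M_T \text{ is } (\varepsilon, \delta)\text{-DP}.
```

Under this simplified view, a total budget of ε=2 spent on two DP-SGD runs must satisfy ε_T + ε_S ≤ 2, so each run gets a tighter budget and more noise. If only the teacher touches the private data, everything computed from the teacher afterwards counts as post-processing, and the teacher can use the full budget.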
Main Contribution
DistilDP: a simple pipeline that DP-fine-tunes one teacher, generates DP synthetic text from that teacher, then trains a student on the synthetic data with distillation (soft labels), which avoids a second DP-SGD run.
Empirical evidence that combining synthetic hard labels and teacher output distributions substantially improves student utility under strong privacy (ε=2), including a ~9 PPL reduction on Big Patent.
Key Findings
DistilDP reduces Big Patent test perplexity from 41.8 to 32.43 (−9.37) versus DP-SGD fine-tuning of the same DistilGPT2 student at ε=2.
DistilDP improves Yelp test perplexity from 48.12 to 44.15 (−3.97) versus DP-SGD fine-tuning of the same DistilGPT2 student at ε=2.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (Yelp test) | 44.15 (DistilDP, DistilGPT2, ε=2) | 48.12 (DP-SGD DistilGPT2, ε=2) | −3.97 | Yelp test | Table 2 (reported PPLs) | Table 2 |
| Perplexity (Big Patent test) | 32.43 (DistilDP, DistilGPT2, ε=2) | 41.8 (DP-SGD DistilGPT2, ε=2) | −9.37 | Big Patent test | Table 2 (reported PPLs) | Table 2 |
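The deltas in the table can be re-derived, together with the relative change, using a few lines of Python. This is only a convenience sketch: the perplexities are copied from the rows above, while the helper name `ppl_delta` and the dictionary layout are illustrative choices, not part of the paper.

```python
# Perplexities copied from the results table (DistilGPT2 student, epsilon = 2).
results = {
    "Yelp": {"distildp": 44.15, "dp_sgd": 48.12},
    "Big Patent": {"distildp": 32.43, "dp_sgd": 41.8},
}


def ppl_delta(distildp, dp_sgd):
    """Return (absolute delta, relative change in percent); negative means lower PPL."""
    delta = distildp - dp_sgd
    return round(delta, 2), round(100.0 * delta / dp_sgd, 1)


for name, row in results.items():
    delta, pct = ppl_delta(row["distildp"], row["dp_sgd"])
    print(f"{name}: delta = {delta} PPL ({pct}% relative)")
```

Lower perplexity is better, so a negative delta means DistilDP beats the DP-SGD baseline on that dataset.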
What To Try In 7 Days
Fine-tune a pretrained teacher with DP-SGD (allocate the entire privacy budget to the teacher) and generate 100K or more synthetic examples using control codes.
Train a small student on the synthetic data and distill teacher soft labels with λ≈0.4 and temperature t=1.
If teacher and student share hidden dimensions, add a small MSE loss on last hidden states to squeeze extra performance.
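The loss mixing in the steps above can be sketched in plain Python for a single token position. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes λ weights the distillation term and (1 − λ) the hard-label term (the summary does not say which way the weighting goes), it uses KL divergence as the distillation loss, and all function names and the `hidden_weight` parameter are hypothetical.

```python
import math


def softmax(logits, t=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_id):
    """Hard-label loss: negative log-probability of the synthetic next token."""
    return -math.log(softmax(student_logits)[target_id])


def kl_divergence(teacher_logits, student_logits, t=1.0):
    """Soft-label loss: KL(teacher || student) at temperature t."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def hidden_mse(teacher_hidden, student_hidden):
    """Optional alignment term; only defined when both vectors have the same size."""
    if len(teacher_hidden) != len(student_hidden):
        raise ValueError("hidden sizes differ; skip this term")
    n = len(teacher_hidden)
    return sum((a - b) ** 2 for a, b in zip(teacher_hidden, student_hidden)) / n


def distil_loss(student_logits, teacher_logits, target_id, lam=0.4, t=1.0,
                hidden_pair=None, hidden_weight=0.0):
    """ASSUMPTION: lam weights the distillation term, (1 - lam) the hard-label term."""
    loss = (1.0 - lam) * cross_entropy(student_logits, target_id)
    loss += lam * kl_divergence(teacher_logits, student_logits, t)
    if hidden_pair is not None:
        # hidden_pair is a (teacher_hidden, student_hidden) tuple of equal length.
        loss += hidden_weight * hidden_mse(*hidden_pair)
    return loss


# Toy logits over a 3-token vocabulary; the values are made up for illustration.
print(distil_loss([2.0, 0.5, -1.0], [2.5, 0.3, -1.2], target_id=0))
```

When the student's logits match the teacher's, the KL term vanishes and only the weighted hard-label term remains. In a real training loop the same terms would be computed per token with a tensor library and averaged over the batch.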
Risks & Boundaries
Limitations
Teacher must be trained with DP-SGD, which is computationally and memory intensive for large models.
Small classes (rare control codes) suffer under tight DP guarantees and synthetic generation may poorly represent them.
When Not To Use
When you cannot afford the GPU/time to run DP-SGD on a large teacher.
When the dataset has many tiny classes that DP will distort heavily.
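One quick way to check the tiny-class concern before committing to the pipeline is to count examples per control code. The sketch below is illustrative only: the `min_count` threshold of 1000 is an arbitrary placeholder rather than a value from the paper, and the label names and counts are made-up toy data.

```python
from collections import Counter


def rare_control_codes(labels, min_count=1000):
    """Return {control_code: count} for classes with fewer than min_count examples.

    min_count is a hypothetical threshold; pick one that suits your data and budget.
    """
    counts = Counter(labels)
    return {code: n for code, n in sorted(counts.items()) if n < min_count}


# Toy label list (made up): one class is far smaller than the others.
labels = ["Restaurants"] * 5000 + ["Hotels"] * 1200 + ["Pets"] * 40
print(rare_control_codes(labels))
```

If many classes fall below the chosen threshold, the limitation above suggests their synthetic examples may be poorly represented under a tight privacy budget.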
Failure Modes
Student overfits to synthetic data if the synthetic distribution deviates from true private data.
Low-quality DP synthetic generation can leave the student with worse utility than private fine-tuning alone.

