DistilDP: use a DP-finetuned teacher to generate private synthetic text and distill a compact student without applying DP twice

March 1, 2024 · 8 min read

Overview

Decision Snapshot: Needs Validation

The method is practical and improves private student utility on three datasets, but the experiments are single-run and require costly DP training of a large teacher. Expect engineering effort and further validation before wide production use.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

James Flemings, Murali Annavaram

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.

Summary TLDR

DistilDP trains a teacher LLM once with DP-SGD, uses that private teacher to generate synthetic text, and then trains a smaller student on the synthetic data while distilling the teacher's output distribution (soft labels). This avoids running DP-SGD on the student, so the privacy budget does not have to be split between two training runs. On three datasets (Yelp, Big Patent, DBpedia) under a strong privacy setting (ε=2), DistilDP lowers student perplexity relative to baselines (e.g., about 9 PPL lower on Big Patent). Important knobs: the mix of supervised and distillation losses (λ≈0.4), temperature t=1, and the size of the synthetic dataset (growing it from 50K to 400K examples lowered perplexity by about 5 PPL). Hidden-state alignment helps further when teacher and student have compatible hidden sizes.
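The mixed objective mentioned above can be sketched per token as follows. This is a minimal illustration, not the authors' implementation: it assumes λ weights the hard-label cross-entropy and (1 − λ) weights the KL term toward the teacher, which the summary does not specify, and the function names `softmax` and `distil_loss` are hypothetical.

```python
import math


def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distil_loss(student_logits, teacher_logits, target, lam=0.4, t=1.0):
    """Per-token loss: lam * CE(hard label) + (1 - lam) * KL(teacher || student).

    ASSUMPTION: which term lam multiplies is not stated in the summary.
    The common t**2 rescaling of the KL term is omitted; it is a no-op at t=1.
    """
    # Supervised term: cross-entropy against the synthetic token (hard label).
    p_student = softmax(student_logits)
    ce = -math.log(p_student[target])

    # Distillation term: KL divergence from teacher to student soft labels.
    q_teacher = softmax(teacher_logits, t)
    p_student_t = softmax(student_logits, t)
    kl = sum(q * math.log(q / p) for q, p in zip(q_teacher, p_student_t) if q > 0)

    return lam * ce + (1.0 - lam) * kl


if __name__ == "__main__":
    # Toy 3-token vocabulary; the synthetic hard label is token 1.
    print(distil_loss([0.2, 1.5, -0.3], [0.0, 2.0, -1.0], target=1))
```

When the student matches the teacher exactly, the KL term vanishes and only the weighted cross-entropy remains, which is a quick sanity check for any real implementation.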

Problem Statement

The goal is to compress large language models for deployment while preserving strong differential privacy. Training with DP and then compressing with standard knowledge distillation compounds utility loss. Running DP-SGD twice (once for the teacher and once for the student) wastes privacy budget and hurts utility. What is needed is a practical method that yields a small private student with less utility loss and manageable compute.
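The budget argument can be made concrete with two standard properties of differential privacy. The sketch below uses basic sequential composition, which is looser than the accountants typically used with DP-SGD, so treat it as an illustration of the direction of the effect rather than the paper's accounting. The symbols ε₁, ε₂, δ₁, δ₂, M and f are introduced here for illustration.

```latex
% Sequential composition: two DP-SGD runs on the same private data
\[
\underbrace{(\varepsilon_1, \delta_1)\text{-DP}}_{\text{teacher DP-SGD}}
\;+\;
\underbrace{(\varepsilon_2, \delta_2)\text{-DP}}_{\text{student DP-SGD}}
\;\Longrightarrow\;
(\varepsilon_1 + \varepsilon_2,\; \delta_1 + \delta_2)\text{-DP overall}
\]

% Post-processing: anything computed only from the DP teacher's outputs
\[
M \text{ is } (\varepsilon, \delta)\text{-DP}
\;\Longrightarrow\;
f \circ M \text{ is } (\varepsilon, \delta)\text{-DP}
\quad \text{for any } f \text{ that does not access the private data}
\]
```

Under the first rule, a fixed total budget must be divided between teacher and student. Under the second, a student trained only on the teacher's synthetic text and output distributions inherits the teacher's guarantee, so the whole budget (ε=2 in the reported experiments) can go to the teacher.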

Main Contribution

DistilDP: a simple pipeline that DP-fine-tunes one teacher, generates DP synthetic text from that teacher, then trains a student on the synthetic data with distillation (soft labels), avoiding a second DP-SGD run.

Empirical evidence that combining synthetic hard labels and teacher output distributions substantially improves student utility under strong privacy (ε=2), including a perplexity reduction of about 9 points on Big Patent.

Key Findings

DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.

Numbers: Big Patent: DistilDP PPL 32.43 vs DP-SGD student 41.8 (−9.37 PPL)

Practical Use: If you need a private compact summarization model, use a DP-finetuned teacher to generate synthetic data and distill its output distribution to the student to recover several points of perplexity under ε=2.

Evidence Ref: Table 2

DistilDP improves student perplexity on Yelp compared to private finetuning.

Numbers: Yelp: DistilDP PPL 44.15 vs DP-SGD student 48.12 (−3.97 PPL)

Practical Use: Expect modest but consistent gains on large review-like datasets by using DistilDP instead of privately fine-tuning the small model directly.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (Yelp test) | 44.15 (DistilDP, DistilGPT2, ε=2) | 48.12 (DP-SGD DistilGPT2, ε=2) | −3.97 | Yelp test | Table 2 (reported PPLs) | Table 2
Perplexity (Big Patent test) | 32.43 (DistilDP, DistilGPT2, ε=2) | 41.8 (DP-SGD DistilGPT2, ε=2) | −9.37 | Big Patent test | Table 2 (reported PPLs) | Table 2
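The deltas in the table can be rechecked, and put on a relative scale, with a few lines of arithmetic. The perplexities are the ones reported above. The dictionary layout and the helper name `ppl_delta` are illustrative only, and the relative figures are derived here, not reported by the paper.

```python
# Reported test perplexities at ε=2 (lower is better).
results = {
    "Yelp": {"distildp": 44.15, "dp_sgd_student": 48.12},
    "Big Patent": {"distildp": 32.43, "dp_sgd_student": 41.8},
}


def ppl_delta(distildp, baseline):
    """Return (absolute delta, relative delta in percent) versus the baseline."""
    absolute = distildp - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


if __name__ == "__main__":
    for name, r in results.items():
        absolute, relative = ppl_delta(r["distildp"], r["dp_sgd_student"])
        print(f"{name}: {absolute:+.2f} PPL ({relative:+.1f}%)")
```

Relative deltas make the two datasets easier to compare, since their baseline perplexities differ.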

What To Try In 7 Days

Fine-tune a pretrained teacher with DP-SGD (allocate privacy budget to teacher only) and generate ~100k+ synthetic examples using control codes.

Train a small student on the synthetic data and distill teacher soft labels with λ≈0.4 and temperature t=1.

If teacher and student share hidden dimensions, add a small MSE loss on last hidden states to squeeze extra performance.
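The hidden-state term in the last step might look like the sketch below. It only applies when the two models' hidden sizes match; for instance, DistilGPT2 and GPT2 both use 768-dimensional hidden states, whereas GPT2-Large uses 1280. The function names and the weight `beta` are hypothetical, since the text above only calls for a "small" MSE loss and does not give a value.

```python
def hidden_state_mse(student_hidden, teacher_hidden):
    """Mean squared error between last hidden states.

    Both arguments are nested lists shaped [num_tokens][hidden_dim].
    """
    if not student_hidden or len(student_hidden) != len(teacher_hidden):
        raise ValueError("need the same, non-zero number of tokens")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_hidden, teacher_hidden):
        if not s_vec or len(s_vec) != len(t_vec):
            raise ValueError("hidden sizes differ; alignment needs matching dimensions")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def add_alignment(base_loss, mse, beta=0.1):
    """Add the alignment term to the distillation loss.

    ASSUMPTION: beta=0.1 is a placeholder weight, not a value from the paper.
    """
    return base_loss + beta * mse


if __name__ == "__main__":
    student = [[0.1, 0.4], [0.0, -0.2]]
    teacher = [[0.2, 0.5], [0.1, -0.1]]
    print(add_alignment(base_loss=3.2, mse=hidden_state_mse(student, teacher)))
```

Raising on a dimension mismatch, instead of silently truncating, makes an incompatible teacher and student pair obvious early. A learned projection between sizes would be a separate design choice that is not covered here.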

Optimization Features

Infra Optimization: Training teacher with DP may require multi-GPU A100 class hardware

Model Optimization: Knowledge distillation to smaller model

System Optimization: Reduce extra DP-SGD run to lower memory and compute

Training Optimization: Apply DP-SGD once to teacher; avoid DP-SGD on student to save privacy budget

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Data URLs

https://www.yelp.com/dataset

Big Patent (Sharma et al., 2019)

DBpedia (Zhang et al., 2015)

Risks & Boundaries

Limitations

Teacher must be trained with DP-SGD, which is computationally and memory intensive for large models.

Small classes (rare control codes) suffer under tight DP guarantees and synthetic generation may poorly represent them.

When Not To Use

When you cannot afford the GPU/time to run DP-SGD on a large teacher.

When the dataset has many tiny classes that DP will distort heavily.

Failure Modes

Student overfits to synthetic data if the synthetic distribution deviates from true private data.

Low-quality DP synthetic generation leads to worse student utility than private finetuning alone.

Core Entities

Models

GPT2-Large, DistilGPT2, GPT2

Metrics

Perplexity
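For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model assigns higher probability to the test text. The sketch below is a generic definition, not the paper's evaluation script. It assumes natural-log token probabilities, and the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over a sequence of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every token probability 1/4 has a perplexity of about 4.
    print(perplexity([math.log(0.25)] * 4))
```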

Datasets

Yelp Open Dataset, Big Patent, DBpedia