DistilDP: use a DP-finetuned teacher to generate private synthetic text and distill a compact student without applying DP twice

Overview

Decision Snapshot: Needs Validation

The method is practical and improves private student utility on three datasets, but the experiments are single-run and the method requires costly DP training of a large teacher, so expect engineering effort and further validation before wide production use.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

James Flemings, Murali Annavaram

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

DistilDP trains a teacher LLM once with DP-SGD, uses that private teacher to generate synthetic text, and then trains a smaller student on the synthetic data while distilling the teacher's output distribution (soft labels). This avoids running DP-SGD on the student, so the privacy budget does not have to be split between two DP training runs. On three datasets (Yelp, Big Patent, DBpedia) with a strong privacy setting (ε=2), DistilDP lowers student perplexity versus baselines (e.g., about 9 PPL lower on Big Patent). Important knobs: the mix of supervised and distillation losses (λ≈0.4), the temperature (t=1), and the synthetic dataset size (growing from 50K to 400K examples lowered perplexity by about 5 PPL). Hidden-state alignment helps further when teacher and student share hidden dimensions.

Problem Statement

Compress large language models for deployment while preserving strong differential privacy. Training with DP and then compressing with standard knowledge distillation compounds utility loss. Running DP-SGD twice (teacher and student) wastes privacy budget and hurts utility. The need: a practical method that gets a small private student with less utility loss and manageable compute.
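The budget-splitting problem can be illustrated with a minimal sketch. It assumes basic sequential composition (epsilons of successive DP runs add up) and an even split between runs; both are simplifications, since practical privacy accountants give tighter bounds and the split need not be even. The function name `split_budget` is hypothetical. Training on synthetic text sampled from an already-private teacher is treated here as post-processing, which consumes no additional budget.

```python
def split_budget(total_epsilon: float, num_dp_runs: int) -> float:
    """Per-run epsilon when a total budget is divided evenly.

    Assumes basic sequential composition: the epsilons of the
    individual DP training runs sum to the total budget.
    """
    if num_dp_runs < 1:
        raise ValueError("need at least one DP training run")
    return total_epsilon / num_dp_runs


# DP-SGD on both teacher and student: each run gets half the budget.
two_runs = split_budget(2.0, num_dp_runs=2)

# DP-SGD on the teacher only: the teacher keeps the full budget,
# and the student is trained on synthetic data without further DP cost.
one_run = split_budget(2.0, num_dp_runs=1)

print(f"epsilon per run, two DP runs: {two_runs}")
print(f"epsilon per run, one DP run:  {one_run}")
```

A smaller per-run epsilon means more noise in each run, which is one way to see why applying DP twice tends to hurt utility.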

Main Contribution

DistilDP: a simple pipeline that DP-fine-tunes one teacher, generates DP synthetic text from that teacher, then trains a student on the synthetic data with distillation (soft labels) — avoiding a second DP-SGD run.

Empirical evidence that combining synthetic hard labels and teacher output distributions substantially improves student utility under strong privacy (ε=2), including a perplexity reduction of about 9 PPL on Big Patent.

Key Findings

DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.

Numbers: Big Patent: DistilDP PPL 32.43 vs DP-SGD student 41.8 (−9.37 PPL)

Practical Use: If you need a private compact summarization model, use a DP-finetuned teacher to generate synthetic data and distill its output distribution to the student to recover several points of perplexity under ε=2.

Evidence Ref: Table 2

DistilDP improves student perplexity on Yelp compared to private finetuning.

Numbers: Yelp: DistilDP PPL 44.15 vs DP-SGD student 48.12 (−3.97 PPL)

Practical Use: Expect modest but consistent gains on large review-like datasets by using DistilDP instead of privately fine-tuning the small model directly.

Evidence Ref: Table 2
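For readers less familiar with the metric behind these findings, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. The sketch below uses the standard definition with natural logarithms; the helper name `perplexity` and the toy inputs are illustrative and are not taken from the paper's code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy case: a model that assigns probability 1/4 to every observed token
# is as uncertain as a uniform choice among four options.
uniform_quarter = [math.log(0.25)] * 4
print(perplexity(uniform_quarter))  # approximately 4
```

Read this way, a drop from 41.8 to 32.43 means the student is, on average, choosing among noticeably fewer equally likely next tokens.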

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (Yelp test)	44.15 (DistilDP, DistilGPT2, ε=2)	48.12 (DP-SGD DistilGPT2, ε=2)	−3.97	Yelp test	Table 2 (reported PPLs)	Table 2
Perplexity (Big Patent test)	32.43 (DistilDP, DistilGPT2, ε=2)	41.8 (DP-SGD DistilGPT2, ε=2)	−9.37	Big Patent test	Table 2 (reported PPLs)	Table 2
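The deltas in the table can be checked with a few lines of arithmetic. The sketch below recomputes the absolute differences from the reported perplexities and also derives a relative change; the relative percentages are computed here from the table values and are not figures reported in the source. The dictionary keys and the function name `ppl_delta` are hypothetical.

```python
# Perplexities as listed in the table above (student: DistilGPT2, epsilon = 2).
results = {
    "Yelp": {"distildp": 44.15, "dp_sgd_student": 48.12},
    "Big Patent": {"distildp": 32.43, "dp_sgd_student": 41.8},
}


def ppl_delta(row):
    """Return (absolute PPL change, relative change in percent) vs the baseline."""
    absolute = row["distildp"] - row["dp_sgd_student"]
    relative_pct = 100.0 * absolute / row["dp_sgd_student"]
    return round(absolute, 2), round(relative_pct, 1)


for dataset, row in results.items():
    absolute, relative_pct = ppl_delta(row)
    print(f"{dataset}: {absolute:+.2f} PPL ({relative_pct:+.1f}% vs DP-SGD student)")
```

In relative terms the improvement is roughly 8% on Yelp and roughly 22% on Big Patent, which makes the gap between the two datasets easier to compare than the raw PPL differences.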

What To Try In 7 Days

Fine-tune a pretrained teacher with DP-SGD (allocate privacy budget to teacher only) and generate ~100k+ synthetic examples using control codes.

Train a small student on the synthetic data and distill teacher soft labels with λ≈0.4 and temperature t=1.

If teacher and student share hidden dimensions, add a small MSE loss on last hidden states to squeeze extra performance.
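The loss mix described in these steps can be sketched for a single token position as follows. This is a dependency-free illustration, not the authors' implementation: it assumes λ weights the supervised (hard-label) cross-entropy and 1 − λ weights the KL divergence to the teacher's distribution, which the summary does not specify, and all function names are hypothetical. A real training loop would use a tensor library and batch over sequences.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Hard-label loss on the synthetic token."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student), the soft-label distillation term."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      lam=0.4, temperature=1.0):
    """ASSUMPTION: lam weights the hard loss, (1 - lam) the soft loss."""
    hard = cross_entropy(student_logits, target_index)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return lam * hard + (1.0 - lam) * soft


def hidden_state_mse(student_hidden, teacher_hidden):
    """Optional alignment term; only defined when hidden sizes match."""
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("hidden dimensions must match")
    return sum((s - t) ** 2
               for s, t in zip(student_hidden, teacher_hidden)) / len(student_hidden)


loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], target_index=0)
print(f"combined loss: {loss:.4f}")
```

When student and teacher logits coincide, the KL term vanishes and only the weighted hard-label loss remains, which is a quick sanity check for any implementation of this mix.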

Optimization Features

Infra Optimization

Training the teacher with DP may require multi-GPU, A100-class hardware

Model Optimization

Knowledge distillation to smaller model

System Optimization

Eliminate the extra DP-SGD run on the student to lower memory and compute

Training Optimization

Apply DP-SGD once to teacher; avoid DP-SGD on student to save privacy budget
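As background on why the single DP-SGD run is the expensive part, here is a minimal sketch of one DP-SGD gradient computation in its generic form: clip each per-example gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, and average. The function names and toy gradients are hypothetical, and this is not the configuration used in the paper; production code would rely on a DP training library and a privacy accountant.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per example, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise std is tied to the clip norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_example_grads) for n in noisy]


rng = random.Random(0)
toy_grads = [[3.0, 4.0], [0.0, 1.0]]  # one gradient per training example
print(dp_sgd_gradient(toy_grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Because clipping needs a separate gradient for every example rather than one aggregated batch gradient, memory and compute grow with model size, which is why skipping this step for the student is attractive.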

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/james-flemings/dp_compress

Data URLs

https://www.yelp.com/dataset

Big Patent (Sharma et al., 2019)

DBpedia (Zhang et al., 2015)

Risks & Boundaries

Limitations

Teacher must be trained with DP-SGD, which is computationally and memory intensive for large models.

Small classes (rare control codes) suffer under tight DP guarantees and synthetic generation may poorly represent them.

When Not To Use

When you cannot afford the GPU/time to run DP-SGD on a large teacher.

When the dataset has many tiny classes that DP will distort heavily.

Failure Modes

Student overfits to synthetic data if the synthetic distribution deviates from true private data.

Low-quality DP synthetic generation leads to worse student utility than private finetuning alone.

Core Entities

Models

GPT2-Large, DistilGPT2, GPT2

Metrics

Perplexity

Datasets

Yelp Open Dataset, Big Patent, DBpedia
