DistilDP: use a DP-finetuned teacher to generate private synthetic text and distill a compact student without applying DP twice

March 1, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

James Flemings, Murali Annavaram

Links

Abstract / PDF

Why It Matters For Business

DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.

Summary TLDR

DistilDP trains a teacher LLM once with DP-SGD, uses that private teacher to generate synthetic text, and then trains a smaller student on the synthetic data while distilling the teacher's output distribution (soft labels). This avoids running DP-SGD on the student, so the privacy budget is not split between two training runs. On three datasets (Yelp, Big Patent, DBpedia) under a strong privacy setting (ε=2), DistilDP lowers student perplexity (PPL) relative to baselines, for example by about 9 points on Big Patent. Important knobs: mix the supervised and distillation losses (λ≈0.4), use temperature t=1, and generate large synthetic datasets (going from 50K to 400K samples lowered PPL by about 5 points). Hidden-state alignment helps further when teacher and student have compatible hidden sizes.

Problem Statement

The goal is to compress large language models for deployment while preserving strong differential privacy. Training with DP and then compressing with standard knowledge distillation compounds utility loss, and running DP-SGD twice (once for the teacher, once for the student) splits the privacy budget across two runs and hurts utility. What is needed is a practical method that produces a small private student with less utility loss and manageable compute.
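The budget argument rests on two standard DP facts, basic sequential composition and post-processing, written out below as a sketch. The symbols ($\mathcal{M}_T$ for the DP-SGD teacher training, $g$ for everything downstream, $D$ for the private dataset) are introduced here for illustration and are not notation from the paper.

```latex
% Two DP-SGD runs on the same private data D compose (basic composition):
\[
(\varepsilon_T,\delta_T)\text{-DP teacher}
\;+\;
(\varepsilon_S,\delta_S)\text{-DP student}
\;\Longrightarrow\;
(\varepsilon_T+\varepsilon_S,\ \delta_T+\delta_S)\text{-DP overall}.
\]
% Post-processing: anything computed only from a DP output stays DP
% with the same parameters, as long as it never touches D again.
\[
\mathcal{M}_T \text{ is } (\varepsilon,\delta)\text{-DP}
\;\Longrightarrow\;
g\circ\mathcal{M}_T \text{ is } (\varepsilon,\delta)\text{-DP}
\quad\text{for any } g \text{ that does not access } D .
\]
```

Under a fixed total budget, the first line means each of two DP-SGD runs gets only part of ε. In the single-run setup, $g$ covers synthetic generation and student training, so the whole budget can go to the teacher. The argument only holds for the parts of $g$ that truly avoid $D$; a control-code distribution taken from the private data would be an exception.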

Main Contribution

DistilDP: a simple pipeline that DP-fine-tunes one teacher, generates DP synthetic text from that teacher, then trains a student on the synthetic data with distillation (soft labels), avoiding a second DP-SGD run.

Empirical evidence that combining synthetic hard labels and teacher output distributions substantially improves student utility under strong privacy (ε=2), including a reduction of about 9 PPL on Big Patent.

A practical ablation showing best practices (λ≈0.4, temperature t=1, more synthetic data helps) and an optional hidden-representation alignment step when architectures match.

Key Findings

DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.

Numbers: Big Patent, DistilDP PPL 32.43 vs DP-SGD student 41.8 (−9.37 PPL)

DistilDP improves student perplexity on Yelp compared to private fine-tuning.

Numbers: Yelp, DistilDP PPL 44.15 vs DP-SGD student 48.12 (−3.97 PPL)

Combining soft-label distillation and supervised loss is crucial; λ≈0.4 performs best and temperature t=1 is preferred.

Numbers: λ sweep, best near 0.4; t ∈ {1,2,5}, with t=1 giving the best PPL
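The mixed objective can be sketched for a single token position in plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the summary does not say which term λ multiplies, so the sketch assumes λ weights the distillation (soft-label) term and 1−λ the supervised (hard-label) term.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(student_logits, target_id):
    """Supervised loss against the synthetic (hard-label) token."""
    return -math.log(softmax(student_logits)[target_id])

def kl_divergence(teacher_logits, student_logits, t=1.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_loss(student_logits, teacher_logits, target_id, lam=0.4, t=1.0):
    # ASSUMPTION: lam weights the distillation term; the source does not specify.
    # Many distillation recipes also scale the KL term by t**2; with t=1 this is a no-op.
    ce = cross_entropy(student_logits, target_id)
    kd = kl_divergence(teacher_logits, student_logits, t)
    return (1.0 - lam) * ce + lam * kd

student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
print(mixed_loss(student, teacher, target_id=0))
```

Setting `lam=0.0` recovers plain supervised training on the synthetic text, and `lam=1.0` trains on the teacher's distribution alone, which makes the λ sweep easy to reproduce on top of this sketch.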

More synthetic data lowers perplexity; scaling synthetic samples from 50K to 400K improved PPL by about 5 points.

Numbers: 50K→400K synthetic samples, ~5 PPL improvement

Aligning hidden representations narrows the student-teacher gap when architectures match.

Numbers: Big Patent, student PPL 37.17 with MSE loss vs teacher 31.41 (gap ≈6 PPL)
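A hidden-state alignment term of this kind can be sketched as a mean squared error over matching positions and dimensions. The helper names and the weight `beta` below are hypothetical; the summary reports only that an MSE loss on hidden states was used, not how it was weighted.

```python
def hidden_state_mse(student_hidden, teacher_hidden):
    """MSE between two [seq_len][hidden_dim] nested lists of floats."""
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("sequence lengths differ")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_hidden, teacher_hidden):
        if len(s_vec) != len(t_vec):
            # Alignment without a projection layer needs equal hidden sizes.
            raise ValueError("hidden dimensions differ")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty hidden states")
    return total / count

def add_alignment(base_loss, student_hidden, teacher_hidden, beta=0.1):
    # ASSUMPTION: beta is an illustrative weight, not a value from the paper.
    return base_loss + beta * hidden_state_mse(student_hidden, teacher_hidden)

print(hidden_state_mse([[1.0, 2.0]], [[1.0, 4.0]]))
```

The explicit dimension check mirrors the constraint in the finding: without an extra projection, the term is only defined when student and teacher hidden sizes match.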

Results

Perplexity (Yelp test)

Value: 44.15 (DistilDP, DistilGPT2, ε=2)

Baseline: 48.12 (DP-SGD DistilGPT2, ε=2)

Perplexity (Big Patent test)

Value: 32.43 (DistilDP, DistilGPT2, ε=2)

Baseline: 41.8 (DP-SGD DistilGPT2, ε=2)

Perplexity (DBpedia test)

Value: 49.11 (DistilDP, DistilGPT2, ε=2)

Baseline: 60.81 (DP-SGD DistilGPT2, ε=2)
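To compare the three datasets on one scale, the absolute and relative perplexity reductions can be computed directly from the values listed above. The snippet uses only those reported numbers; the helper name is illustrative.

```python
# (DistilDP PPL, DP-SGD baseline PPL), both DistilGPT2 at epsilon = 2
results = {
    "Yelp": (44.15, 48.12),
    "Big Patent": (32.43, 41.8),
    "DBpedia": (49.11, 60.81),
}

def ppl_reduction(distildp, baseline):
    """Return (absolute, relative) perplexity reduction versus the baseline."""
    absolute = baseline - distildp
    return absolute, absolute / baseline

for name, (distildp, baseline) in results.items():
    absolute, relative = ppl_reduction(distildp, baseline)
    print(f"{name}: -{absolute:.2f} PPL ({relative:.1%} lower)")
```

In relative terms the reductions are roughly 8% on Yelp, 22% on Big Patent, and 19% on DBpedia.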

Who Should Care

What To Try In 7 Days

Fine-tune a pretrained teacher with DP-SGD (allocate the entire privacy budget to the teacher) and generate 100K or more synthetic examples using control codes.

Train a small student on the synthetic data and distill teacher soft labels with λ≈0.4 and temperature t=1.

If teacher and student share hidden dimensions, add a small MSE loss on last hidden states to squeeze extra performance.
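For the generation step, one way to prepare control-code prompts is to sample codes in proportion to their frequency and prepend each to an empty continuation. The sketch below is an assumption-laden illustration: the prompt format, the example codes, and the function name are all hypothetical, and the resulting prompts would still need to be passed to the DP-fine-tuned teacher's own generation routine.

```python
import random
from collections import Counter

def build_prompts(control_codes, n_samples, seed=0):
    """Sample control codes by empirical frequency and format them as prompts."""
    counts = Counter(control_codes)
    codes = sorted(counts)  # sorted for reproducibility
    weights = [counts[c] for c in codes]
    rng = random.Random(seed)
    sampled = rng.choices(codes, weights=weights, k=n_samples)
    # ASSUMPTION: "code + blank line" is an illustrative prompt format only.
    return [f"{code}\n\n" for code in sampled]

# Hypothetical control codes; real ones depend on the dataset's metadata.
observed = ["Category: Food"] * 3 + ["Category: Travel"]
prompts = build_prompts(observed, n_samples=5)
print(prompts)
```

Sampling by empirical frequency uses the private label distribution, which is the control-code leakage the summary lists under its risks; a uniform or publicly known distribution avoids that at the cost of fidelity.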

Optimization Features

Infra Optimization

  • Training the teacher with DP may require multi-GPU, A100-class hardware

Model Optimization

  • Knowledge distillation to smaller model

System Optimization

  • Reduce extra DP-SGD run to lower memory and compute

Training Optimization

  • Apply DP-SGD once to teacher; avoid DP-SGD on student to save privacy budget
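For readers unfamiliar with why DP-SGD is the costly step, here is a minimal sketch of one generic DP-SGD update in plain Python: clip each example's gradient, sum, add Gaussian noise, then average. It illustrates the textbook mechanism, not the authors' training code; all names and hyperparameters are illustrative, and privacy accounting (turning the noise multiplier and step count into ε) is out of scope.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat list of parameters."""
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        # Clip each example's gradient to L2 norm at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Gaussian noise calibrated to the clipping bound, added to the sum.
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

rng = random.Random(0)
params = [0.0, 0.0]
grads = [[3.0, 4.0], [0.1, -0.2]]
print(dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

The per-example gradients are what drive the memory cost: a normal training step only needs the batch-averaged gradient, while this one needs a separate gradient for every example before clipping.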

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Teacher must be trained with DP-SGD, which is computationally and memory intensive for large models.
  • Small classes (rare control codes) suffer under tight DP guarantees and synthetic generation may poorly represent them.
  • The approach assumes pretrained public models and control codes; public pretraining itself can leak private info.

When Not To Use

  • When you cannot afford the GPU/time to run DP-SGD on a large teacher.
  • When the dataset has many tiny classes that DP will distort heavily.
  • When you require formal privacy budget accounting that covers the control-code distribution (the authors treat the control-code privacy loss as small and do not account for it).

Failure Modes

  • Student overfits to synthetic data if the synthetic distribution deviates from true private data.
  • Low-quality DP synthetic generation leads to worse student utility than private fine-tuning alone.
  • Control codes or their distribution can leak private categorical signals if not handled carefully.

Core Entities

Models

  • GPT2-Large
  • DistilGPT2
  • GPT2

Metrics

  • Perplexity

Datasets

  • Yelp Open Dataset
  • Big Patent
  • DBpedia