Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can cut data storage and training compute by replacing large datasets with a small set of weighted synthetic images decoded from latent clusters, while preserving accuracy when you use a good latent diffusion prior.
Summary TLDR
The paper recasts latent-space dataset distillation as optimal quantization. It proves that clustering a low-dimensional diffusion latent and decoding the cluster centers yields synthetic data that converges to the true distribution as the number of quantization points K increases (rate O(K^{-1/d}), where d is the latent dimension). It introduces DDOQ: encode images with a pretrained latent diffusion model, run k-means with automatically computed per-cluster weights, decode the centers to images, and train using weighted samples and soft labels. Empirically on ImageNet-1K and subsets, DDOQ reduces latent Wasserstein-2 distance by roughly 15–16% vs prior latent clustering (D4M) and improves top-1 accuracy substantially when using a stronger DiT backbone (e.g., ResNet-18: 3
Problem Statement
Training modern models needs lots of images and compute. Dataset distillation tries to replace big datasets with a much smaller synthetic set. Existing bi-level distillation is costly and hard to scale. Disentangled (latent) methods work well in practice but lacked a formal consistency guarantee. This paper asks: can we justify and improve latent clustering approaches, and produce synthetic data that provably approximates the true data distribution when decoded through diffusion models?
Main Contribution
Theoretical link: shows that latent clustering is an instance of optimal quantization and proves pushforward consistency through diffusion (Theorem 1; Corollary 1).
Algorithm DDOQ: per-class latent k-means with automatically computed cluster weights, followed by decoding via latent diffusion and training on the weighted synthetic data.
Empirical gains on ImageNet-1K and subsets when using the same diffusion backbone; larger gains with a stronger DiT backbone, plus improved cross-architecture generalization.
Practical recipe with constant memory scaling in images-per-class (IPC) and a small extra cost for synthesis; includes an ablation on weighting strategies.
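For orientation, the quantization problem behind the theoretical contribution can be written in standard notation. This is a simplified sketch, not the paper's exact statement: the symbols (mu for the latent distribution, x_k for centers, w_k for weights, V_k for Voronoi cells, D for a Lipschitz map with constant L) are assumptions introduced here for illustration, and the paper's actual pushforward argument goes through the diffusion process rather than a single Lipschitz map.

```latex
% Assumed notation: \mu is the latent data distribution on \mathbb{R}^d
% (compactly supported), x_1..x_K are cluster centers, and w lies in the
% probability simplex \Delta_K.
\[
  e_K(\mu) \;=\; \min_{x_1,\dots,x_K,\; w \in \Delta_K}
  W_2\Big(\mu,\ \sum_{k=1}^{K} w_k\,\delta_{x_k}\Big)
  \;=\; O\big(K^{-1/d}\big).
\]
% For fixed centers, the best weights are the Voronoi cell masses
% (ties on cell boundaries broken arbitrarily):
\[
  w_k = \mu(V_k), \qquad
  V_k = \{\, z : \|z - x_k\| \le \|z - x_j\| \ \text{for all } j \,\}.
\]
% Simplified pushforward step, assuming an L-Lipschitz map D:
\[
  W_2\big(D_{\#}\mu,\ D_{\#}\nu\big) \;\le\; L\, W_2(\mu, \nu).
\]
```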
Key Findings
Optimal quantization in latent space pushes forward to consistent approximations in image space.
Adding per-cluster weights cuts latent Wasserstein-2 error vs uniform barycenters.
Stronger diffusion backbones amplify distillation gains at low IPC.
DDOQ improves final classification compared to prior disentangled methods.
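The effect of per-cluster weights on the Wasserstein-2 error can be illustrated on a toy one-dimensional example. The sketch below is not the paper's evaluation code: the function name `w2_squared_1d`, the synthetic two-cluster data, and the 90/10 split are all assumptions chosen for illustration. It relies on the fact that in one dimension the optimal transport plan matches mass in sorted order.

```python
import numpy as np

def w2_squared_1d(x, a, y, b):
    """Squared Wasserstein-2 distance between two discrete 1-D measures.

    x, y are support points; a, b are nonnegative weights summing to 1.
    In 1-D the optimal plan moves mass monotonically (sorted order).
    """
    ix, iy = np.argsort(x), np.argsort(y)
    x, a = np.asarray(x, float)[ix], np.asarray(a, float)[ix]
    y, b = np.asarray(y, float)[iy], np.asarray(b, float)[iy]
    i = j = 0
    ra, rb = a[0], b[0]          # mass still to be shipped from x[i] / to y[j]
    cost = 0.0
    while True:
        m = min(ra, rb)          # ship as much as both sides allow
        cost += m * (x[i] - y[j]) ** 2
        ra -= m
        rb -= m
        if ra <= 1e-12:          # source point exhausted, move to the next
            i += 1
            if i == len(x):
                break
            ra = a[i]
        if rb <= 1e-12:          # target point full, move to the next
            j += 1
            if j == len(y):
                break
            rb = b[j]
    return cost

# Toy latents: an unbalanced mixture (90% near 0, 10% near 5).
rng = np.random.default_rng(0)
latents = np.concatenate([rng.normal(0.0, 0.1, 900), rng.normal(5.0, 0.1, 100)])
centers = np.array([latents[:900].mean(), latents[900:].mean()])
empirical = np.full(latents.size, 1.0 / latents.size)

w2_weighted = w2_squared_1d(latents, empirical, centers, np.array([0.9, 0.1]))
w2_uniform = w2_squared_1d(latents, empirical, centers, np.array([0.5, 0.5]))
print(f"weighted: {w2_weighted:.4f}  uniform: {w2_uniform:.4f}")
```

With uniform weights, a large share of the mass near 0 has to be transported to the far center, which is why count-based weights give the smaller error in this toy setting.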
Results
Latent Wasserstein-2 (vs encoded training latents)
Accuracy
Who Should Care
What To Try In 7 Days
Encode a small labelled subset with a pretrained LDM encoder and run k-means per class (K = desired IPC).
Decode the K centers with your diffusion decoder and assign weights from cluster counts to each synthetic image.
Train a student model with soft labels and a weighted loss; compare Top-1 accuracy vs baseline small subsets and monitor latent W2 distance.
If available, swap in a stronger laten
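The clustering and weighting steps above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `kmeans_with_weights` and `distill_latents` are hypothetical names, the latents are random stand-ins for the output of an LDM encoder, and the decode step is left as a comment because it depends on the diffusion model you use.

```python
import numpy as np

def kmeans_with_weights(z, k, iters=50, seed=0):
    """Lloyd's k-means on latents z (n, d); returns centers and count-based weights."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            members = z[assign == j]
            if len(members):              # keep the old center if a cluster empties
                centers[j] = members.mean(0)
    # Final assignment, so the weights match the returned centers.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)
    weights = np.bincount(assign, minlength=k) / len(z)
    return centers, weights

def distill_latents(latents, labels, ipc):
    """Run k-means per class with K = ipc; returns {class: (centers, weights)}."""
    return {
        int(c): kmeans_with_weights(latents[labels == c], ipc)
        for c in np.unique(labels)
    }

# Stand-in data: in practice these would be encoder latents of real images.
rng = np.random.default_rng(0)
latents = rng.normal(size=(400, 8))
labels = np.repeat([0, 1], 200)

distilled = distill_latents(latents, labels, ipc=5)
centers0, weights0 = distilled[0]
# Next step (not shown): decode each center to an image with your diffusion
# decoder and carry the matching weight into the student's training loss.
```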
Optimization Features
System Optimization
- constant memory w.r.t. IPC
Training Optimization
- data-efficient training
- weighted-sample training
- soft-label distillation
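One way to combine weighted samples with soft labels is a weighted soft-label cross-entropy. The sketch below is an assumption about how such a loss could look, not the paper's exact objective: the function name `weighted_soft_ce` and the choice to normalize by the total weight are illustrative.

```python
import numpy as np

def weighted_soft_ce(logits, soft_labels, weights):
    """Weighted cross-entropy against soft labels.

    logits: (n, c) student outputs; soft_labels: (n, c) rows summing to 1;
    weights: (n,) per-sample weights, e.g. cluster masses.
    Normalizing by the weight sum is an illustrative choice.
    """
    z = logits - logits.max(axis=1, keepdims=True)        # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -(soft_labels * log_probs).sum(axis=1)
    return float((weights * per_sample).sum() / weights.sum())

# Tiny example: three synthetic images, four classes.
logits = np.zeros((3, 4))
soft_labels = np.array([[0.7, 0.1, 0.1, 0.1],
                        [0.25, 0.25, 0.25, 0.25],
                        [0.0, 0.0, 1.0, 0.0]])
weights = np.array([0.6, 0.3, 0.1])
loss = weighted_soft_ce(logits, soft_labels, weights)
```

With all-zero logits the student predicts a uniform distribution, so the loss equals log(4) regardless of the weights, which is a convenient sanity check.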
Reproducibility
Data Urls
- ImageNet-1K (public dataset)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Method quality depends on the latent diffusion prior; poor priors hurt fidelity.
- Weights can be sensitive to training hyperparameters such as the learning rate.
- Theoretical bounds assume compact support and Lipschitz conditions that may not hold exactly in practice.
- Image synthesis adds non-trivial GPU time (reported as hours for ImageNet-1K).
When Not To Use
- You lack a reliable pretrained latent diffusion model or GPU resources to synthesize images.
- You need exact, provenance-traceable real data for auditing or legal reasons.
- Your task needs rare edge cases that a generative prior cannot capture.
Failure Modes
- Generative prior can bias synthetic data away from true rare modes.
- k-means can converge to local minima, producing suboptimal quantizers.
- Cluster weights may over/under-emphasize clusters and harm student training if not tuned.
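A common mitigation for the local-minima failure mode is to run k-means from several random initializations and keep the run with the lowest quantization error. This is a generic technique, not something the source attributes to the paper; `lloyd` and `best_of_restarts` are hypothetical helper names.

```python
import numpy as np

def lloyd(z, k, seed, iters=50):
    """One k-means run; returns (centers, mean squared distance to nearest center)."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((z[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):       # skip empty clusters
                centers[j] = z[assign == j].mean(0)
    inertia = ((z[:, None] - centers[None]) ** 2).sum(-1).min(1).mean()
    return centers, float(inertia)

def best_of_restarts(z, k, seeds=range(5)):
    """Run k-means once per seed and keep the lowest-error quantizer."""
    runs = [lloyd(z, k, s) for s in seeds]
    return min(runs, key=lambda r: r[1])

rng = np.random.default_rng(0)
z = rng.normal(size=(300, 4))             # stand-in latents
centers, inertia = best_of_restarts(z, k=6)
single_inertias = [lloyd(z, 6, s)[1] for s in range(5)]
```

The mean squared distance to the nearest center equals the squared Wasserstein-2 error of the quantizer when weights are set to the cluster masses, so selecting on it targets the same quantity the method monitors.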
Core Entities
Models
- DDOQ (this paper)
- D4M
- Stable Diffusion v1.5 (LDM)
- DiT (Diffusion Transformer)
- ResNet-18
- ResNet-50
- ResNet-101
- Swin-T
- MobileNet-V2
- EfficientNet-B0
Metrics
- Accuracy
- Wasserstein-2 (latent)
- Inception Score
- FID
Datasets
- ImageNet-1K
- ImageNette
- ImageWoof
Benchmarks
- Accuracy
- Wasserstein-2 distance
- IPC (images per class)

