Overview
DDOQ is simple to implement (encode→k-means→decode→weighted training), scales to ImageNet, and has theoretical guarantees; main requirements are a good pretrained latent diffusion prior and GPU time for synthesis.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
You can cut data storage and training compute by replacing large datasets with a small set of weighted synthetic images decoded from latent clusters, while preserving accuracy when you use a good latent diffusion prior.
Summary TLDR
The paper recasts latent-space dataset distillation as optimal quantization. It proves that clustering a low-dimensional diffusion latent and decoding the cluster centers yields synthetic data that converges to the true distribution as the number of quantization points K grows (rate O(K^{-1/d})). It introduces DDOQ: encode images with a pretrained latent diffusion model, run k-means with automatically computed per-cluster weights, decode the centers to images, and train on the weighted samples with soft labels. Empirically, on ImageNet-1K and subsets, DDOQ reduces the latent Wasserstein-2 distance by roughly 15–16% relative to prior latent clustering (D4M) and improves top-1 accuracy substantially when using a stronger DiT backbone (e.g., ResNet-18: 3
Problem Statement
Training modern models needs lots of images and compute. Dataset distillation tries to replace big datasets with a much smaller synthetic set. Existing bi-level distillation is costly and hard to scale. Disentangled (latent) methods work well in practice but lacked a formal consistency guarantee. This paper asks: can we justify and improve latent clustering approaches, and produce synthetic data that provably approximates the true data distribution when decoded through diffusion models?
Main Contribution
Theoretical link: shows that latent clustering is an instance of optimal quantization and proves that the approximation stays consistent when pushed forward through the diffusion model (Theorem 1; Corollary 1).
Algorithm DDOQ: per-class latent k-means + automatically learned cluster weights, then decode via latent diffusion and train on weighted synthetic data.
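The clustering-and-weighting step can be sketched as follows. This is a minimal, dependency-free illustration, assuming the latents are already encoded as tuples of floats and that each cluster weight is the fraction of latents assigned to that center. The function names (`kmeans_with_weights`, `distill_per_class`) are hypothetical, and the plain Lloyd iteration stands in for whichever k-means variant the authors use. Encoding and decoding with the diffusion model are omitted.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(latents, centers):
    """Group latents by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for z in latents:
        j = min(range(len(centers)), key=lambda i: sq_dist(z, centers[i]))
        clusters[j].append(z)
    return clusters


def kmeans_with_weights(latents, k, iters=50, seed=0):
    """Lloyd's k-means; returns (centers, weights), weights = cluster mass."""
    rng = random.Random(seed)
    centers = rng.sample(latents, k)  # assumption: random-sample initialisation
    for _ in range(iters):
        clusters = assign(latents, centers)
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)  # an empty cluster keeps its center
        ]
        if new_centers == centers:
            break
        centers = new_centers
    clusters = assign(latents, centers)  # final assignment for the weights
    weights = [len(c) / len(latents) for c in clusters]
    return centers, weights


def distill_per_class(latents_by_class, ipc):
    """Run the quantizer once per class with K = images per class (IPC)."""
    return {
        label: kmeans_with_weights(latents, ipc)
        for label, latents in latents_by_class.items()
    }
```

Calling `distill_per_class(latents_by_class, ipc=10)` returns, for each label, the centers that would be decoded into images and the weights that would be attached to them during training.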
Key Findings
Optimal quantization in latent space pushes forward to consistent approximations in image space.
Adding per-cluster weights cuts latent Wasserstein-2 error vs uniform barycenters.
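A one-dimensional toy case illustrates why weights matter. For a fixed set of centers, weighting each center by the fraction of points nearest to it gives the smallest possible Wasserstein-2 distance to the data, and that distance equals the mean squared distance to the nearest center. The sketch below is not the paper's evaluation: the data, the centers, and the helper names `w2_squared_1d` and `voronoi_weights` are illustrative assumptions, and the closed-form monotone coupling used here is valid only in one dimension.

```python
def w2_squared_1d(atoms_a, weights_a, atoms_b, weights_b, tol=1e-12):
    """Squared W2 between two weighted 1D discrete measures (weights sum to 1).

    Uses the monotone coupling, which is optimal in one dimension.
    """
    a = sorted(zip(atoms_a, weights_a))
    b = sorted(zip(atoms_b, weights_b))
    i = j = 0
    rem_a, rem_b = a[0][1], b[0][1]  # mass still to be moved from each atom
    total = 0.0
    while i < len(a) and j < len(b):
        mass = min(rem_a, rem_b)
        total += mass * (a[i][0] - b[j][0]) ** 2
        rem_a -= mass
        rem_b -= mass
        if rem_a <= tol:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0.0
        if rem_b <= tol:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0.0
    return total


def voronoi_weights(points, centers):
    """Fraction of points whose nearest center is each center."""
    counts = [0] * len(centers)
    for x in points:
        j = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
        counts[j] += 1
    return [c / len(points) for c in counts]


# Imbalanced toy data: six points near 0.25, two points near 5.1.
points = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0, 5.2]
centers = [0.25, 5.1]
empirical = [1 / len(points)] * len(points)

w2_counts = w2_squared_1d(points, empirical, centers, voronoi_weights(points, centers))
w2_uniform = w2_squared_1d(points, empirical, centers, [0.5, 0.5])
print(w2_counts < w2_uniform)
```

With uniform weights, half of the mass must be transported to the second center even though only a quarter of the points lie near it, which inflates the distance.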
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Latent Wasserstein-2 (vs encoded training latents) | ≈15.7% average reduction | D4M (Wasserstein barycenter) | -15.7% (avg) | ImageNet-1K (example classes) | Table 1 shows per-class W2 and average reduction | Table 1 |
| Accuracy | 53.0% (DDOQ-DiT, IPC10) | DiT random init decoding: 39.6% | +13.4 percentage points | ImageNet-1K (IPC 10) | Table 4 comparison on ImageNet-1K | Table 4 |
What To Try In 7 Days
Encode a small labelled subset with a pretrained LDM encoder and run k-means per class (K = desired IPC).
Decode the K centers with your diffusion decoder and assign weights from cluster counts to each synthetic image.
Train a student model with soft labels and a weighted loss; compare Top-1 accuracy vs baseline small subsets and monitor latent W2 distance.
If available, swap in a stronger laten
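The weighted training step in the list above can be sketched as a per-sample weighted cross-entropy against soft labels. This is a minimal illustration under stated assumptions: soft labels are probability vectors (for example, produced by a teacher model), weights are the cluster weights of the synthetic images, and the loss is normalised by the total weight in the batch. The summary does not specify the exact loss, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def weighted_soft_ce(batch_logits, batch_soft_labels, weights):
    """Weighted mean of soft-label cross-entropies over a batch."""
    total_weight = sum(weights)
    loss = 0.0
    for logits, soft, w in zip(batch_logits, batch_soft_labels, weights):
        log_p = log_softmax(logits)
        # Cross-entropy between the soft label and the predicted distribution.
        loss += w * -sum(q * lp for q, lp in zip(soft, log_p))
    return loss / total_weight


# Two synthetic images; the first carries three times the cluster weight.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
soft_labels = [[0.8, 0.15, 0.05], [1 / 3, 1 / 3, 1 / 3]]
cluster_weights = [0.75, 0.25]
loss = weighted_soft_ce(logits, soft_labels, cluster_weights)
```

In a deep-learning framework the same quantity would typically be computed on tensors; the structure (per-sample loss, multiply by weight, normalise) stays the same.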
Risks & Boundaries
Limitations
Method quality depends on the latent diffusion prior; poor priors hurt fidelity.
Weights can be sensitive to training hyperparameters and learning rate.
When Not To Use
You lack a reliable pretrained latent diffusion model or GPU resources to synthesize images.
You need exact, provenance-traceable real data for auditing or legal reasons.
Failure Modes
Generative prior can bias synthetic data away from true rare modes.
k-means can converge to local minima, producing suboptimal quantizers.

