Make tiny weighted training sets by clustering latents and decoding with diffusion — provably consistent.

January 13, 2025 (7 min)

Overview

Decision Snapshot: Ready For Pilot

DDOQ is simple to implement (encode→k-means→decode→weighted training), scales to ImageNet, and has theoretical guarantees; main requirements are a good pretrained latent diffusion prior and GPU time for synthesis.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hong Ye Tan, Emma Slade

Why It Matters For Business

You can cut data storage and training compute by replacing large datasets with a small set of weighted synthetic images decoded from latent clusters, while preserving accuracy when you use a good latent diffusion prior.

Who Should Care

Summary TLDR

The paper recasts latent-space dataset distillation as optimal quantization. It proves that clustering a low-dimensional diffusion latent and decoding the cluster centers yields synthetic data that converges to the true distribution as the number of quantization points K increases, at rate O(K^{-1/d}). It introduces DDOQ: encode images with a pretrained latent diffusion model, run k-means with automatically computed per-cluster weights, decode the centers to images, and train using weighted samples and soft labels. Empirically, on ImageNet-1K and subsets, DDOQ reduces latent Wasserstein-2 distance by roughly 15–16% relative to prior latent clustering (D4M) and improves top-1 accuracy substantially when using a stronger DiT backbone (e.g., ResNet-18: 3

Problem Statement

Training modern models needs lots of images and compute. Dataset distillation tries to replace big datasets with a much smaller synthetic set. Existing bi-level distillation is costly and hard to scale. Disentangled (latent) methods work well in practice but lacked a formal consistency guarantee. This paper asks: can we justify and improve latent clustering approaches, and produce synthetic data that provably approximates the true data distribution when decoded through diffusion models?

Main Contribution

Theoretical link: show latent clustering = optimal quantization and prove pushforward consistency through diffusion (Theorem 1; Corollary 1).

Algorithm DDOQ: per-class latent k-means + automatically learned cluster weights, then decode via latent diffusion and train on weighted synthetic data.
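The clustering step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it runs plain Lloyd's k-means on toy 2-D points standing in for the encoded latents of one class, and returns each cluster's share of the points as its weight. The function names, the toy data, and the evenly spaced initialization are assumptions made for the example; encoding and diffusion decoding are out of scope here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))


def kmeans_with_weights(points, k, iters=50):
    """Lloyd's k-means. Returns (centers, weights), where weights[j] is the
    fraction of points assigned to center j."""
    # Assumption: deterministic, evenly spaced initialization for reproducibility.
    # Practical setups usually prefer k-means++ or several random restarts.
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        assign = [nearest_center(p, centers) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[i] for m in members) / len(members) for i in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments are stable, so the centers no longer move
        centers = new_centers
    # Final assignment against the final centers, then cluster shares as weights.
    assign = [nearest_center(p, centers) for p in points]
    weights = [assign.count(j) / len(points) for j in range(k)]
    return centers, weights


# Toy "latents" for one class: 3 points near the origin, 5 points near (5, 5).
latents = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [4.9, 5.0],
]
centers, weights = kmeans_with_weights(latents, k=2)
print(centers)
print(weights)
```

In a per-class setup, the same routine would be called once per class with K equal to the desired images per class, and each decoded center would carry its weight into training.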

Key Findings

Optimal quantization in latent space pushes forward to consistent approximations in image space.

Numbers: Convergence rate O(K^{-1/d}) (Corollary 1).

Practical Use: Use low-dimensional latents and increase K to systematically reduce synthetic-vs-real mismatch for training gradients.

Evidence Ref: Theorem 1; Corollary 1
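To see why low-dimensional latents matter under a K^{-1/d} rate, a small calculation helps. If the bound scales as C·K^{-1/d}, shrinking it by a factor r (with r < 1) requires multiplying K by r^{-d}. The sketch below only evaluates that arithmetic; the constant C is unknown, the rate is an upper bound, and the helper name is hypothetical.

```python
def points_multiplier(error_ratio, latent_dim):
    """Factor by which K must grow so that a K^(-1/d) bound shrinks by
    error_ratio (a value below 1), assuming the constant stays fixed."""
    return error_ratio ** (-latent_dim)


# Halving the bound in a 2-dimensional latent needs 4x the points;
# in a 10-dimensional latent it needs 1024x.
print(points_multiplier(0.5, 2))   # 4.0
print(points_multiplier(0.5, 10))  # 1024.0
```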

Adding per-cluster weights cuts latent Wasserstein-2 error vs uniform barycenters.

Numbers: Average W2 reduction ≈ 15.7% on ImageNet-1K classes.

Practical Use: Keep cluster counts as weights when decoding and use them to weight training losses; this measurably improves fidelity at trivial extra cost.

Evidence Ref: Table 1
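One way to use the cluster weights during training is a weighted soft-label cross-entropy, sketched below in plain Python. This is an illustration under assumptions, not the paper's training code: the function names are hypothetical, and normalizing by the sum of the weights is a design choice made here so that rescaling all weights leaves the loss unchanged.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_soft_label_loss(batch_logits, batch_soft_labels, batch_weights):
    """Weighted mean of per-sample cross-entropy against soft labels.
    Assumption: weights are normalized by their sum."""
    total = 0.0
    for logits, soft, w in zip(batch_logits, batch_soft_labels, batch_weights):
        logp = log_softmax(logits)
        ce = -sum(q * lp for q, lp in zip(soft, logp))
        total += w * ce
    return total / sum(batch_weights)


# Two synthetic samples whose weights are their cluster shares.
logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
soft_labels = [[0.5, 0.5], [1.0, 0.0]]
weights = [0.375, 0.625]
loss = weighted_soft_label_loss(logits, soft_labels, weights)
print(loss)
```

Passing raw cluster counts such as `[3, 5]` instead of shares gives the same value, because of the normalization.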

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Latent Wasserstein-2 (vs encoded training latents) | ≈ -15.7% average reduction | D4M (Wasserstein barycenter) | -15.7% (avg) | ImageNet-1K (example classes) | Table 1 shows per-class W2 and average reduction | Table 1
Accuracy | 53.0% (DDOQ-DiT, IPC 10) | DiT random init decoding: 39.6% | +13.4 percentage points | ImageNet-1K (IPC 10) | Table 4 comparison on ImageNet-1K | Table 4

What To Try In 7 Days

Encode a small labelled subset with a pretrained LDM encoder and run k-means per class (K = desired IPC).

Decode the K centers with your diffusion decoder and assign weights from cluster counts to each synthetic image.

Train a student model with soft labels and weighted loss; compare Top-1 accuracy vs baseline small subsets and monitor latent W2 distance.

If available, swap in a stronger laten
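For the monitoring step, one inexpensive proxy is the quantization error of the centers against the encoded latents. When each center's weight equals the fraction of latents nearest to it, the Wasserstein-2 distance between the empirical latent distribution and the weighted centers equals the root-mean-square distance from each latent to its nearest center, because sending every point to its nearest center is then an optimal coupling. The sketch below computes that quantity on toy 1-D data; the function name is hypothetical, and the paper's own evaluation protocol may differ.

```python
import math


def quantization_w2(points, centers):
    """Root-mean-square distance from each point to its nearest center.
    Equals W2 between the empirical distribution of `points` and the centers
    weighted by their nearest-neighbor cell shares."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers
        )
    return math.sqrt(total / len(points))


toy_latents = [[0.0], [2.0], [10.0]]
w2_two_centers = quantization_w2(toy_latents, [[1.0], [10.0]])
w2_three_centers = quantization_w2(toy_latents, [[0.0], [2.0], [10.0]])
print(w2_two_centers, w2_three_centers)
```

Tracking this value as K grows gives a quick check that extra centers are actually reducing the latent mismatch.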

Optimization Features

System Optimization
constant memory w.r.t. IPC
Training Optimization
data-efficient training, weighted-sample training, soft-label distillation

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ImageNet-1K (public dataset)

Risks & Boundaries

Limitations

Method quality depends on the latent diffusion prior; poor priors hurt fidelity.

Weights can be sensitive to training hyperparameters and learning rate.

When Not To Use

You lack a reliable pretrained latent diffusion model or GPU resources to synthesize images.

You need exact, provenance-traceable real data for auditing or legal reasons.

Failure Modes

Generative prior can bias synthetic data away from true rare modes.

k-means can converge to local minima, producing suboptimal quantizers.

Core Entities

Models

DDOQ (this paper), D4M, Stable Diffusion v1.5 (LDM), DiT (Diffusion Transformer), ResNet-18, ResNet-50, ResNet-101, Swin-T, MobileNet-V2, EfficientNet-B0

Metrics

Accuracy, Wasserstein-2 (latent), Inception Score, FID

Datasets

ImageNet-1K, ImageNette, ImageWoof

Benchmarks

Accuracy, Wasserstein-2 distance, IPC (images per class)