Make tiny weighted training sets by clustering latents and decoding with diffusion — provably consistent.

January 13, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Hong Ye Tan, Emma Slade

Links

Abstract / PDF

Why It Matters For Business

You can cut data storage and training compute by replacing large datasets with a small set of weighted synthetic images decoded from latent clusters, while preserving accuracy when you use a good latent diffusion prior.

Summary TLDR

The paper recasts latent-space dataset distillation as optimal quantization. It proves that clustering a low-dimensional diffusion latent and decoding the cluster centers yields synthetic data that converges to the true distribution as the number of quantization points K increases (rate O(K^{-1/d})). It introduces DDOQ: encode images with a pretrained latent diffusion model, run k-means with automatically computed per-cluster weights, decode the centers to images, and train using weighted samples and soft labels. Empirically, on ImageNet-1K and subsets, DDOQ reduces the latent Wasserstein-2 distance by roughly 15–16% relative to prior latent clustering (D4M) and improves top-1 accuracy substantially when using a stronger DiT backbone (e.g., ResNet-18: 3

Problem Statement

Training modern models needs lots of images and compute. Dataset distillation tries to replace big datasets with a much smaller synthetic set. Existing bi-level distillation is costly and hard to scale. Disentangled (latent) methods work well in practice but lacked a formal consistency guarantee. This paper asks: can we justify and improve latent clustering approaches, and produce synthetic data that provably approximates the true data distribution when decoded through diffusion models?

Main Contribution

Theoretical link: show latent clustering = optimal quantization and prove pushforward consistency through diffusion (Theorem 1; Corollary 1).

Algorithm DDOQ: per-class latent k-means + automatically learned cluster weights, then decode via latent diffusion and train on weighted synthetic data.

Empirical gains on ImageNet-1K and subsets when using the same diffusion backbone; larger gains with a stronger DiT backbone, plus improved cross-architecture generalization.

Practical recipe with constant memory scaling in images-per-class (IPC) and a small extra cost for synthesis; includes an ablation on weighting strategies.
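The per-class clustering step with weights can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses plain Lloyd's k-means in NumPy, takes weights as normalized cluster counts, and runs on random vectors standing in for encoded latents. The function name `kmeans_with_weights` and all parameters are hypothetical, and the paper may compute its weights differently.

```python
import numpy as np

def kmeans_with_weights(latents, k, n_iter=50, seed=0):
    """Lloyd's k-means; returns (centers, weights) with weights = cluster mass."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points.
    centers = latents[rng.choice(len(latents), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every latent to every center: shape (n, k).
        d2 = ((latents[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = latents[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment, then weights as the fraction of points per cluster.
    d2 = ((latents[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)
    counts = np.bincount(assign, minlength=k)
    return centers, counts / counts.sum()

# Stand-in for the encoded latents of one class (assumption: 500 points, dim 8).
rng = np.random.default_rng(0)
class_latents = rng.normal(size=(500, 8))
centers, weights = kmeans_with_weights(class_latents, k=10)  # k plays the role of IPC
```

In a real pipeline the centers would then be passed to the diffusion decoder, and each decoded image would carry its cluster weight into training.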

Key Findings

Optimal quantization in latent space pushes forward to consistent approximations in image space.

Numbers: Convergence rate O(K^{-1/d}) (Corollary 1).

Adding per-cluster weights cuts latent Wasserstein-2 error vs uniform barycenters.

Numbers: Average W2 reduction of ≈15.7% on ImageNet-1K classes.

Stronger diffusion backbones amplify distillation gains at low IPC.

Numbers: ImageNet-1K, ResNet-18, IPC10: DDOQ-DiT 53.0 vs DiT baseline 39.6 (≈+13.4 pp).

DDOQ improves final classification compared to prior disentangled methods.

Numbers: Reported 30% reduction in the error gap using ResNet-101 at IPC 200 (text summary).
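The consistency finding can be read through a standard Lipschitz argument. The block below is a sketch in notation chosen here, not taken from the paper: μ is the latent data distribution, ν_K a weighted quantizer with K atoms, d the latent dimension, and D a decoding map assumed to be L-Lipschitz. The exact statements of Theorem 1 and Corollary 1 may differ.

```latex
% Assumed notation: \mu has compact support in R^d, \nu_K is an optimal
% K-point weighted quantizer, D is an L-Lipschitz decoding map, and
% D_{\#} denotes the pushforward of a measure through D.
W_2(\mu, \nu_K) = O\!\left(K^{-1/d}\right),
\qquad
W_2\!\left(D_{\#}\mu,\, D_{\#}\nu_K\right)
  \le L \, W_2(\mu, \nu_K)
  = O\!\left(K^{-1/d}\right).
```

Under this rate, halving the bound takes about 2^d times as many points, which is why a low-dimensional latent matters.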

Results

Latent Wasserstein-2 (vs encoded training latents)

Value: ≈15.7% average reduction

Baseline: D4M (Wasserstein barycenter)

Accuracy

Value: 53.0% (DDOQ-DiT, IPC10)

Baseline: DiT random init decoding, 39.6%

Accuracy

Value: 62.7% (DDOQ-DiT, IPC50)

Baseline: DiT, 52.9%
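To see why weights alone can lower the latent Wasserstein-2 distance, here is a one-dimensional toy comparison. It is an illustration under stated assumptions, not the paper's evaluation: the data, the fixed centers, and the helper `w2_1d` are all hypothetical, and both weightings share the same centers, whereas D4M's barycenter would place its own. In one dimension W2 can be computed exactly from quantile functions, which is what the helper does.

```python
import numpy as np

def w2_1d(x, wx, y, wy):
    """Exact Wasserstein-2 distance between two weighted point sets on the real line."""
    ix, iy = np.argsort(x), np.argsort(y)
    x, wx = x[ix], wx[ix] / wx.sum()
    y, wy = y[iy], wy[iy] / wy.sum()
    cx, cy = np.cumsum(wx), np.cumsum(wy)
    cx[-1] = cy[-1] = 1.0  # guard against floating-point drift
    # Both quantile functions are constant between consecutive breakpoints.
    t = np.unique(np.concatenate([cx, cy]))
    t_prev = np.concatenate([[0.0], t[:-1]])
    mid = 0.5 * (t + t_prev)
    qx = x[np.searchsorted(cx, mid, side="left")]
    qy = y[np.searchsorted(cy, mid, side="left")]
    return float(np.sqrt(np.sum((t - t_prev) * (qx - qy) ** 2)))

rng = np.random.default_rng(0)
# Imbalanced toy "latents": one large mode and one small mode.
data = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(6.0, 0.5, 100)])
centers = np.array([-1.0, 0.0, 1.0, 6.0])  # hypothetical fixed centers

# Count-based weights: fraction of points whose nearest center is each atom.
assign = np.abs(data[:, None] - centers[None, :]).argmin(axis=1)
count_weights = np.bincount(assign, minlength=len(centers)) / len(data)
uniform_weights = np.full(len(centers), 1.0 / len(centers))
data_weights = np.full(len(data), 1.0 / len(data))

w2_weighted = w2_1d(data, data_weights, centers, count_weights)
w2_uniform = w2_1d(data, data_weights, centers, uniform_weights)
print(f"W2 with count weights:   {w2_weighted:.3f}")
print(f"W2 with uniform weights: {w2_uniform:.3f}")
```

For fixed centers, nearest-center counts are the weights that minimize W2 to the data, so the count-weighted distance cannot exceed the uniform one. The gap grows as the clusters become more imbalanced.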

Who Should Care

What To Try In 7 Days

Encode a small labelled subset with a pretrained LDM encoder and run k-means per class (K = desired IPC).

Decode the K centers with your diffusion decoder and assign weights from cluster counts to each synthetic image.

Train a student model with soft labels and weighted loss; compare Top-1 accuracy vs baseline small subsets and monitor latent W2 distance.

If available, swap in a stronger laten
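The weighted, soft-label training objective from the steps above can be sketched as a weighted cross-entropy. This is a minimal NumPy version under assumptions: soft labels are taken to be class-probability vectors (for example from a teacher model), weights are normalized to sum to one, and the function names are hypothetical. The paper's exact loss and weight normalization may differ.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def weighted_soft_label_loss(student_logits, soft_labels, weights):
    """Cross-entropy against soft labels, averaged with per-sample weights."""
    per_sample = -(soft_labels * log_softmax(student_logits)).sum(axis=1)
    w = weights / weights.sum()  # normalize so the loss scale is weight-independent
    return float((w * per_sample).sum())

# Toy batch: 4 synthetic images, 3 classes (all values are made up).
student_logits = np.zeros((4, 3))
soft_labels = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])
cluster_weights = np.array([0.4, 0.3, 0.2, 0.1])  # e.g. normalized cluster counts
loss = weighted_soft_label_loss(student_logits, soft_labels, cluster_weights)
```

In a deep-learning framework the same idea amounts to computing the per-sample loss without reduction and taking a weighted sum.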

Optimization Features

System Optimization

  • constant memory w.r.t. IPC

Training Optimization

  • data-efficient training
  • weighted-sample training
  • soft-label distillation

Reproducibility

Data Urls

  • ImageNet-1K (public dataset)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Method quality depends on the latent diffusion prior; poor priors hurt fidelity.
  • Weights can be sensitive to training hyperparameters and learning rate.
  • Theoretical bounds assume compact support and Lipschitz conditions that may not hold exactly in practice.
  • Image synthesis adds non-trivial GPU time (synthesis reported as hours for ImageNet-1K).

When Not To Use

  • You lack a reliable pretrained latent diffusion model or GPU resources to synthesize images.
  • You need exact, provenance-traceable real data for auditing or legal reasons.
  • Your task needs rare edge cases that a generative prior cannot capture.

Failure Modes

  • Generative prior can bias synthetic data away from true rare modes.
  • k-means can converge to local minima, producing suboptimal quantizers.
  • Cluster weights may over/under-emphasize clusters and harm student training if not tuned.

Core Entities

Models

  • DDOQ (this paper)
  • D4M
  • Stable Diffusion v1.5 (LDM)
  • DiT (Diffusion Transformer)
  • ResNet-18
  • ResNet-50
  • ResNet-101
  • Swin-T
  • MobileNet-V2
  • EfficientNet-B0

Metrics

  • Accuracy
  • Wasserstein-2 (latent)
  • Inception Score
  • FID

Datasets

  • ImageNet-1K
  • ImageNette
  • ImageWoof

Benchmarks

  • Accuracy
  • Wasserstein-2 distance
  • IPC (images per class)