Make tiny weighted training sets by clustering latents and decoding with diffusion — provably consistent.

Overview

Decision Snapshot: Ready For Pilot

DDOQ is simple to implement (encode→k-means→decode→weighted training), scales to ImageNet, and has theoretical guarantees; main requirements are a good pretrained latent diffusion prior and GPU time for synthesis.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Hong Ye Tan, Emma Slade

Why It Matters For Business

You can cut data storage and training compute by replacing large datasets with a small set of weighted synthetic images decoded from latent clusters, while preserving accuracy when you use a good latent diffusion prior.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The paper recasts latent-space dataset distillation as optimal quantization. It proves that clustering a low-dimensional diffusion latent and decoding the cluster centers yields synthetic data that converges to the true distribution as the number of points K increases (rate O(K^{-1/d}), where d is the latent dimension). It introduces DDOQ: encode images with a pretrained latent diffusion model, run k-means with automatically computed per-cluster weights, decode the centers to images, and train using weighted samples and soft labels. Empirically, on ImageNet-1K and subsets, DDOQ reduces latent Wasserstein-2 distance by ~15–16% vs prior latent clustering (D4M) and improves top-1 accuracy substantially when using a stronger DiT backbone (e.g., ResNet-18: 3

Problem Statement

Training modern models needs lots of images and compute. Dataset distillation tries to replace big datasets with a much smaller synthetic set. Existing bi-level distillation is costly and hard to scale. Disentangled (latent) methods work well in practice but lacked a formal consistency guarantee. This paper asks: can we justify and improve latent clustering approaches, and produce synthetic data that provably approximates the true data distribution when decoded through diffusion models?

Main Contribution

Theoretical link: show latent clustering = optimal quantization and prove pushforward consistency through diffusion (Theorem 1; Corollary 1).

Algorithm DDOQ: per-class latent k-means with automatically computed cluster weights, followed by decoding via latent diffusion and training on the weighted synthetic data.
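The clustering step can be sketched without any ML framework. The following is a minimal illustration, not the authors' implementation: it assumes the latents of one class are already available as plain tuples (the pretrained encoder and diffusion decoder are out of scope here), and the names `squared_distance` and `kmeans_with_weights` are hypothetical. It runs Lloyd's k-means and returns each center together with the fraction of latents assigned to it, which is the per-cluster weight.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_with_weights(points, k, iters=50, seed=0):
    """Lloyd's k-means; returns (centers, weights).

    weights[j] is the fraction of points assigned to centers[j],
    so the weights sum to 1. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]

    def assign_all():
        # Index of the nearest center for every point.
        return [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]

    for _ in range(iters):
        assign = assign_all()
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]

    # Recompute assignments so the weights match the final centers.
    assign = assign_all()
    weights = [assign.count(j) / len(points) for j in range(k)]
    return centers, weights


# Stand-in "latents" for a single class: three near the origin, one far away.
class_latents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
centers, weights = kmeans_with_weights(class_latents, k=2)
print(centers, weights)
```

In a full pipeline, this would be run once per class with K equal to the desired images per class, and each center would then be passed to the diffusion decoder while its weight is stored alongside the decoded image.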

Key Findings

Optimal quantization in latent space pushes forward to consistent approximations in image space.

Numbers: Convergence rate O(K^{-1/d}) (Corollary 1).

Practical Use: Use low-dimensional latents and increase K to systematically reduce synthetic-vs-real mismatch for training gradients.

Evidence Ref: Theorem 1; Corollary 1
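To see why the latent dimension matters so much for this rate, it helps to write the quantization error explicitly. The notation below is an assumption made for illustration (it follows the classical optimal quantization setting and is not copied from the paper): μ is the latent data distribution on R^d, and ν ranges over probability measures supported on at most K points.

```latex
% Assumed notation, not taken verbatim from the paper:
%   \mu : latent data distribution on \mathbb{R}^d
%   \nu : any probability measure supported on at most K points
\[
  e_K(\mu) \;=\; \inf_{|\operatorname{supp}\nu| \le K} W_2(\mu, \nu)
  \;=\; O\!\left(K^{-1/d}\right).
\]
% Consequence: halving the bound requires multiplying K by 2^d, since
\[
  \frac{(2^d K)^{-1/d}}{K^{-1/d}} \;=\; \frac{1}{2}.
\]
% Example: d = 4 needs 16 times more points; d = 64 needs 2^{64} times more.
```

This is why the practical advice is to quantize in a low-dimensional latent space rather than in pixel space: the same increase in K buys a much larger reduction in the error bound when d is small.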

Adding per-cluster weights cuts latent Wasserstein-2 error vs uniform barycenters.

Numbers: Average reduction ≈ 15.7% in W2 on ImageNet-1K classes.

Practical Use: Keep cluster counts as weights when decoding and use them to weight training losses; this measurably improves fidelity at trivial extra cost.

Evidence Ref: Table 1
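A toy 1D example shows the mechanism behind this finding. This is a sketch under stated assumptions, not a reproduction of Table 1: the latents and centers are made-up numbers, `w2_squared_1d` is a hypothetical helper that uses the monotone (sorted) coupling, which is optimal in one dimension, and the "uniform" case simply reuses the same centers with equal weights (D4M's barycenter also optimizes positions, which this toy does not model).

```python
def w2_squared_1d(xs, ps, ys, qs, eps=1e-12):
    """Squared Wasserstein-2 distance between two discrete 1D measures.

    The measures are sum_i ps[i] * delta(xs[i]) and sum_j qs[j] * delta(ys[j]).
    In 1D the optimal transport plan matches mass in sorted order.
    """
    a = sorted(zip(xs, ps))
    b = sorted(zip(ys, qs))
    i = j = 0
    mass_a, mass_b = a[0][1], b[0][1]
    cost = 0.0
    while i < len(a) and j < len(b):
        moved = min(mass_a, mass_b)
        cost += moved * (a[i][0] - b[j][0]) ** 2
        mass_a -= moved
        mass_b -= moved
        if mass_a <= eps:  # source atom exhausted, move to the next one
            i += 1
            mass_a = a[i][1] if i < len(a) else 0.0
        if mass_b <= eps:  # target atom exhausted, move to the next one
            j += 1
            mass_b = b[j][1] if j < len(b) else 0.0
    return cost


# Made-up 1D "latents": three close together, one far away.
latents = [0.0, 0.2, 0.4, 10.0]
point_mass = [0.25] * 4

centers = [0.2, 10.0]            # cluster means of {0, 0.2, 0.4} and {10}
counts = [3, 1]                  # cluster sizes
count_weights = [c / sum(counts) for c in counts]
uniform_weights = [0.5, 0.5]

w2_counts = w2_squared_1d(latents, point_mass, centers, count_weights) ** 0.5
w2_uniform = w2_squared_1d(latents, point_mass, centers, uniform_weights) ** 0.5
print("W2 with count weights:  ", w2_counts)
print("W2 with uniform weights:", w2_uniform)
```

With uniform weights, a quarter of the mass near the origin has to be transported all the way to the distant center, which dominates the distance. Count-based weights avoid that transport entirely.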

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Latent Wasserstein-2 (vs encoded training latents)	≈15.7% average reduction	D4M (Wasserstein barycenter)	-15.7% (avg)	ImageNet-1K (example classes)	Table 1 shows per-class W2 and average reduction	Table 1
Accuracy	53.0% (DDOQ-DiT, IPC10)	DiT random init decoding: 39.6%	+13.4 percentage points	ImageNet-1K (IPC 10)	Table 4 comparison on ImageNet-1K	Table 4

What To Try In 7 Days

Encode a small labelled subset with a pretrained LDM encoder and run k-means per class (K = desired IPC).

Decode the K centers with your diffusion decoder and assign weights from cluster counts to each synthetic image.

Train a student model with soft labels and weighted loss; compare Top-1 accuracy vs baseline small subsets and monitor latent W2 distance.

If available, swap in a stronger laten
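The training step in the list above combines two ideas: soft labels and per-sample weights. A minimal, framework-free sketch of such a loss follows. It is an assumption about the general shape of the objective, not the paper's exact formulation: the function names are hypothetical, the soft labels would in practice come from a teacher model, and normalizing by the total weight is one reasonable choice among several.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def weighted_soft_ce(batch_logits, batch_soft_labels, weights):
    """Weighted average of soft-label cross-entropy over a batch.

    batch_logits:      list of per-sample logit lists (student outputs)
    batch_soft_labels: list of per-sample probability lists (soft targets)
    weights:           per-sample weights, e.g. cluster fractions
    """
    total = 0.0
    for logits, soft, w in zip(batch_logits, batch_soft_labels, weights):
        logp = log_softmax(logits)
        sample_loss = -sum(q * lp for q, lp in zip(soft, logp))
        total += w * sample_loss
    return total / sum(weights)


# Two synthetic images: the first represents a cluster three times larger.
logits = [[0.0, 0.0], [10.0, 0.0]]
soft_labels = [[1.0, 0.0], [1.0, 0.0]]
cluster_weights = [0.75, 0.25]
print(weighted_soft_ce(logits, soft_labels, cluster_weights))
```

The effect of the weights is that a synthetic image standing in for a large cluster contributes proportionally more to the gradient than one standing in for a small cluster.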

Optimization Features

System Optimization

constant memory w.r.t. IPC

Training Optimization

data-efficient training, weighted-sample training, soft-label distillation

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

ImageNet-1K (public dataset)

Risks & Boundaries

Limitations

Method quality depends on the latent diffusion prior; poor priors hurt fidelity.

Weights can be sensitive to training hyperparameters and learning rate.

When Not To Use

You lack a reliable pretrained latent diffusion model or GPU resources to synthesize images.

You need exact, provenance-traceable real data for auditing or legal reasons.

Failure Modes

Generative prior can bias synthetic data away from true rare modes.

k-means can converge to local minima, producing suboptimal quantizers.
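A common mitigation for the local-minimum failure mode, which is standard k-means practice rather than something the paper is stated to do, is to run the clustering from several random initializations and keep the run with the lowest within-cluster squared error. The sketch below uses 1D data and hypothetical helper names (`lloyd_1d`, `best_of_restarts`) to keep it short.

```python
import random


def lloyd_1d(xs, k, seed, iters=100):
    """One run of Lloyd's k-means on 1D data; returns (centers, inertia)."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            groups[nearest].append(x)
        # Empty clusters keep their previous center.
        updated = [
            sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)
        ]
        if updated == centers:  # converged
            break
        centers = updated
    # Inertia: total squared distance from each point to its nearest center.
    inertia = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, inertia


def best_of_restarts(xs, k, n_restarts=10):
    """Run k-means from several seeds and keep the lowest-inertia result."""
    runs = [lloyd_1d(xs, k, seed) for seed in range(n_restarts)]
    return min(runs, key=lambda run: run[1])


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
centers, inertia = best_of_restarts(data, k=3)
print(sorted(centers), inertia)
```

Restarts do not guarantee the global optimum, but they make it less likely that a single unlucky initialization determines the synthetic set.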

Core Entities

Models

DDOQ (this paper), D4M, Stable Diffusion v1.5 (LDM), DiT (Diffusion Transformer), ResNet-18, ResNet-50, ResNet-101, Swin-T, MobileNet-V2, EfficientNet-B0

Metrics

Accuracy, Wasserstein-2 (latent), Inception Score, FID

Datasets

ImageNet-1K, ImageNette, ImageWoof

Benchmarks

Accuracy, Wasserstein-2 distance, IPC (images per class)
