Overview
The method is simple: it relies on public datasets and standard compression. Results are strong on many natural-image tasks, but clients must store a large reference set, and far-OOD domains need tuned filtering.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If clients can store a shared unlabeled image pool, servers can deliver new classification tasks with tiny label-only payloads (<1 MB). This cuts recurring transfer costs drastically and enables operation over very low-bandwidth links.
Summary TLDR
PLADA (Pseudo-Labels as Data) lets a server convey a classification task by sending only hard labels for images in a large, preloaded reference dataset (e.g., ImageNet-21K). Using energy-based pruning, a class-preserving Safety-Net, and standard compression (Zstd), PLADA often fits the task payload under 1 MB (85–206 KB at 1% keep) while retaining strong classification accuracy on many natural-image benchmarks. Far-OOD tasks (e.g., medical images) need a different selection rule (keeping high-energy samples) and show larger accuracy drops. The method requires clients to store the reference image pool beforehand.
Problem Statement
Dataset servers must repeatedly send large training data to heterogeneous clients. Sending model weights is not always feasible. Existing dataset distillation struggles to scale to high-resolution data or to produce tiny payloads. We need a method that compresses the training signal by orders of magnitude while keeping client-side training effective under extreme bandwidth limits.
Main Contribution
PLADA: represent a task by sending only hard pseudo-labels for a preloaded reference image pool, eliminating pixel transfer.
Pruning + Safety-Net: use energy-based OOD scores to keep a tiny fraction (1%–10%) of reference images and a class quota to avoid class collapse under extreme compression.
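The pruning and Safety-Net step can be illustrated with a minimal sketch. It assumes the standard energy score E(x) = -T · logsumexp(logits / T) computed from teacher logits, and it assumes that low-energy samples are kept by default (the summary only states that far-OOD tasks switch to high-energy selection). The function names, the `min_per_class` quota, and the tie-breaking rule are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def energy(logits, temperature=1.0):
    """Energy score E = -T * logsumexp(logits / T); lower means more confident."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def select_with_safety_net(all_logits, keep_fraction=0.01,
                           min_per_class=1, keep_low_energy=True):
    """Return (kept indices, hard labels) for a pool of teacher logits.

    Assumption: the quota is applied per predicted class. If the quota
    alone needs more slots than the budget, the result exceeds the budget.
    """
    n = len(all_logits)
    # Hard pseudo-label = argmax of the teacher logits.
    labels = [max(range(len(l)), key=l.__getitem__) for l in all_logits]
    energies = [energy(l) for l in all_logits]
    # Rank the pool by energy (ascending when keeping low-energy samples).
    order = sorted(range(n), key=lambda i: energies[i],
                   reverse=not keep_low_energy)
    budget = max(1, int(round(keep_fraction * n)))

    # Safety net: reserve the best-ranked samples of every predicted class.
    per_class = defaultdict(list)
    for i in order:
        if len(per_class[labels[i]]) < min_per_class:
            per_class[labels[i]].append(i)
    kept = {i for idxs in per_class.values() for i in idxs}

    # Fill the remaining budget by global energy rank.
    for i in order:
        if len(kept) >= budget:
            break
        kept.add(i)

    kept = sorted(kept)
    return kept, [labels[i] for i in kept]
```

In this sketch, setting `min_per_class=0` disables the quota, which makes it easy to observe rare classes disappearing at small `keep_fraction` values. Setting `keep_low_energy=False` corresponds to the high-energy selection mentioned for far-OOD tasks.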
Key Findings
Task transfer with payloads well below 1 MB is practical.
Aggressive pruning improves or preserves accuracy on many natural-image tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Compressed payload size (Zstd) | 85–206 KB at 1% keep (ImageNet-21K reference) | — | — | Aggregate (Table 4 ranges) | Table 4: Zstd sizes for p=1% | Table 4 |
| Accuracy (1% keep) | Caltech-101: 79.84% | Full reference (100% keep): 92.74% | -12.90 pp | Caltech-101 (Table 1) | Table 1, 1% vs 100% | Table 1 |
What To Try In 7 Days
Preload a moderate reference image pool on test clients (ImageNet-like or domain-specific).
Implement teacher-side pseudo-labeling and energy-based ranking on one target task.
Compress the selected indices+labels with delta/RLE and Zstd and measure payload vs local training accuracy.
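For the compression step, one possible payload layout is sketched below. It assumes sorted reference-pool indices stored as gaps (delta encoding) in variable-length integers, followed by the labels, and then a general-purpose compressor. The byte layout and all function names are hypothetical, and the run-length (RLE) stage is omitted for brevity. The sketch uses the third-party `zstandard` package when it is installed and falls back to the standard-library `zlib` otherwise, so sizes measured with the fallback are not Zstd sizes.

```python
import zlib

try:
    import zstandard  # third-party Zstd bindings; optional in this sketch
except ImportError:
    zstandard = None


def encode_varint(n):
    """Encode a non-negative int as LEB128 (7 bits per byte, MSB = continue)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf, pos):
    """Decode one LEB128 value starting at pos; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def pack_payload(indices, labels):
    """Hypothetical layout: count, then index gaps, then labels (all varints)."""
    out = bytearray(encode_varint(len(indices)))
    prev = 0
    for idx in indices:  # indices must be sorted in increasing order
        out += encode_varint(idx - prev)
        prev = idx
    for lab in labels:
        out += encode_varint(lab)
    return bytes(out)


def unpack_payload(raw):
    """Invert pack_payload and return (indices, labels)."""
    count, pos = read_varint(raw, 0)
    indices, prev = [], 0
    for _ in range(count):
        gap, pos = read_varint(raw, pos)
        prev += gap
        indices.append(prev)
    labels = []
    for _ in range(count):
        lab, pos = read_varint(raw, pos)
        labels.append(lab)
    return indices, labels


def compress(raw):
    if zstandard is not None:
        return zstandard.ZstdCompressor(level=19).compress(raw)
    return zlib.compress(raw, 9)  # stand-in when zstandard is unavailable


def decompress(blob):
    if zstandard is not None:
        return zstandard.ZstdDecompressor().decompress(blob)
    return zlib.decompress(blob)
```

Measuring `len(compress(pack_payload(indices, labels)))` for each keep fraction gives the payload size to plot against local training accuracy.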
Risks & Boundaries
Limitations
Requires clients to store a large unlabeled reference dataset locally.
Evaluated only on classification tasks; regression and generative tasks are not yet handled.
When Not To Use
Clients cannot store the reference image pool due to storage or privacy constraints.
Tasks are regression or generative and cannot be represented by hard labels alone.
Failure Modes
Class collapse when extreme pruning removes rare classes, unless the Safety-Net is used.
Spurious label mappings on far-OOD tasks (e.g., medical datasets) can cause the student model to collapse.

