Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
If clients can store a shared unlabeled image pool, servers can deliver new classification tasks as tiny label-only payloads (<1 MB). Because no pixels are sent, recurring transfer costs fall sharply and tasks can be delivered over very low-bandwidth links.
Summary TLDR
PLADA (Pseudo-Labels as Data) lets a server convey a classification task by sending only hard labels for images in a large, preloaded reference dataset (e.g., ImageNet-21K). Using energy-based pruning, a class-preserving Safety-Net, and standard compression (Zstd), PLADA often fits the task payload under 1 MB (85–206 KB at a 1% keep rate) while retaining strong classification accuracy on many natural-image benchmarks. Far-OOD tasks (medical images) need a different selection rule (keep high-energy samples) and show larger accuracy drops. The method requires clients to store the reference image pool beforehand.
Problem Statement
Dataset servers must repeatedly send large training data to heterogeneous clients. Sending model weights is not always feasible. Existing dataset distillation struggles to scale to high-resolution data or to produce tiny payloads. We need a method that compresses the training signal by orders of magnitude while keeping client-side training effective under extreme bandwidth limits.
Main Contribution
PLADA: represent a task by sending only hard pseudo-labels for a preloaded reference image pool, eliminating pixel transfer.
Pruning + Safety-Net: rank reference images by energy-based OOD score and keep only a tiny fraction (1%–10%), with a per-class quota (the Safety-Net) to avoid class collapse under extreme compression.
Compression analysis: combine index-delta/RLE encoding with Zstd to reduce payloads at a 1% keep rate to tens or hundreds of KB while preserving accuracy on diverse datasets.
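The pruning and Safety-Net steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common energy score E(x) = -T * logsumexp(logits / T), where lower energy means more in-distribution, and a simple "reserve the best-ranked images per pseudo-class, then fill by global rank" quota. The names `energy`, `select_with_quota`, `min_per_class`, and `keep_high_energy` are hypothetical.

```python
import math
from collections import defaultdict

def energy(logits, temperature=1.0):
    """Energy score -T * logsumexp(logits / T); lower = more in-distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def select_with_quota(logits_per_image, keep_rate, min_per_class,
                      keep_high_energy=False):
    """Return (sorted kept indices, hard pseudo-labels for those indices).

    Assumption: the quota is honored even if it pushes the kept set past
    the keep-rate budget.
    """
    n = len(logits_per_image)
    # Hard pseudo-label = argmax of the teacher logits.
    labels = [max(range(len(row)), key=row.__getitem__)
              for row in logits_per_image]
    scores = [energy(row) for row in logits_per_image]
    # Default: lowest energy first. Far-OOD targets may need the reverse.
    order = sorted(range(n), key=scores.__getitem__, reverse=keep_high_energy)
    budget = max(1, round(keep_rate * n))

    kept = set()
    per_class = defaultdict(int)
    # Pass 1 (Safety-Net): reserve the best-ranked images of every class.
    for i in order:
        if per_class[labels[i]] < min_per_class:
            kept.add(i)
            per_class[labels[i]] += 1
    # Pass 2: fill whatever budget remains by global rank.
    for i in order:
        if len(kept) >= budget:
            break
        kept.add(i)

    idx = sorted(kept)
    return idx, [labels[i] for i in idx]

# Toy pool: nine confident class-0 images and one less confident class-1 image.
toy_logits = [[5.0, 0.0]] * 9 + [[0.0, 1.0]]
print(select_with_quota(toy_logits, keep_rate=0.2, min_per_class=0))
print(select_with_quota(toy_logits, keep_rate=0.2, min_per_class=1))
```

In the toy pool, the rare class disappears without the quota and survives with `min_per_class=1`, which is the class-collapse effect the Safety-Net is meant to prevent. Setting `keep_high_energy=True` flips the ranking, mirroring the inverted selection reported for far-OOD medical tasks.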
Key Findings
Task transfer with payloads well below 1 MB is practical.
Aggressive pruning improves or preserves accuracy on many natural-image tasks.
Safety-Net prevents class collapse and substantially raises accuracy on imbalanced datasets.
Far out-of-distribution (medical) tasks need inverted selection rules.
Results
Compressed payload size (Zstd)
Accuracy
Safety-Net effect (1% keep)
Medical OOD selection reversal
Who Should Care
What To Try In 7 Days
Preload a moderate reference image pool on test clients (ImageNet-like or domain-specific).
Implement teacher-side pseudo-labeling and energy-based ranking on one target task.
Compress the selected indices+labels with delta/RLE and Zstd and measure payload vs local training accuracy.
Agent Features
Tool Use
- Zstd
- RLE
- Huffman coding
Optimization Features
Infra Optimization
- reduce bandwidth by replacing pixel transfer with label payload
System Optimization
- delta-index encoding
- bitmap vs index choice
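The bitmap-versus-index trade-off can be estimated with back-of-the-envelope arithmetic before any entropy coding. The sketch below is illustrative only: the pool size of 10,000,000 images is a hypothetical round number, not a figure from the paper, and the function names are made up for this example. A bitmap costs one bit per reference image regardless of keep rate, while a raw index list costs one fixed-width integer per kept image.

```python
def bitmap_bytes(n_reference):
    """One bit per reference image, rounded up to whole bytes."""
    return (n_reference + 7) // 8

def index_bytes(n_reference, n_kept):
    """Fixed-width index per kept image, wide enough to address the pool."""
    bits = max(1, (n_reference - 1).bit_length())
    return (n_kept * bits + 7) // 8

def cheaper_encoding(n_reference, keep_rate):
    """Pick the smaller raw (pre-compression) representation."""
    n_kept = round(n_reference * keep_rate)
    b = bitmap_bytes(n_reference)
    i = index_bytes(n_reference, n_kept)
    return ("bitmap", b) if b <= i else ("index", i)

N = 10_000_000  # hypothetical pool size, needs 24-bit indices
print(cheaper_encoding(N, 0.01))  # ('index', 300000)
print(cheaper_encoding(N, 0.10))  # ('bitmap', 1250000)
```

Under these assumptions the raw index list wins at a 1% keep rate and the bitmap wins at 10%, with the crossover near 1/24 (about 4%). Delta encoding followed by a general-purpose compressor shrinks both representations, so the break-even point should be measured on real selections rather than taken from this estimate.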
Training Optimization
- pruning dataset with energy-based OOD scores
- Safety-Net class quota to avoid collapse
- importance weighting (discussed)
Reproducibility
Data Urls
- ImageNet-21K and ImageNet-1K (public references); target datasets listed in paper (public sources)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires clients to store a large unlabeled reference dataset locally.
- Evaluated only on classification tasks; regression and generative tasks are not yet handled.
- Selection hyperparameters and Safety-Net quotas must be tuned per domain.
- Far-OOD domains (medical) can fail under default low-energy filtering and need alternate heuristics.
When Not To Use
- Clients cannot store the reference image pool due to storage or privacy constraints.
- Tasks are regression or generative and cannot be represented by hard labels alone.
- Reference dataset has near-zero semantic overlap with target domain and no viable selection strategy exists.
Failure Modes
- Class collapse when extreme pruning removes rare classes unless Safety-Net is used.
- Spurious label mappings on far-OOD tasks (medical datasets) can cause the student to collapse.
- Payload is dominated by the index bitmap at moderate keep rates unless delta/RLE encoding is used.
Core Entities
Models
- ConvNeXt-V2-Tiny (teacher)
- ResNet-18 (student)
Metrics
- Accuracy
- payload size (KB/MB)
- keep rate p (%)
Datasets
- ImageNet-21K (reference)
- ImageNet-1K (reference)
- Caltech-101
- CIFAR-10
- CUB-200
- DTD
- FGVC-Aircraft
- Food-101
- Oxford-Flowers-102
- Oxford-IIIT-Pet
- Places365
- RESISC45
- BloodMNIST
- DermaMNIST
- RetinaMNIST
- NCT-CRC-HE-100K
Benchmarks
- 14 classification datasets (10 natural + 4 medical OOD)
Context Entities
Models
- linear probe baseline
- INT8 quantized ResNet-18 (baseline variants)
Metrics
- Accuracy
- intersection fraction (<1% overlap)
Datasets
- reference vs target split analysis for leakage checks

