Send tasks as tiny label payloads: train clients from a shared image pool using <1 MB

February 26, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple and relies on public datasets and standard compression. Results are strong on many natural-image tasks, but clients must store a large reference set, and far-OOD domains need tuned filtering.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

Why It Matters For Business

If clients can store a shared unlabeled image pool, servers can deliver new classification tasks with tiny label-only payloads (<1 MB). This cuts recurring transfer costs drastically and enables operation over very low-bandwidth links.

Who Should Care

Summary TLDR

PLADA (Pseudo-Labels as Data) lets a server convey a classification task by sending only hard labels for images in a large, preloaded reference dataset (e.g., ImageNet-21K). Using energy-based pruning, a class-preserving Safety-Net, and standard compression (Zstd), PLADA often fits the task payload under 1 MB (85–206 KB at 1% keep) while retaining strong classification accuracy on many natural-image benchmarks. Far-OOD tasks (medical images) need a different selection rule (keeping high-energy images) and show larger accuracy drops. The method requires clients to store the reference image pool beforehand.

Problem Statement

Dataset servers must repeatedly send large training data to heterogeneous clients. Sending model weights is not always feasible. Existing dataset distillation struggles to scale to high-resolution data or to produce tiny payloads. We need a method that compresses the training signal by orders of magnitude while keeping client-side training effective under extreme bandwidth limits.

Main Contribution

PLADA: represent a task by sending only hard pseudo-labels for a preloaded reference image pool, eliminating pixel transfer.

Pruning + Safety-Net: use energy-based OOD scores to keep a tiny fraction (1%–10%) of reference images and a class quota to avoid class collapse under extreme compression.
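
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the common logsumexp form of the energy score, and the names `energy`, `select_with_safety_net`, `keep_frac`, `min_per_class`, and `prefer` are hypothetical. The exact Safety-Net rule is not given in this summary, so the per-class quota below (always keep each class's best-ranked images) is an assumption. Pure Python is used for clarity; a real pipeline would vectorise this.

```python
import math
from collections import defaultdict

def energy(logits, temperature=1.0):
    """Energy score E(x) = -T * logsumexp(logits / T); lower suggests in-distribution."""
    z = [v / temperature for v in logits]
    m = max(z)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(v - m) for v in z)))

def select_with_safety_net(all_logits, keep_frac=0.01, min_per_class=1, prefer="low"):
    """Return a sorted list of (reference_index, hard_label) pairs to transmit."""
    # Hard pseudo-label = argmax of the teacher's logits for each reference image.
    labels = [max(range(len(row)), key=row.__getitem__) for row in all_logits]
    # Rank by energy: ascending by default, descending for the far-OOD variant.
    order = sorted(range(len(all_logits)),
                   key=lambda i: energy(all_logits[i]),
                   reverse=(prefer == "high"))
    budget = max(1, round(keep_frac * len(all_logits)))
    keep = set(order[:budget])
    # Safety-Net (assumed form): keep the best-ranked min_per_class images of
    # every predicted class, even if they fall outside the global budget.
    per_class = defaultdict(int)
    for i in order:
        if per_class[labels[i]] < min_per_class:
            keep.add(i)
            per_class[labels[i]] += 1
    return [(i, labels[i]) for i in sorted(keep)]
```

With this form the payload can slightly exceed the nominal keep rate, because quota images are added on top of the budget; whether the paper instead carves the quota out of the budget is not stated here.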

Key Findings

Task transfer with payloads well below 1 MB is practical.

Numbers: Zstd-compressed payload at 1% keep: 85–206 KB (Table 4)

Practical Use: If clients preload a large unlabeled image set, servers can send task labels instead of images and fit extreme links (deep-sea, rover).

Evidence Ref: Table 4

Aggressive pruning improves or preserves accuracy on many natural-image tasks.

Numbers: Example: Caltech-101 accuracy with 1% energy-filtered keep = 79.84% vs 92.74% with the full reference set (Table 1)

Practical Use: Send labels for only the 1%–10% lowest-energy reference images to cut bandwidth and often improve the client's final accuracy.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Compressed payload size (Zstd) | 85–206 KB at 1% keep (ImageNet-21K reference) | - | - | Aggregate (Table 4 ranges) | Table 4: Zstd sizes for p=1% | Table 4
Accuracy | Caltech-101: 79.84% | Full reference (100%): 92.74% | -12.90 pp | Caltech-101 (Table 1) | Table 1, 1% vs 100% | Table 1

What To Try In 7 Days

Preload a moderate reference image pool on test clients (ImageNet-like or domain-specific).

Implement teacher-side pseudo-labeling and energy-based ranking on one target task.

Compress the selected indices+labels with delta/RLE and Zstd and measure payload vs local training accuracy.
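
For the third step, a rough sketch of an index+label payload codec is shown below. It is an illustration under stated assumptions, not the paper's wire format: the byte layout, the varint delta coding, and the names `encode_payload` and `decode_payload` are hypothetical, and no RLE stage is included. It uses the third-party `zstandard` package when installed and otherwise falls back to stdlib `zlib` as a stand-in, marking the choice with a one-byte prefix.

```python
import zlib

try:
    import zstandard  # third-party and optional; zlib is used if it is missing
except ImportError:
    zstandard = None

def _varint(n):
    """Encode a non-negative integer as a LEB128-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_payload(pairs):
    """pairs: (reference_index, label) tuples sorted by index."""
    raw = bytearray(_varint(len(pairs)))
    prev = 0
    for idx, _ in pairs:
        raw += _varint(idx - prev)  # store gaps between indices, not the indices
        prev = idx
    for _, label in pairs:          # labels go in their own run after the gaps
        raw += _varint(label)
    if zstandard is not None:
        return b"Z" + zstandard.ZstdCompressor(level=19).compress(bytes(raw))
    return b"D" + zlib.compress(bytes(raw), 9)

def decode_payload(blob):
    """Inverse of encode_payload; returns the list of (index, label) pairs."""
    if blob[:1] == b"Z":
        raw = zstandard.ZstdDecompressor().decompress(blob[1:])
    else:
        raw = zlib.decompress(blob[1:])
    pos = 0

    def read():
        nonlocal pos
        n = shift = 0
        while True:
            b = raw[pos]
            pos += 1
            n |= (b & 0x7F) << shift
            if not b & 0x80:
                return n
            shift += 7

    count = read()
    idxs, prev = [], 0
    for _ in range(count):
        prev += read()
        idxs.append(prev)
    return [(i, read()) for i in idxs]
```

Comparing `len(encode_payload(pairs))` against the student's accuracy at several keep rates gives the payload-versus-accuracy curve that the step asks for.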

Agent Features

Tool Use
Zstd, RLE, Huffman coding

Optimization Features

Infra Optimization
reduce bandwidth by replacing pixel transfer with label payload
System Optimization
delta-index encoding, bitmap vs index choice
Training Optimization
pruning dataset with energy-based OOD scores, Safety-Net class quota to avoid collapse, importance weighting (discussed)
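
The "bitmap vs index choice" can be made concrete with a back-of-envelope cost comparison. The rule below is an assumption for illustration, not taken from the paper: it compares raw sizes before any entropy coding, and `cheaper_selection_encoding` is a hypothetical name. A bitmap spends one bit per reference image, while an index list spends enough bits to address the pool for each kept image.

```python
def cheaper_selection_encoding(pool_size, kept):
    """Return (encoding_name, raw_bits) for the cheaper way to mark kept images."""
    bitmap_bits = pool_size                            # one bit per reference image
    bits_per_index = max(1, (pool_size - 1).bit_length())
    index_bits = kept * bits_per_index                 # fixed-width index per kept image
    if index_bits < bitmap_bits:
        return "index", index_bits
    return "bitmap", bitmap_bits

# Example with a hypothetical pool of 1,000,000 images (20 bits per index):
print(cheaper_selection_encoding(1_000_000, 10_000))   # 1% keep
print(cheaper_selection_encoding(1_000_000, 100_000))  # 10% keep
```

Under this raw estimate the index list wins at low keep rates and the bitmap wins once the keep rate exceeds roughly one over the index width. Delta coding and Zstd change the absolute sizes, so the crossover should be measured rather than assumed.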

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ImageNet-21K and ImageNet-1K (public references); target datasets listed in paper (public sources)

Risks & Boundaries

Limitations

Requires clients to store a large unlabeled reference dataset locally.

Works only for classification tasks as evaluated; regression/generative tasks are not handled yet.

When Not To Use

Clients cannot store the reference image pool due to storage or privacy constraints.

Tasks are regression or generative and cannot be represented by hard labels alone.

Failure Modes

Class collapse when extreme pruning removes rare classes unless Safety-Net is used.

Spurious label mappings for far-OOD tasks causing student collapse (medical datasets).

Core Entities

Models

ConvNeXt-V2-Tiny (teacher), ResNet-18 (student)

Metrics

Accuracy, payload size (KB/MB), keep rate p (%)

Datasets

ImageNet-21K (reference), ImageNet-1K (reference), Caltech-101, CIFAR-10, CUB-200, DTD, FGVC-Aircraft, Food-101, Oxford-Flowers-102, Oxford-IIIT-Pet, Places365, RESISC45, BloodMNIST, DermaMNIST, RetinaMNIST, NCT-CRC-HE-100K

Benchmarks

14 classification datasets (10 natural + 4 medical OOD)

Context Entities

Models

linear probe baseline, INT8 quantized ResNet-18 (baseline variants)

Metrics

Accuracy, intersection fraction (<1% overlap)

Datasets

reference vs target split analysis for leakage checks
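
A leakage check of this kind could be approximated as below. This summary does not say how the overlap was measured, so the approach is an assumption: it hashes raw file bytes with SHA-256 and reports the fraction of target items that also occur in the reference pool. The name `intersection_fraction` is hypothetical. Exact hashing only catches byte-identical files; resized or re-encoded copies would need perceptual hashing or embedding similarity instead.

```python
import hashlib

def intersection_fraction(reference_blobs, target_blobs):
    """Fraction of target items whose SHA-256 digest also occurs in the reference pool."""
    ref = {hashlib.sha256(b).digest() for b in reference_blobs}
    target = [hashlib.sha256(b).digest() for b in target_blobs]
    if not target:
        return 0.0
    return sum(d in ref for d in target) / len(target)

# Toy usage with in-memory byte strings standing in for image files.
reference = [b"img-a", b"img-b", b"img-c"]
target = [b"img-a", b"img-x", b"img-y", b"img-z"]
print(intersection_fraction(reference, target))  # one of four target items overlaps
```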