Cut training cost for vision transformers by combining attention-based data selection and two-step sparsity pruning.

May 1, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Readiness is low because the experiments are small-scale (CIFAR-10 only) and the results are mixed. Combining attention-based data selection with pruning is a practical idea, but the ISSP runs reported in this paper failed, so the approach needs more validation at scale.

Citations: 0

Evidence Strength: 0.30

Confidence: 0.60

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 20%

Novelty: 50%

Authors

Ojasw Upadhyay


Why It Matters For Business

You can potentially cut training time and compute by pruning modestly (e.g., 30%) and training on attention-selected data. This reduces iteration time for product experiments and lowers cloud/GPU costs if accuracy loss is acceptable.

Summary TLDR

LOTUS proposes picking informative image patches using attention (a "data lottery ticket") and applying two-stage sparsity pruning (magnitude pruning followed by an Instant Soup style mask) to speed up training of a pre-trained Vision Transformer on CIFAR-10. Key empirical signals: a one-shot 30% magnitude prune keeps accuracy at about 79%; fine-tuning on selected patches converges quickly (near state of the art by epoch 5); but the Instant Sparse Soup Pruning (ISSP) variant cut accuracy to roughly 50% in the reported runs.

Problem Statement

Vision transformers give strong results but cost a lot of compute to train. The paper asks: can we reduce training time and compute by (1) training on smaller, attention-selected subsets of data and (2) pruning redundant weights, while keeping accuracy?

Main Contribution

LOTUS workflow: combine attention-based data lottery tickets with two-step pruning to speed ViT training.

Data lottery ticket method: pick important image patches using attention maps and drop low-attention patches (example: 10% patches removed).
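The patch-selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes you already have one attention score per patch (for example, averaged over heads), and the function name `select_patches`, the `drop_fraction` parameter, and the example scores are all hypothetical.

```python
def select_patches(attention_scores, drop_fraction=0.10):
    """Return the indices of patches to keep, in their original order.

    attention_scores: one score per patch (assumed precomputed).
    drop_fraction: share of lowest-attention patches to discard.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n = len(attention_scores)
    n_drop = int(n * drop_fraction)  # floor, so we never drop more than requested
    # Rank patch indices from lowest to highest attention.
    ranked = sorted(range(n), key=lambda i: attention_scores[i])
    dropped = set(ranked[:n_drop])
    return [i for i in range(n) if i not in dropped]


# Hypothetical scores for one image split into 10 patches.
scores = [0.02, 0.15, 0.08, 0.30, 0.01, 0.12, 0.09, 0.11, 0.07, 0.05]
kept = select_patches(scores, drop_fraction=0.10)
print(kept)  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```

With 10 patches and a 10% drop, only the single lowest-scoring patch (index 4) is removed. In a real pipeline the kept indices would be used to gather patch embeddings before fine-tuning.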

Key Findings

One-shot magnitude pruning at 30% sparsity kept model accuracy high.

Numbers: Accuracy ≈ 79% at 30% sparsity (CIFAR-10)

Practical Use: Start with a modest one-shot magnitude prune (~30%) to cut parameters with small accuracy loss; test this setting as a baseline before aggressive pruning.

Evidence Ref: Figure 2; Results section: 'model pruned at 30% sparsity maintained a high ... ~
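For readers unfamiliar with one-shot magnitude pruning, here is a minimal sketch of the general technique on a flat list of weights. It uses plain Python to stay dependency-free; the function name `magnitude_prune` and the example weights are assumptions for illustration, and a real ViT would apply the same idea to framework tensors.

```python
def magnitude_prune(weights, sparsity=0.30):
    """Zero the smallest-magnitude fraction of weights in one shot.

    Returns (pruned_weights, mask), where mask[i] is 1 for kept weights.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical weights, not taken from any model.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.3]
pruned, mask = magnitude_prune(weights, sparsity=0.30)
print(mask)  # [1, 0, 1, 0, 1, 1, 0, 1, 1, 1]
```

The three weights closest to zero are removed, giving 30% sparsity. Accuracy after pruning still has to be measured on a validation set; the sketch only shows how the mask is chosen.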

Fine-tuning on attention-selected data patches converged much faster.

Numbers: Reached close to SOTA by epoch 5 (after fine-tuning on lottery data)

Practical Use: Try training on attention-selected subsets to reduce epochs and wall-clock training time; monitor validation by epoch 5 for early stopping.

Evidence Ref: Results section; Figure 4 (accuracy and loss plots)
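One way to act on the early-stopping advice is a simple patience rule on validation accuracy. The sketch below is an assumption about how you might monitor training, not something the paper specifies; `early_stop_epoch`, `patience`, `min_delta`, and the accuracy history are hypothetical and do not come from the paper's figures.

```python
def early_stop_epoch(val_acc, patience=2, min_delta=0.005):
    """Return the 1-based epoch at which training would stop.

    Stops after `patience` consecutive epochs without an improvement
    larger than `min_delta`; returns len(val_acc) if never triggered.
    """
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best + min_delta:
            best = acc
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_acc)


# Made-up validation accuracies that plateau after epoch 5.
history = [0.62, 0.71, 0.76, 0.78, 0.79, 0.791, 0.792, 0.79]
stop_at = early_stop_epoch(history)
print(stop_at)  # 7
```

Here accuracy stops improving meaningfully after epoch 5, so the rule halts training two epochs later. Tightening `patience` trades a little safety margin for shorter runs.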

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | ≈79% | - | - | CIFAR-10, one-shot 30% sparsity | Model pruned at 30% sparsity maintained ~79% accuracy (minimal degradation reported). | Results section; Figure 2
Epochs to near-SOTA | ≈5 epochs | - | - | Fine-tuning on lottery data (CIFAR-10) | Fine-tuning on lottery data reached close to SOTA by the fifth epoch. | Results section; Figure 4

What To Try In 7 Days

Take a pre-trained ViT and run one-shot magnitude pruning at 30%; measure validation accuracy.

Compute attention maps on a small holdout and drop lowest 10% patches; fine-tune on remaining patches and monitor validation by epoch 5.

If accuracy collapses, avoid the Instant Sparse Soup pipeline until you validate mask creation using >10% data.

Optimization Features

Model Optimization: magnitude pruning (one-shot); Instant Soup style mask merging (ISP/ISSP); two-step pruning (essential sparsity, then ISP)
Training Optimization: data lottery tickets via attention-based patch selection; fine-tuning the pruned model on a reduced-data subset to reduce epochs
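This summary does not describe how the paper merges masks, so the sketch below is not the ISP/ISSP procedure. It only illustrates one simple way several candidate pruning masks could be combined, a majority vote, to show why masks built from small, unrepresentative data subsets can disagree. The function `merge_masks`, the `min_votes` parameter, and the example masks are all hypothetical.

```python
def merge_masks(masks, min_votes=None):
    """Combine binary masks by voting: keep a weight if enough masks keep it.

    masks: list of equal-length lists of 0/1 values.
    min_votes: votes needed to keep a weight (default: strict majority).
    """
    if not masks:
        raise ValueError("need at least one mask")
    n = len(masks[0])
    if any(len(m) != n for m in masks):
        raise ValueError("all masks must have the same length")
    if min_votes is None:
        min_votes = len(masks) // 2 + 1
    # Count, per weight position, how many masks keep that weight.
    votes = [sum(column) for column in zip(*masks)]
    return [1 if v >= min_votes else 0 for v in votes]


# Three made-up masks, e.g. from pruning runs on different data subsets.
candidate_masks = [
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 1],
]
merged = merge_masks(candidate_masks)
print(merged)  # [1, 0, 1, 1, 0, 1]
```

Positions where the candidate masks disagree (indices 1, 2, and 3 here) are exactly where too little data could tip the vote the wrong way.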

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Experiments limited to a pre-trained ViT on CIFAR-10 (small dataset).

ISSP mask creation used only 10% of the data; the paper notes this likely harmed results.

When Not To Use

Do not deploy the reported ISSP pipeline at scale until masks and pruning settings are validated.

Avoid aggressive pruning (>50%) without careful validation; the paper reports accuracy collapse.

Failure Modes

Aggressive pruning removes important weights and causes large accuracy drops (accuracy fell to about 50% in the reported runs).

Denoised mask built from too little data (10%) may be unrepresentative and harm performance.

Core Entities

Models

Vision Transformer (pre-trained ViT)

Metrics

Accuracy, eval loss, epochs to converge, sparsity level (%)

Datasets

CIFAR-10

Context Entities

Models

pretrained transformer

Datasets

small-image benchmarks (CIFAR-10 used here)