Cut training cost for vision transformers by combining attention-based data selection and two-step sparsity pruning.

Overview

Decision Snapshot: Needs Validation

The experiments are small-scale (CIFAR-10 only) and the results are mixed, which lowers readiness. Combining attention-based data selection with pruning is a practical idea, but the ISSP runs in this paper failed, so more validation is needed at scale.

Citations: 0

Evidence Strength: 0.30

Confidence: 0.60

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 20%

Novelty: 50%

Authors

Ojasw Upadhyay

Links

Abstract / PDF / Data

Why It Matters For Business

You can potentially cut training time and compute by pruning modestly (e.g., 30%) and training on attention-selected data. This reduces iteration time for product experiments and lowers cloud/GPU costs if accuracy loss is acceptable.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Founder

Summary TLDR

LOTUS proposes picking informative image patches using attention (a "data lottery ticket") and applying two-stage sparsity pruning (magnitude pruning, then an Instant Soup style mask) to speed up training of a pre-trained Vision Transformer on CIFAR-10. Key empirical signals: a one-shot 30% magnitude prune keeps ~79% accuracy, and fine-tuning on selected patches converges quickly (near state of the art by epoch 5). However, the Instant Sparse Soup Pruning (ISSP) variant caused large accuracy drops in the reported runs, with accuracy falling to roughly 50%.

Problem Statement

Vision transformers give strong results but cost a lot of compute to train. The paper asks: can we reduce training time and compute by (1) training on smaller, attention-selected subsets of data and (2) pruning redundant weights, while keeping accuracy?

Main Contribution

LOTUS workflow: combine attention-based data lottery tickets with two-step pruning to speed ViT training.

Data lottery ticket method: pick important image patches using attention maps and drop low-attention patches (example: 10% patches removed).
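The patch-selection step can be illustrated with a minimal sketch. It assumes one attention score per patch is already available (for example, CLS-to-patch attention averaged over heads; the summary does not say how the paper aggregates attention). The function name `select_patches` and the toy scores are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence


def select_patches(attn_scores: Sequence[float], drop_fraction: float = 0.10) -> List[int]:
    """Return the indices of patches to keep, in their original order.

    attn_scores: one score per patch (assumption: already aggregated
    over heads and layers; the aggregation rule is not from the source).
    drop_fraction: share of lowest-scoring patches to discard.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n_drop = int(len(attn_scores) * drop_fraction)
    # Rank patch indices from lowest to highest attention.
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i])
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(attn_scores)) if i not in dropped]


# Toy example (made-up scores): 10 patches, drop the lowest 10%.
scores = [0.12, 0.03, 0.20, 0.08, 0.15, 0.05, 0.10, 0.09, 0.11, 0.07]
kept = select_patches(scores, drop_fraction=0.10)
print(kept)  # index 1 (score 0.03) is the only patch removed
```

Keeping the surviving indices in their original order preserves the spatial sequence of patches, which matters if positional embeddings are reused.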

Key Findings

One-shot magnitude pruning at 30% sparsity kept model accuracy high.

Numbers: Accuracy ≈ 79% at 30% sparsity (CIFAR-10)

Practical Use: Start with a modest one-shot magnitude prune (~30%) to cut parameters with small accuracy loss; test this setting as a baseline before aggressive pruning.

Evidence Ref: Figure 2; Results section: 'model pruned at 30% sparsity maintained a high ... ~

Fine-tuning on attention-selected data patches converged much faster.

Numbers: Reached close to SOTA by epoch 5 (after fine-tuning on lottery data)

Practical Use: Try training on attention-selected subsets to reduce epochs and wall-clock training time; monitor validation by epoch 5 for early stopping.

Evidence Ref: Results section; Figure 4 (accuracy and loss plots)
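As a rough illustration of the first finding, one-shot magnitude pruning zeroes the weights with the smallest absolute values in a single pass. The sketch below is generic, not the paper's code: it works on a flat list of made-up weights, and the name `magnitude_prune` is hypothetical. Whether the paper prunes globally or per layer is not stated in this summary; the sketch assumes a single global ranking.

```python
from typing import List, Tuple


def magnitude_prune(weights: List[float], sparsity: float = 0.30) -> Tuple[List[float], List[int]]:
    """One-shot magnitude pruning over a flat list of weights.

    Zeroes the `sparsity` fraction of weights with the smallest absolute
    value. Returns (pruned_weights, mask) where mask[i] is 1 if kept.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Toy weights (made up for illustration).
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.6, 0.02, -0.8]
pruned, mask = magnitude_prune(weights, sparsity=0.30)
achieved = 1 - sum(mask) / len(mask)
print(mask, round(achieved, 2))
```

In a real model the same mask would typically be kept fixed during fine-tuning so that pruned weights stay at zero.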

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	≈79%	—	—	CIFAR-10, one-shot 30% sparsity	Model pruned at 30% sparsity maintained ~79% accuracy (minimal degradation reported).	Results section; Figure 2
epochs to near-SOTA	≈5 epochs	—	—	fine-tuning on lottery data (CIFAR-10)	Fine-tuning on lottery data reached close to SOTA by the fifth epoch.	Results section; Figure 4

What To Try In 7 Days

Take a pre-trained ViT and run one-shot magnitude pruning at 30%; measure validation accuracy.

Compute attention maps on a small holdout and drop lowest 10% patches; fine-tune on remaining patches and monitor validation by epoch 5.

If accuracy collapses, avoid the Instant Sparse Soup pipeline until you have validated mask creation with more than 10% of the data.
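The steps above need a concrete definition of "accuracy collapses". One possible check, sketched below, compares validation accuracy at a chosen epoch against the unpruned baseline. The function name `accuracy_collapsed`, the 5-point tolerance, and all numbers in the example are hypothetical choices for illustration, not values from the paper.

```python
from typing import Sequence


def accuracy_collapsed(baseline_acc: float,
                       val_acc_by_epoch: Sequence[float],
                       check_epoch: int = 5,
                       max_drop: float = 0.05) -> bool:
    """Return True if validation accuracy at `check_epoch` trails the
    baseline by more than `max_drop` (both thresholds are assumptions)."""
    if len(val_acc_by_epoch) < check_epoch:
        raise ValueError("not enough epochs recorded")
    acc = val_acc_by_epoch[check_epoch - 1]  # epochs are 1-indexed
    return (baseline_acc - acc) > max_drop


# Made-up accuracy curves, not results from the paper.
healthy_run = [0.60, 0.75, 0.82, 0.86, 0.88]
collapsed_run = [0.40, 0.45, 0.50, 0.50, 0.50]
print(accuracy_collapsed(0.90, healthy_run))
print(accuracy_collapsed(0.90, collapsed_run))
```

A gate like this makes the 7-day trial repeatable: every pruning or data-selection setting is judged against the same baseline and the same epoch.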

Optimization Features

Model Optimization

magnitude pruning (one-shot)

Instant Soup style mask merging (ISP/ISSP)

two-step pruning: essential sparsity then ISP

Training Optimization

data lottery tickets via attention-based patch selection

fine-tune pruned model on reduced-data subset to reduce epochs

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

https://www.cs.toronto.edu/~kriz/cifar.html (CIFAR-10)

Risks & Boundaries

Limitations

Experiments limited to a pre-trained ViT on CIFAR-10 (small dataset).

ISSP mask creation used only 10% of data—paper notes this likely harmed results.

When Not To Use

Do not deploy the reported ISSP pipeline at scale until masks and pruning settings are validated.

Avoid aggressive pruning (>50%) without careful validation—the paper reports accuracy collapse.

Failure Modes

Aggressive pruning removes important weights and causes large accuracy drops (observed ~50% accuracy).

Denoised mask built from too little data (10%) may be unrepresentative and harm performance.

Core Entities

Models

Vision Transformer (pre-trained ViT)

Metrics

Accuracy

eval loss

epochs to converge

sparsity level (%)

Datasets

CIFAR-10

Context Entities

Models

pretrained transformer

Datasets

small-image benchmarks (CIFAR-10 used here)

