Cut training cost for vision transformers by combining attention-based data selection and two-step sparsity pruning.

May 1, 2024 · 7 min

Overview

Production Readiness

0.2

Novelty Score

0.5

Cost Impact Score

0.5

Citation Count

0

Authors

Ojasw Upadhyay

Why It Matters For Business

You can potentially cut training time and compute by pruning modestly (e.g., 30%) and training on attention-selected data. This reduces iteration time for product experiments and lowers cloud/GPU costs if accuracy loss is acceptable.

Summary TLDR

LOTUS proposes selecting informative image patches using attention (a "data lottery ticket") and applying two-stage sparsity pruning (one-shot magnitude pruning followed by an Instant Soup-style mask) to speed up training of a pre-trained Vision Transformer on CIFAR-10. Key empirical signals: a one-shot 30% magnitude prune keeps ~79% accuracy; fine-tuning on selected patches converges quickly (reported as near state-of-the-art by epoch 5); but the Instant Sparse Soup Pruning (ISSP) variant dropped accuracy to ~50% in the reported runs.

Problem Statement

Vision transformers give strong results but cost a lot of compute to train. The paper asks: can we reduce training time and compute by (1) training on smaller, attention-selected subsets of data and (2) pruning redundant weights, while keeping accuracy?

Main Contribution

LOTUS workflow: combine attention-based data lottery tickets with two-step pruning to speed ViT training.

Data lottery ticket method: pick important image patches using attention maps and drop low-attention patches (example: 10% patches removed).

Two-step pruning: one-shot magnitude pruning to find an "essential sparsity" level, then Instant Soup-style merged masks (ISSP) using masks unioned across sparsities and a denoised mask from a model trained on 10% data.

Empirical signals on CIFAR-10 with a pre-trained ViT: 30% one-shot magnitude pruning retained ~79% accuracy; fine-tuning on lottery data converged rapidly (near SOTA by epoch 5); ISSP runs in this paper dropped accuracy to ~50% and need follow-up.
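
The mask-merging idea in the second pruning step can be illustrated with a minimal sketch. The summary does not say how ISSP builds or merges its masks, so this only shows an element-wise union of keep-masks. The helper names (`top_magnitude_mask`, `union_masks`) and the toy weight lists are hypothetical, and plain Python lists stand in for flattened weight tensors.

```python
def top_magnitude_mask(weights, sparsity):
    """Keep-mask that removes the smallest-magnitude fraction of weights."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(weights))]


def union_masks(masks):
    """A weight is kept if any input mask keeps it (element-wise OR)."""
    return [any(column) for column in zip(*masks)]


# Hypothetical stand-ins: weights of the full model and of a model
# fine-tuned on a small data subset (assumption for illustration only).
w_full = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2]
w_subset = [0.8, 0.3, -0.02, 0.1, -0.6, 0.5]

mask_a = top_magnitude_mask(w_full, 0.5)
mask_b = top_magnitude_mask(w_subset, 0.5)
merged = union_masks([mask_a, mask_b])

# Fraction of weights the merged mask removes.
merged_sparsity = 1.0 - sum(merged) / len(merged)
```

One consequence worth noting: a union can only keep more weights than any single mask, so the merged mask here is less sparse than the 50% target of its inputs.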

Key Findings

One-shot magnitude pruning at 30% sparsity kept model accuracy high.

Numbers: Accuracy ≈ 79% at 30% sparsity (CIFAR-10)

Fine-tuning on attention-selected data patches converged much faster.

Numbers: Reached close to SOTA by epoch 5 (after fine-tuning on lottery data)

Instant Sparse Soup Pruning (ISSP) runs gave a large accuracy drop in these experiments.

Numbers: Accuracy ≈ 50% after essential sparsity pruning + ISP (reported)

Attention maps can produce data-level lottery tickets by removing low-attention patches.

Numbers: Example shown with 10% of patches removed
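
The patch-dropping idea in the last finding can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes one attention score per patch is already available (for example, attention from a class token averaged over heads). The function name `select_patches` and the toy scores are hypothetical.

```python
def select_patches(attn_scores, drop_fraction=0.10):
    """Return sorted indices of patches kept after dropping the
    lowest-attention fraction.

    attn_scores   : one attention score per patch (hypothetical input)
    drop_fraction : fraction of patches to discard; 0.10 mirrors the
                    10% example in the summary
    """
    n_drop = round(len(attn_scores) * drop_fraction)
    # Rank patches from lowest to highest attention.
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i])
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(attn_scores)) if i not in dropped]


# Toy example: 10 patches, drop the lowest 10% (one patch).
toy_scores = [0.12, 0.08, 0.15, 0.02, 0.11, 0.09, 0.14, 0.10, 0.13, 0.06]
kept = select_patches(toy_scores, drop_fraction=0.10)
```

Here the patch at index 3 has the lowest score, so it is the one removed. The remaining indices would then select which patch tokens are fed to the model during fine-tuning.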

Results

Accuracy (one-shot magnitude pruning, 30% sparsity)

≈79%

Epochs to near-SOTA (fine-tuning on lottery data)

≈5 epochs

Accuracy (after essential sparsity pruning + ISP)

≈50%

Data removed

10% of patches (example)

Who Should Care

What To Try In 7 Days

Take a pre-trained ViT and run one-shot magnitude pruning at 30%; measure validation accuracy.

Compute attention maps on a small holdout and drop lowest 10% patches; fine-tune on remaining patches and monitor validation by epoch 5.

If accuracy collapses with the Instant Sparse Soup pipeline, hold off on it until you have validated mask creation on more than 10% of the data.
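
The first step above, one-shot magnitude pruning at 30%, can be sketched in a few lines. This is a toy illustration under stated assumptions: a flat Python list stands in for a weight tensor, and `magnitude_prune` and `measured_sparsity` are hypothetical helper names rather than anything taken from the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights in one shot.

    weights  : flat list of floats (stand-in for a flattened weight tensor)
    sparsity : fraction of weights to remove, e.g. 0.3 for 30%
    Returns (pruned_weights, keep_mask).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = round(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    keep_mask = [i not in pruned_idx for i in range(len(weights))]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, keep_mask)]
    return pruned, keep_mask


def measured_sparsity(keep_mask):
    """Fraction of weights that the mask removes."""
    return 1.0 - sum(keep_mask) / len(keep_mask)


# Toy example: 10 weights, prune 30% (the 3 smallest magnitudes).
toy_weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, -0.8, 0.3]
toy_pruned, toy_mask = magnitude_prune(toy_weights, 0.3)
```

For a real model, PyTorch's `torch.nn.utils.prune` module (for example `l1_unstructured`) offers magnitude-based pruning. Whether to prune per layer or globally is a design choice the summary does not specify.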

Optimization Features

Model Optimization

  • magnitude pruning (one-shot)
  • Instant Soup style mask merging (ISP/ISSP)
  • two-step pruning: essential sparsity then ISP

Training Optimization

  • data lottery tickets via attention-based patch selection
  • fine-tune pruned model on reduced-data subset to reduce epochs

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments limited to a pre-trained ViT on CIFAR-10 (small dataset).
  • ISSP mask creation used only 10% of the data; the paper notes this likely harmed results.
  • No code, hyperparameters, or training-time numbers are provided in the text.
  • Claims framed qualitatively (e.g., 'close to SOTA') without numeric baselines in several places.

When Not To Use

  • Do not deploy the reported ISSP pipeline at scale until masks and pruning settings are validated.
  • Avoid aggressive pruning (>50%) without careful validation; the paper reports accuracy collapse.

Failure Modes

  • Aggressive pruning removes important weights and causes large accuracy drops (observed ~50% accuracy).
  • Denoised mask built from too little data (10%) may be unrepresentative and harm performance.
  • Attention sink (the first token dominating the attention map) unless normalized; the paper normalizes the first token to the mean to fix this.
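
The attention-sink fix in the last item can be read in more than one way, since the summary does not spell out the exact procedure. The sketch below takes one plausible reading as an assumption: replace the first token's attention value with the mean of the remaining tokens. The function name `damp_first_token` and the toy row are hypothetical.

```python
def damp_first_token(attn_row):
    """Replace the first token's attention with the mean of the others.

    attn_row : attention values over tokens for one query (hypothetical
               input; index 0 is the first token).
    Assumption: this is only one interpretation of "normalize the first
    token to the mean".
    """
    if len(attn_row) < 2:
        return list(attn_row)
    rest = list(attn_row[1:])
    mean_rest = sum(rest) / len(rest)
    return [mean_rest] + rest


# Toy example: the first token absorbs most of the attention mass.
toy_row = [0.70, 0.10, 0.05, 0.15]
damped = damp_first_token(toy_row)
```

After this step the row no longer sums to 1, so it would need renormalizing if a probability distribution is required downstream. For ranking patches by score, the raw damped values may be enough.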

Core Entities

Models

  • Vision Transformer (pre-trained ViT)

Metrics

  • Accuracy
  • eval loss
  • epochs to converge
  • sparsity level (%)

Datasets

  • CIFAR-10

Context Entities

Models

  • pretrained transformer

Datasets

  • small-image benchmarks (CIFAR-10 used here)