Pruning Transformers for time-series: big FLOP drops but small real speedups; fine-tune and right-size models

December 17, 2024 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

1

Authors

Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus

Links

Abstract / PDF

Why It Matters For Business

Pruning cuts model size and theoretical compute but doesn't guarantee runtime speedups; measuring on your hardware and considering smaller architectures first saves cost and deployment time.

Summary TLDR

This paper benchmarks unstructured (magnitude) and structured (DepGraph) pruning on five Transformer-based multivariate time-series models across standard datasets. Key takeaways: you can usually prune ~50% of weights with little loss; Autoformer and FEDformer tolerate extreme pruning (down to ~1% of original params) on evaluated data; structured pruning reduces FLOPs strongly (up to ~7.6×) but yields little wall‑clock speedup on tested hardware; fine-tuning after pruning helps but results vary by model. Also, smaller models often outperform oversized ones on small datasets, so pick model size carefully.

Problem Statement

Transformer-based time-series models are growing large and costly. It is unclear how well common pruning methods (unstructured magnitude pruning and structured DepGraph node pruning) reduce model size and runtime, and how they affect predictive error, for state-of-the-art time-series Transformers in practical settings.

Main Contribution

Trains and prunes five Transformer-based time-series models (Transformer, Informer, Autoformer, FEDformer, Crossformer) on multiple public datasets and horizons.

Compares unstructured magnitude pruning and structured pruning (torch-pruning / DepGraph) on predictive loss, parameter density, FLOPs and measured inference time.

Measures effect of fine-tuning after pruning, experiments with reduced model sizes, and studies pruning on a very large dataset (ENTSO-E).

Key Findings

Most models sustain pruning to about 50% density with little test loss increase.

Numbers: ≈50% density without significant MSE rise (Fig. 1, Sec. 4.1).

Autoformer and FEDformer remain competitive even when pruned to very high sparsity.

Numbers: Competitive loss down to ~1% density for Autoformer/FEDformer (Sec. 4.1).

Structured pruning can reduce theoretical FLOPs a lot but gives small real speed gains on tested hardware.

Numbers: Informer FLOP reduction 7.63×, measured speedup 1.51×; others show ≤1.31× (Table 3).
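
One way to read these two numbers together is an Amdahl's-law-style model. The sketch below assumes that only a fraction `p` of inference time scales with FLOPs while the rest is fixed overhead. This is an illustrative assumption, not an analysis from the paper, and the function names are hypothetical. It solves for the `p` implied by Informer's reported FLOP reduction and measured speedup.

```python
def overall_speedup(p: float, s: float) -> float:
    """Amdahl's law: a fraction p of runtime is sped up by factor s, the rest is unchanged."""
    return 1.0 / ((1.0 - p) + p / s)


def implied_fraction(measured: float, s: float) -> float:
    """Invert Amdahl's law: which p would explain a measured overall speedup?"""
    return (1.0 - 1.0 / measured) / (1.0 - 1.0 / s)


flop_reduction = 7.63  # Informer, Table 3
measured = 1.51        # Informer, Table 3

# Assumption: runtime splits cleanly into a FLOP-bound part and a fixed part.
p = implied_fraction(measured, flop_reduction)
print(f"implied FLOP-bound fraction of runtime: {p:.2f}")
print(f"speedup ceiling if that part cost nothing: {1.0 / (1.0 - p):.2f}x")
```

Under this simplified model, roughly 39% of the runtime would be FLOP-bound, so even removing all of those FLOPs would cap the speedup at about 1.64×. The real split depends on kernels, memory traffic, and launch overhead, so treat the result as a reading aid only.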

Fine-tuning after pruning recovers accuracy, but how much it recovers varies across models.

Numbers: Crossformer: trained MSE 0.3456 → pruned 0.8124 → fine-tuned 0.3502 (Table 4).

Smaller models often outperform large ones on small datasets; resizing beats pruning in some cases.

Numbers: Transformer on ETTm2: small MSE 0.3099 vs large 0.5020 (≈38% relative improvement) (Table 5).

Results

Safe pruning baseline

Value: Many models maintain test MSE at ≈50% density

Baseline: unpruned

Extreme pruning tolerance (Auto/FED)

Value: Comparable loss at ~1% density

Baseline: other models at higher density

Structured pruning: FLOP reduction

Value: Informer FLOP reduction 7.63×

Baseline: unpruned

Structured pruning: measured speedup

Value: Informer speedup 1.51×; others ≤1.31×, FEDformer slower (0.98×)

Baseline: unpruned

Fine-tuning effect

Value: Crossformer recovers to near-original MSE after fine-tune

Baseline: trained→pruned

Right-sizing vs pruning

Value: Small Transformer on ETTm2 MSE 0.3099 vs large 0.5020

Baseline: large model

Who Should Care

What To Try In 7 Days

Apply unstructured magnitude pruning to 50% density on your model; validate test MSE.

Fine-tune pruned model for a few epochs and compare accuracy vs baseline.

Benchmark structured pruning (DepGraph) to measure FLOP and real runtime change on your target GPU and inference stack.
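
For the runtime comparison in the last step, a minimal timing harness might look like the sketch below. It uses only the Python standard library. `baseline_forward` and `pruned_forward` are hypothetical stand-ins for your model's forward passes, and the warm-up and repeat counts are arbitrary defaults. When timing on a GPU, the device has to be synchronized before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch), because kernel launches are asynchronous.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 5, repeats: int = 30) -> float:
    """Median wall-clock seconds per call, measured after untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def measured_speedup(baseline: Callable[[], object],
                     pruned: Callable[[], object],
                     warmup: int = 5, repeats: int = 30) -> float:
    """Ratio of baseline latency to pruned latency (values above 1 mean faster)."""
    return (median_latency(baseline, warmup, repeats)
            / median_latency(pruned, warmup, repeats))


# Hypothetical stand-ins for model forward passes; replace with real inference calls.
def baseline_forward() -> int:
    return sum(i * i for i in range(20000))


def pruned_forward() -> int:
    return sum(i * i for i in range(10000))


print(f"measured speedup: {measured_speedup(baseline_forward, pruned_forward):.2f}x")
```

Reporting the median rather than the mean keeps a few slow outlier runs from dominating the comparison.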

Optimization Features

Infra Optimization

  • Measured on NVIDIA A100/H100; runtime gains depend on CUDA kernels

Model Optimization

  • Unstructured magnitude pruning (masking weights)
  • Structured node pruning via DepGraph (torch-pruning)
  • Reduce linear embedding sizes (right-size model)
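
As an illustration of mask-based magnitude pruning, here is a minimal plain-Python sketch on a single weight matrix. The helper names are hypothetical, and this summary does not say whether the paper thresholds per layer or globally, so the single-matrix setup is an assumption. In PyTorch, the `torch.nn.utils.prune` module offers mask-based utilities of this kind.

```python
def magnitude_mask(weights, density):
    """Return a 0/1 mask keeping the `density` fraction of largest-magnitude weights.

    Ties at the threshold are all kept, so the result can be slightly denser than asked.
    """
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    keep = round(density * len(flat))
    if keep == 0:
        return [[0 for _ in row] for row in weights]
    threshold = flat[keep - 1]
    return [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]


def apply_mask(weights, mask):
    """Zero out masked weights; the matrix keeps its dense shape."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


def density_of(mask):
    """Parameter density = kept weights / total weights (sparsity = 1 - density)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Toy weight matrix, made up for illustration.
W = [[0.9, -0.1, 0.4, 0.05],
     [-0.7, 0.2, -0.03, 0.6]]

mask = magnitude_mask(W, density=0.5)
W_pruned = apply_mask(W, mask)
print(mask)
print(density_of(mask))
```

The pruned matrix has the same shape as the original, which is why masking alone changes neither memory layout nor dense-kernel runtime.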

System Optimization

  • Need hardware/software support for sparse kernels to get real speedups

Training Optimization

  • Train larger model then prune and fine-tune
  • Accuracy

Inference Optimization

  • Structured pruning reduces FLOPs but does not guarantee runtime gains
  • TensorRT compilation failed for published implementations

Reproducibility

Code Urls

  • to be released upon publication (authors state code will be made public)

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Unstructured sparsity was applied only as masks; no specialized sparse kernels were used, so there are no native speedups.
  • DepGraph structured pruning failed to reach target sparsities for some models and datasets.
  • TensorRT compilation for faster inference failed on published implementations, which blocked deployment testing.
  • Experiments limited to five Transformer variants and selected public datasets; results may differ for other models or domains.
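
The first limitation can be made concrete with a toy count of multiplications. The sketch below is plain Python with hypothetical helper names, not the paper's code. It compares a dense matrix-vector product over a masked matrix with a product over a sparse row representation that stores only the non-zero weights.

```python
def dense_matvec(W, x):
    """Dense product; returns (result, number of multiplications performed)."""
    mults = 0
    y = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi  # zeros are multiplied like any other weight
            mults += 1
        y.append(acc)
    return y, mults


def to_sparse_rows(W):
    """Keep only (column index, value) pairs for non-zero weights."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in W]


def sparse_matvec(rows, x):
    """Product over stored non-zeros only; returns (result, multiplications)."""
    mults = 0
    y = []
    for row in rows:
        acc = 0.0
        for j, w in row:
            acc += w * x[j]
            mults += 1
        y.append(acc)
    return y, mults


# A toy matrix at 50% density, as a mask would leave it.
W_masked = [[0.9, 0.0, 0.4, 0.0],
            [-0.7, 0.0, 0.0, 0.6]]
x = [1.0, 2.0, 3.0, 4.0]

y_dense, dense_mults = dense_matvec(W_masked, x)
y_sparse, sparse_mults = sparse_matvec(to_sparse_rows(W_masked), x)
print(dense_mults, sparse_mults)
```

Both paths produce the same output, but only the sparse representation skips work. Real sparse kernels also pay indexing and memory-access overhead, so fewer multiplications still do not guarantee a faster run.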

When Not To Use

  • If your inference stack lacks sparse kernel or compiler support, unstructured pruning won't speed up runtime.
  • If model code cannot be recompiled or simplified for TensorRT, structured pruning may not yield runtime gains.
  • For tiny datasets where right-sizing the model is cheaper and more stable than pruning.

Failure Modes

  • Exploding gradients or out-of-memory errors at extreme sparsities during training.
  • DepGraph pruner producing unexpectedly lower-than-target sparsity because of dependency graph constraints.
  • FLOP reduction without wall-clock speedup due to kernel and non-linear-module bottlenecks.
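
The dependency constraint behind the second failure mode can be illustrated on a toy two-layer network: removing a hidden unit means deleting a row of the first weight matrix, the matching bias entry, and the matching column of the second matrix together. The sketch below is a hypothetical plain-Python simplification, not DepGraph or the torch-pruning API, and ranking units by the L2 norm of their incoming weights is an assumed importance criterion.

```python
def l2_norm(row):
    """Euclidean norm of one weight row."""
    return sum(w * w for w in row) ** 0.5


def prune_hidden_units(W1, b1, W2, drop):
    """Remove hidden units listed in `drop` from a two-layer MLP.

    W1 is hidden x in, b1 has one entry per hidden unit, W2 is out x hidden.
    All three tensors must be cut consistently or the shapes no longer match.
    """
    keep = [k for k in range(len(W1)) if k not in drop]
    W1_p = [W1[k] for k in keep]                    # drop rows of W1
    b1_p = [b1[k] for k in keep]                    # drop bias entries
    W2_p = [[row[k] for k in keep] for row in W2]   # drop columns of W2
    return W1_p, b1_p, W2_p


def count_params(W1, b1, W2):
    return sum(len(r) for r in W1) + len(b1) + sum(len(r) for r in W2)


# Toy parameters, made up for illustration.
W1 = [[0.5, -0.2], [0.01, 0.02], [-0.4, 0.3]]
b1 = [0.1, 0.0, -0.1]
W2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Assumed criterion: drop the hidden unit with the smallest incoming-weight norm.
weakest = min(range(len(W1)), key=lambda k: l2_norm(W1[k]))
W1_p, b1_p, W2_p = prune_hidden_units(W1, b1, W2, drop={weakest})
print(count_params(W1, b1, W2), "->", count_params(W1_p, b1_p, W2_p))
```

Because parameters can only be removed in such coupled groups, a structured pruner cannot hit an arbitrary target sparsity, and dimensions shared across many layers restrict the choice further.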

Core Entities

Models

  • Transformer
  • Informer
  • Autoformer
  • FEDformer
  • Crossformer

Metrics

  • MSE
  • Parameter density
  • Sparsity
  • FLOP reduction
  • Inference speedup
  • Epoch time

Datasets

  • ETTm1
  • ETTm2
  • ECL
  • Exchange
  • Traffic
  • Weather
  • ENTSO-E