Pruning Transformers for time-series: big FLOP drops but small real speedups; fine-tune and right-size models

December 17, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

An empirical benchmark across five models and multiple datasets shows repeatable trends. Results are strong on FLOP and density metrics but hardware-dependent for runtime. The authors state that code will be released, which would aid reproducibility.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Data available; code to be released upon publication

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus

Why It Matters For Business

Pruning cuts model size and theoretical compute but doesn't guarantee runtime speedups; measuring on your hardware and considering smaller architectures first saves cost and deployment time.

Summary TLDR

This paper benchmarks unstructured (magnitude) and structured (DepGraph) pruning on five Transformer-based multivariate time-series models across standard datasets. Key takeaways: you can usually prune ~50% of weights with little loss; Autoformer and FEDformer tolerate extreme pruning (down to ~1% of original params) on evaluated data; structured pruning reduces FLOPs strongly (up to ~7.6×) but yields little wall‑clock speedup on tested hardware; fine-tuning after pruning helps but results vary by model. Also, smaller models often outperform oversized ones on small datasets, so pick model size carefully.

Problem Statement

Transformer-based time-series models are growing large and costly. It is unclear how well common pruning methods (unstructured magnitude pruning and structured DepGraph node pruning) reduce model size, runtime, and predictive error for state-of-the-art time-series Transformers in practical settings.

Main Contribution

Trains and prunes five Transformer-based time-series models (Transformer, Informer, Autoformer, FEDformer, Crossformer) on multiple public datasets and horizons.

Compares unstructured magnitude pruning and structured pruning (torch-pruning / DepGraph) on predictive loss, parameter density, FLOPs and measured inference time.

Key Findings

Most models sustain pruning to about 50% density with little test loss increase.

Numbers: ≈50% density without significant MSE rise (Fig.1, Sec.4.1).

Practical Use: Try a 2× parameter reduction (mask, then fine-tune) as a low-risk first step to cut model size.

Evidence Ref: Sec.4.1, Fig.1
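The masking step behind this finding can be sketched without any framework. The snippet below is a minimal illustration, not the authors' code: it keeps the largest-magnitude fraction of a flat weight list and zeroes the rest. The names `magnitude_mask` and `apply_mask` and the sample weights are hypothetical.

```python
def magnitude_mask(weights, density):
    """Return a 0/1 mask that keeps the `density` fraction of largest-|w| weights."""
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be in [0, 1]")
    keep = round(len(weights) * density)
    # Rank indices by absolute value, largest first (sorted() is stable on ties).
    ranked = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out pruned weights; the list keeps its original (dense) shape."""
    return [w * m for w, m in zip(weights, mask)]


weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.2]  # hypothetical layer weights
mask = magnitude_mask(weights, density=0.5)
pruned = apply_mask(weights, mask)
```

Because the pruned weights are only set to zero, the tensor stays dense and a standard matrix multiply does the same amount of work. In PyTorch, `torch.nn.utils.prune.l1_unstructured` provides a comparable masking utility; whether the paper used it is not stated in this summary.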

Autoformer and FEDformer remain competitive even when pruned to very high sparsity.

Numbers: Competitive loss down to ~1% density for Autoformer/FEDformer (Sec.4.1).

Practical Use: If you use Autoformer/FEDformer, you can test aggressive pruning for deployment, but validate stability and fine-tune afterward.

Evidence Ref: Sec.4.1, Fig.1
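Fine-tuning after pruning only helps if the pruned weights stay at zero during the extra training steps. The toy example below sketches that idea on a two-weight linear model with plain SGD. It is an assumption-laden illustration (hypothetical data, learning rate, and function names), not the training procedure from the paper.

```python
def mse(w, data):
    """Mean squared error of the linear model w on (x, y) pairs."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data) / len(data)


def finetune_masked(weights, mask, data, lr=0.1, epochs=50):
    """SGD on squared error where masked (pruned) weights never receive updates."""
    w = [wi * mi for wi, mi in zip(weights, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Gradient of 0.5 * err**2 w.r.t. w_i is err * x_i;
            # multiplying by the mask keeps pruned weights at zero.
            w = [wi - lr * err * xi * mi for wi, xi, mi in zip(w, x, mask)]
    return w


# Hypothetical data in which only the first feature matters (y = 2 * x0).
data = [([1.0, 0.5], 2.0), ([0.5, 1.0], 1.0), ([2.0, 0.0], 4.0)]
mask = [1, 0]                                   # second weight was pruned
before = [w * m for w, m in zip([1.5, 0.3], mask)]
after = finetune_masked([1.5, 0.3], mask, data)
```

Comparing `mse(before, data)` with `mse(after, data)` gives the same before/after check the summary recommends running against the unpruned baseline.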

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Safe pruning baseline | Many models maintain test MSE at ≈50% density | Unpruned | ≈2× parameter reduction | Multiple datasets (ETT, ECL, Traffic, Weather, ENTSO-E) | Fig.1, Sec.4.1 | Fig.1
Extreme pruning tolerance (Auto/FED) | Comparable loss at ~1% density | Other models at higher density | Down to 1% params | Various small datasets (reported in Sec.4.1) | Sec.4.1, Fig.1 | Sec.4.1
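The Delta column follows directly from the density figures: the parameter reduction factor is the reciprocal of the density. A small helper makes the arithmetic explicit. The 10-million-parameter model size is a hypothetical number chosen for illustration, not a figure from the paper.

```python
def reduction_factor(density):
    """Parameter reduction factor implied by keeping `density` of the weights."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    return 1.0 / density


def remaining_params(total_params, density):
    """Number of weights left after pruning to the given density."""
    return round(total_params * density)


total = 10_000_000  # hypothetical model size
half = (reduction_factor(0.5), remaining_params(total, 0.5))
extreme = (reduction_factor(0.01), remaining_params(total, 0.01))
```

At 50% density the factor is 2×, and at 1% density it is roughly 100×, which is why the Autoformer/FEDformer result stands out.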

What To Try In 7 Days

Apply unstructured magnitude pruning to 50% density on your model; validate test MSE.

Fine-tune pruned model for a few epochs and compare accuracy vs baseline.

Benchmark structured pruning (DepGraph) to measure FLOP and real runtime change on your target GPU and inference stack.
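For the runtime step, a small timing harness is enough to compare a baseline and a pruned model on your own stack. The sketch below uses only the standard library and times arbitrary callables. The two stand-in workloads are placeholders; in practice each callable would run one forward pass of your model.

```python
import statistics
import time


def median_latency(fn, warmup=3, repeats=20):
    """Median wall-clock seconds per call, after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, pruned_fn, **kwargs):
    """Measured speedup: values above 1.0 mean the pruned callable is faster."""
    return median_latency(baseline_fn, **kwargs) / median_latency(pruned_fn, **kwargs)


# Placeholder workloads standing in for model forward passes.
def dense_stand_in():
    return sum(i * i for i in range(20_000))


def pruned_stand_in():
    return sum(i * i for i in range(10_000))


measured = speedup(dense_stand_in, pruned_stand_in)
```

On a GPU, kernel launches are asynchronous, so each callable should block until the device has finished (in PyTorch, by calling `torch.cuda.synchronize()`) before the timer stops. Compare `measured` against the theoretical FLOP ratio to see how much of the reduction actually shows up as latency.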

Optimization Features

Infra Optimization
Measured on NVIDIA A100/H100; runtime gains depend on CUDA kernels
Model Optimization
Unstructured magnitude pruning (masking weights)
Structured node pruning via DepGraph (torch-pruning)
Reduce linear embedding sizes (right-size model)
System Optimization
Need hardware/software support for sparse kernels to get real speedups
Training Optimization
Train a larger model, then prune and fine-tune
Accuracy
Inference Optimization
Structured pruning reduces FLOPs but does not guarantee runtime gains
TensorRT compilation failed for published implementations
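To see why structured pruning can cut FLOPs sharply, it helps to count the multiply-adds of a stack of linear layers before and after narrowing it. The sketch below is a back-of-the-envelope estimate under simple assumptions (dense linear layers only, one multiply and one add per weight). The layer widths are hypothetical and not taken from any of the benchmarked models.

```python
def linear_flops(in_features, out_features):
    """FLOPs for one input vector: one multiply and one add per weight."""
    return 2 * in_features * out_features


def mlp_flops(widths):
    """Total FLOPs for consecutive linear layers with the given widths."""
    return sum(linear_flops(a, b) for a, b in zip(widths, widths[1:]))


dense_widths = [512, 2048, 512]    # hypothetical unpruned block
pruned_widths = [256, 1024, 256]   # every width halved by node pruning

flop_ratio = mlp_flops(dense_widths) / mlp_flops(pruned_widths)
```

Halving every width quarters the FLOP count, because each layer's cost is the product of its input and output sizes. This estimate counts arithmetic only. Memory traffic and kernel launch overhead are not modeled, which is one reason a FLOP ratio can overstate the speedup you measure.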

Reproducibility

Code Available: Not yet (authors state it will be released upon publication)
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

To be released upon publication (the authors state the code will be made public)

Data URLs

https://transparency.entsoe.eu (ENTSO-E)
Datasets from the Autoformer/Informer source repositories (public)

Risks & Boundaries

Limitations

Unstructured sparsity is only masked; no specialized sparse kernels are used, so there are no native speedups.

DepGraph structured pruning failed to reach target sparsities for some models and datasets.

When Not To Use

If your inference stack lacks sparse kernel or compiler support, unstructured pruning won’t speed up runtime.

If model code cannot be recompiled or simplified for TensorRT, structured pruning may not yield runtime gains.

Failure Modes

Exploding gradients or out-of-memory errors at extreme sparsities during training.

DepGraph pruner producing unexpectedly lower-than-target sparsity because of dependency-graph constraints.
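The last failure mode is easy to miss unless the achieved sparsity is checked after pruning. A minimal guard, assuming you can count parameters before and after, might look like the following. The function names, the 2-point tolerance, and the example counts are all hypothetical.

```python
def achieved_sparsity(params_before, params_after):
    """Fraction of parameters removed by pruning."""
    if params_before <= 0:
        raise ValueError("params_before must be positive")
    return 1.0 - params_after / params_before


def check_sparsity(target, params_before, params_after, tolerance=0.02):
    """Return (achieved, ok); ok is False when the pruner undershoots the target."""
    achieved = achieved_sparsity(params_before, params_after)
    return achieved, achieved >= target - tolerance


# Hypothetical run: 90% sparsity requested, but 350k of 1M parameters remain.
achieved, ok = check_sparsity(target=0.9, params_before=1_000_000, params_after=350_000)
```

In PyTorch, the parameter counts can be obtained with `sum(p.numel() for p in model.parameters())` before and after the pruner runs. Logging `achieved` next to `target` in every experiment makes undershooting visible instead of silently skewing comparisons.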

Core Entities

Models

Transformer, Informer, Autoformer, FEDformer, Crossformer

Metrics

MSE, Parameter density, Sparsity, FLOP reduction, Inference speedup, Epoch time

Datasets

ETTm1, ETTm2, ECL, Exchange, Traffic, Weather, ENTSO-E