Pruning Transformers for time-series: big FLOP drops but small real speedups; fine-tune and right-size models

Overview

Decision Snapshot: Needs Validation

Empirical benchmark across five models and multiple datasets shows repeatable trends. Results are strong on FLOP and density metrics but hardware-dependent for runtime. The authors state that code will be released, which would aid reproducibility.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Data available; code to be released upon publication

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 40%

Authors

Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Pruning cuts model size and theoretical compute but doesn't guarantee runtime speedups; measuring on your hardware and considering smaller architectures first saves cost and deployment time.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Founder

Summary TLDR

This paper benchmarks unstructured (magnitude) and structured (DepGraph) pruning on five Transformer-based multivariate time-series models across standard datasets. Key takeaways: you can usually prune ~50% of weights with little loss; Autoformer and FEDformer tolerate extreme pruning (down to ~1% of original params) on evaluated data; structured pruning reduces FLOPs strongly (up to ~7.6×) but yields little wall‑clock speedup on tested hardware; fine-tuning after pruning helps but results vary by model. Also, smaller models often outperform oversized ones on small datasets, so pick model size carefully.

Problem Statement

Transformer-based time-series models are growing large and costly. It is unclear how well common pruning methods (unstructured magnitude pruning and structured DepGraph node pruning) reduce model size, runtime, and predictive error for state-of-the-art time-series Transformers in practical settings.

Main Contribution

Trains and prunes five Transformer-based time-series models (Transformer, Informer, Autoformer, FEDformer, Crossformer) on multiple public datasets and horizons.

Compares unstructured magnitude pruning and structured pruning (torch-pruning / DepGraph) on predictive loss, parameter density, FLOPs and measured inference time.

Key Findings

Most models sustain pruning to about 50% density with little test loss increase.

Numbers: ≈50% density without significant MSE rise (Fig. 1, Sec. 4.1).

Practical Use: Try an immediate 2× parameter reduction (mask + fine-tune) as a low-risk first step to cut model size.

Evidence Ref: Sec. 4.1, Fig. 1
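The mask step behind this finding can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code: the function names `magnitude_mask` and `apply_mask` and the example weights are hypothetical, and a real model would use a tensor library's pruning utilities rather than Python lists.

```python
def magnitude_mask(weights, density):
    """Return a 0/1 mask keeping the `density` fraction of largest-|w| entries."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    keep = max(1, round(len(weights) * density))
    # Rank indices by absolute value, largest first.
    # Python's sort is stable, so tied weights keep their original order.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out pruned weights; the list length (storage) is unchanged."""
    return [w * m for w, m in zip(weights, mask)]


# Hypothetical example weights, chosen only for illustration.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.02]
mask = magnitude_mask(weights, density=0.5)
pruned = apply_mask(weights, mask)
```

With these values the four smallest-magnitude weights are zeroed, giving 50% density. During fine-tuning the same mask would need to be reapplied after each update so that pruned weights stay at zero.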

Autoformer and FEDformer remain competitive even when pruned to very high sparsity.

Numbers: Competitive loss down to ~1% density for Autoformer/FEDformer (Sec. 4.1).

Practical Use: If you use Autoformer/FEDformer, you can test aggressive pruning for deployment, but validate stability and fine-tune afterward.

Evidence Ref: Sec. 4.1, Fig. 1
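One way to test how far a given model tolerates pruning is to sweep densities and pick the lowest one whose loss stays close to the unpruned baseline. The sketch below assumes you have already measured test MSE at each density; the function name `lowest_safe_density`, the 5% tolerance, and all loss values are made-up placeholders, not results from the paper.

```python
def lowest_safe_density(losses_by_density, baseline_loss, rel_tolerance=0.05):
    """Smallest density whose loss is within `rel_tolerance` of the baseline.

    Returns None when no density meets the tolerance.
    """
    limit = baseline_loss * (1.0 + rel_tolerance)
    safe = [d for d, loss in losses_by_density.items() if loss <= limit]
    return min(safe) if safe else None


# Placeholder measurements (density -> test MSE); replace with your own.
measured = {1.0: 0.40, 0.5: 0.405, 0.1: 0.41, 0.01: 0.55}
choice = lowest_safe_density(measured, baseline_loss=0.40)
```

In this placeholder sweep the function selects a density of 0.1, because the loss at 0.01 exceeds the tolerance. The tolerance is a judgment call and should reflect what error increase your application can accept.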

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Safe pruning baseline	Many models maintain test MSE at ≈50% density	unpruned	≈2× parameter reduction	Multiple datasets (ETT, ECL, Traffic, Weather, ENTSO-E)	Fig.1, Sec.4.1	Fig.1
Extreme pruning tolerance (Auto/FED)	Comparable loss at ~1% density	other models at higher density	down to 1% params	Various small datasets (reported in Sec.4.1)	Sec.4.1, Fig.1	Sec.4.1
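The density figures in the table translate into parameter reductions by simple arithmetic. The summary does not define the terms, so the relations below assume the conventional definitions, with density as the fraction of parameters kept:

```latex
\[
d = \frac{N_{\text{kept}}}{N_{\text{total}}}, \qquad
s = 1 - d, \qquad
\text{parameter reduction} = \frac{N_{\text{total}}}{N_{\text{kept}}} = \frac{1}{d}
\]
\[
d = 0.5 \;\Rightarrow\; 2\times, \qquad
d = 0.01 \;\Rightarrow\; 100\times
\]
```

Here $d$ is density and $s$ is sparsity. Under these definitions, ≈50% density corresponds to the ≈2× reduction in the first row, and ~1% density to roughly a 100× reduction in parameter count.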

What To Try In 7 Days

Apply unstructured magnitude pruning to 50% density on your model; validate test MSE.

Fine-tune pruned model for a few epochs and compare accuracy vs baseline.

Benchmark structured pruning (DepGraph) to measure FLOP and real runtime change on your target GPU and inference stack.
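For the benchmarking step, a small timing harness makes the gap between FLOP reduction and measured speedup explicit. This is a sketch under stated assumptions: `median_latency`, `speedup_report`, and the two stand-in workloads are hypothetical names, and the stand-ins only imitate a dense and a pruned forward pass.

```python
import statistics
import time


def median_latency(fn, warmup=5, repeats=30):
    """Median wall-clock seconds per call of `fn`, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_report(baseline_s, pruned_s, flop_reduction):
    """Compare measured speedup against the theoretical FLOP reduction."""
    measured = baseline_s / pruned_s
    return {
        "measured_speedup": measured,
        "flop_reduction": flop_reduction,
        # Share of the theoretical gain that appears as wall-clock time.
        "realized_fraction": measured / flop_reduction,
    }


# Stand-in workloads (hypothetical); swap in your dense and pruned models.
def dense_forward():
    return sum(i * i for i in range(20000))


def pruned_forward():
    return sum(i * i for i in range(10000))


report = speedup_report(
    median_latency(dense_forward),
    median_latency(pruned_forward),
    flop_reduction=2.0,
)
```

A `realized_fraction` well below 1 indicates that the FLOP savings are not reaching wall-clock time on that hardware. When timing GPU inference, the callable passed in should wait for the device to finish (GPU kernel launches are asynchronous), otherwise the measured times will be too short.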

Optimization Features

Infra Optimization

Measured on NVIDIA A100/H100; runtime gains depend on CUDA kernels

Model Optimization

Unstructured magnitude pruning (masking weights)

Structured node pruning via DepGraph (torch-pruning)

Reduce linear embedding sizes (right-size model)
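To see why removing whole nodes or shrinking layer widths lowers parameter and FLOP counts, it helps to count them for a single feed-forward block. The sketch below is illustrative only: the helper names are hypothetical, the sizes (512, 2048, 96 tokens) are assumed rather than taken from the paper, and the FLOP formula is the usual approximation of two operations per weight per token.

```python
def ffn_params(d_model, d_hidden):
    """Weights and biases of a block d_model -> d_hidden -> d_model."""
    first = d_model * d_hidden + d_hidden
    second = d_hidden * d_model + d_model
    return first + second


def ffn_flops(d_model, d_hidden, tokens):
    """Approximate FLOPs: 2 per weight per token; biases and activations ignored."""
    return 2 * tokens * (d_model * d_hidden + d_hidden * d_model)


# Assumed sizes for illustration: hidden width cut from 2048 to 512 nodes.
dense_flops = ffn_flops(512, 2048, tokens=96)
pruned_flops = ffn_flops(512, 512, tokens=96)
flop_reduction = dense_flops / pruned_flops

dense_params = ffn_params(512, 2048)
pruned_params = ffn_params(512, 512)
```

Removing 75% of the hidden nodes gives a 4× FLOP reduction for this block because the weight matrices physically shrink. Masking individual weights leaves the matrix shapes, and therefore the dense FLOP count, unchanged.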

System Optimization

Need hardware/software support for sparse kernels to get real speedups
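The reason sparse-kernel support matters can be shown by counting multiplications. In this pure-Python sketch (all names and values are hypothetical), a dense routine multiplies every stored weight, zeros included, while a routine working on a nonzero-only representation skips them.

```python
def dense_matvec(rows, x):
    """Dense product: every stored weight is multiplied, zeros included."""
    mults = 0
    out = []
    for row in rows:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi
            mults += 1
        out.append(acc)
    return out, mults


def to_sparse(rows):
    """Keep only (column, value) pairs of nonzero weights, row by row."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in rows]


def sparse_matvec(sparse_rows, x):
    """Sparse product: only nonzero weights are multiplied."""
    mults = 0
    out = []
    for row in sparse_rows:
        acc = 0.0
        for j, w in row:
            acc += w * x[j]
            mults += 1
        out.append(acc)
    return out, mults


# A masked weight matrix at 50% density (illustrative values).
masked = [[0.8, 0.0, 0.3, 0.0], [0.0, -0.9, 0.0, 0.4]]
x = [1.0, 2.0, 3.0, 4.0]
dense_out, dense_mults = dense_matvec(masked, x)
sparse_out, sparse_mults = sparse_matvec(to_sparse(masked), x)
```

Both routines produce the same output, but the dense one performs 8 multiplications against 4 for the sparse one. Real sparse kernels also pay indexing and memory-access overhead, so halving the multiplications does not by itself guarantee a 2× speedup.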

Training Optimization

Train a larger model, then prune and fine-tune

Accuracy

Inference Optimization

Structured pruning reduces FLOPs but does not guarantee runtime gains

TensorRT compilation failed for published implementations

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

To be released upon publication (the authors state that the code will be made public)

Data URLs

https://transparency.entsoe.eu (ENTSO-E)

Datasets from the Autoformer/Informer source repositories (public)

Risks & Boundaries

Limitations

Unstructured sparsity was applied only as weight masks; no specialized sparse kernels were used, so it produced no native speedup.

DepGraph structured pruning failed to reach target sparsities for some models and datasets.

When Not To Use

If your inference stack lacks sparse-kernel or compiler support, unstructured pruning won't reduce runtime.

If model code cannot be recompiled or simplified for TensorRT, structured pruning may not yield runtime gains.

Failure Modes

Exploding gradients or out-of-memory errors at extreme sparsities during training.

DepGraph pruner producing unexpectedly lower-than-target sparsity because of dependency graph constraints.
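Because a structured pruner may stop short of the requested sparsity, it is worth checking the achieved value after every pruning run. The following sketch assumes sparsity is measured as the fraction of parameters removed; the helper names and the 2-point tolerance are arbitrary, hypothetical choices.

```python
def achieved_sparsity(params_before, params_after):
    """Fraction of parameters removed by pruning."""
    return 1.0 - params_after / params_before


def check_target(params_before, params_after, target_sparsity, tolerance=0.02):
    """Report whether pruning reached the target sparsity within a tolerance."""
    achieved = achieved_sparsity(params_before, params_after)
    return {
        "achieved": achieved,
        "target": target_sparsity,
        "shortfall": max(0.0, target_sparsity - achieved),
        "ok": achieved >= target_sparsity - tolerance,
    }


# Illustrative counts: 90% sparsity requested, only 60% reached.
result = check_target(1_000_000, 400_000, target_sparsity=0.9)
```

A failed check like this one is a cue to report the achieved sparsity alongside the results instead of the requested one, so comparisons across models stay honest.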

Core Entities

Models

Transformer, Informer, Autoformer, FEDformer, Crossformer

Metrics

MSE, Parameter density, Sparsity, FLOP reduction, Inference speedup, Epoch time

Datasets

ETTm1, ETTm2, ECL, Exchange, Traffic, Weather, ENTSO-E
