Factor transformer weight matrices into a small dense basis and sparse per-row coefficients to get stronger compression than low-rank factorization

December 20, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent gains on the GLUE and SQuAD dev sets and is practical to train on GPU clusters. However, end-to-end wall-clock speedups on real hardware are not shown, and results are task-specific.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Rahul Chand, Yashoteja Prabhu, Pratyush Kumar

Links

Abstract / PDF / Data

Why It Matters For Business

DSFormer reduces transformer model size substantially (2x–3.6x) while keeping accuracy close to original models and can be stacked with distillation/quantization to cut hosting or edge deployment costs further.

Summary TLDR

DSFormer compresses Transformer weight matrices by splitting each block into a small dense basis and a semi-structured sparse coefficient matrix. This "dense-sparse" factorization fits transformer weight geometry better than classic low-rank SVD, and it is trained with a Straight-Through Factorizer (STF) optimizer that refines the discrete sparsity pattern during fine-tuning. On the GLUE and SQuAD dev sets, DSFormer reaches practical compression points (2x, 2.8x, 3.57x) while keeping accuracy close to BERTBASE (within ~1% at 2.8x) and outperforming several low-rank and semi-structured baselines. The method is orthogonal to distillation and quantization and can be stacked with them to cut model size further.

Problem Statement

Large transformer models are expensive to run. Low-rank factorization is easy to apply but often too restrictive for transformer weights and yields poor compression–accuracy trade-offs. We need a compact factorization that better matches real weight structure and can be learned in a task-aware way.

Main Contribution

DSFormer: a dense–sparse block factorization that represents each weight block with a small dense basis and sparse per-row coefficients.

STF (Straight-Through Factorizer): an efficient training trick that alternates a cheap continuous update for the dense basis with fresh sparse solves per forward pass to learn factorization end-to-end.
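The two contributions above can be illustrated with a minimal NumPy sketch of a dense-sparse factorization W ≈ S·D for a single weight block. This is an illustration under stated assumptions, not the authors' implementation: the orientation (S as per-row sparse coefficients, D as the dense basis) is inferred from the "S·(D·X)" description in this summary, the basis is initialized from randomly chosen rows of W (an assumption), and the names `omp_row` and `dense_sparse_factorize` are hypothetical.

```python
import numpy as np

def omp_row(D, w, s):
    """Greedy orthogonal matching pursuit for one row.

    Approximates w (shape n) as c @ D, where D has shape (k, n) and
    c has at most s non-zero entries.
    """
    support, residual = [], w.copy()
    coef = np.zeros(0)
    for _ in range(s):
        # Pick the basis row most correlated with the current residual.
        scores = np.abs(D @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same row twice
        support.append(int(np.argmax(scores)))
        # Least-squares refit of w on the rows selected so far.
        A = D[support].T                      # shape (n, len(support))
        coef, *_ = np.linalg.lstsq(A, w, rcond=None)
        residual = w - A @ coef
    c = np.zeros(D.shape[0])
    c[support] = coef
    return c

def dense_sparse_factorize(W, k, s, seed=0):
    """Return (S, D) with W ~= S @ D, S having at most s non-zeros per row."""
    rng = np.random.default_rng(seed)
    # Assumption: start the basis from k random rows of W, unit-normalized.
    idx = rng.choice(W.shape[0], size=k, replace=False)
    D = W[idx] / np.linalg.norm(W[idx], axis=1, keepdims=True)
    S = np.stack([omp_row(D, w, s) for w in W])
    return S, D

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))          # toy weight block
S, D = dense_sparse_factorize(W, k=6, s=2)
rel_err = np.linalg.norm(W - S @ D) / np.linalg.norm(W)
dense_values = W.size                          # values stored by the dense block
factored_values = D.size + 2 * W.shape[0]      # basis values + s coefficients per row
print(f"relative error {rel_err:.3f}, {dense_values} -> {factored_values} stored values")
```

The stored-value count ignores the index overhead of the sparse coefficients, so it is an optimistic estimate of the size reduction.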

Key Findings

DSFormer achieves up to ~40% better compression than low-rank factorizers on evaluated tasks.

Numbers: "up to 40% better compression" (Abstract; Experiments)

Practical Use: Expect smaller models than low-rank factorization at matched accuracy; try DSFormer when you need tighter size-vs-accuracy trade-offs.

Evidence Ref: Abstract / Sec 5

At 2.8x compression DSFormer keeps GLUE average within ~1% of BERTBASE.

Numbers: GLUE avg: DSFormer 2.8x = 81.85 vs BERTBASE = 82.05 (Table 1)

Practical Use: You can reduce model size to ~35% of original and still retain near-baseline GLUE performance for many NLU tasks.

Evidence Ref: Table 1 (GLUE dev)
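The "~35% of original" figure follows directly from the compression ratio: a model compressed by a factor r occupies 1/r of its original size. A small sketch of that arithmetic for the three compression points quoted in this summary (the helper name `size_fraction` is hypothetical):

```python
def size_fraction(compression_ratio):
    """Fraction of the original model size left after compression."""
    return 1.0 / compression_ratio

# Compression points reported in the summary: 2x, 2.8x, 3.57x.
fractions = {cr: size_fraction(cr) for cr in (2.0, 2.8, 3.57)}
for cr, frac in fractions.items():
    print(f"{cr}x -> {frac:.1%} of original size")
```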

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GLUE average (dev) | DSFormer 2x = 83.13; DSFormer 2.8x = 81.85; BERTBASE = 82.05 | BERTBASE | 2x: +1.08 avg vs BERT; 2.8x: −0.2 avg vs BERT | GLUE dev | Table 1 dev set | Table 1
SQuAD v1.1 (EM / F1) | BERTBASE 80.8 / 88.5; DSFormer 2x 79.91 / 88.01; 2.8x 79.6 / 87.42 | BERTBASE | 2x: −0.89 EM, −0.49 F1 vs BERT | SQuAD v1.1 dev | Table 3 dev set | Table 3
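The deltas in the results table can be recomputed from the reported scores. The sketch below uses only the numbers quoted above; the 2.8x SQuAD deltas, which the table does not list, are derived by the same subtraction.

```python
# Scores as quoted in the results table (dev sets).
baseline = {"GLUE avg": 82.05, "SQuAD EM": 80.8, "SQuAD F1": 88.5}
dsformer = {
    "2x":   {"GLUE avg": 83.13, "SQuAD EM": 79.91, "SQuAD F1": 88.01},
    "2.8x": {"GLUE avg": 81.85, "SQuAD EM": 79.6,  "SQuAD F1": 87.42},
}

# Delta = DSFormer score minus BERTBASE score, rounded to two decimals.
deltas = {
    cr: {metric: round(score - baseline[metric], 2) for metric, score in scores.items()}
    for cr, scores in dsformer.items()
}

for cr, row in deltas.items():
    print(cr, row)
```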

What To Try In 7 Days

Apply DSFormer factorization (γ=1/4, δ=3/16) to attention and FFN weights to get ~2.8x compression and validate on your dev set.

Use the FT-F-STF schedule: fine-tune, factorize, then run STF stage to refine sparse structure for better accuracy.

Stack DSFormer on top of an existing distilled or quantized model to target another ≈2x size reduction and measure final accuracy.

Optimization Features

Infra Optimization

Designed to run on commodity CPUs/GPUs

Benefits from NVIDIA Ampere-like sparse-tensor support when available

Model Optimization

Dense–sparse block factorization (small dense basis × sparse coefficients)

Semi-structured sparsity per block (S non-zeros per K rows)

System Optimization

Cache-aware block matmul implementation to raise computational intensity

Semi-structured sparsity yields predictable memory access
Training Optimization

STF: alternating cheap gradient update for dense basis and OMP solve for sparse codes per forward pass

FT-F-STF training schedule (fine-tune → factorize → STF refine)
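One plausible reading of the STF step described above is sketched below on a toy regression problem. It is not the authors' code. The assumptions are: a dense "shadow" weight W is kept during training, the sparse codes S are re-solved from W with a greedy OMP on every forward pass, the dense basis D receives a single gradient step on the reconstruction error, and the task gradient computed through S·D is copied straight through to W. All function names, sizes, and step sizes are hypothetical.

```python
import numpy as np

def sparse_codes(D, W, s):
    """Greedy OMP per row: each row of W uses at most s rows of D."""
    S = np.zeros((W.shape[0], D.shape[0]))
    for i, w in enumerate(W):
        support, residual = [], w.copy()
        for _ in range(s):
            scores = np.abs(D @ residual)
            if support:
                scores[support] = -np.inf      # do not reselect a row
            support.append(int(np.argmax(scores)))
            A = D[support].T
            coef, *_ = np.linalg.lstsq(A, w, rcond=None)
            residual = w - A @ coef
        S[i, support] = coef
    return S

def stf_step(W, D, X, Y, s, lr_w=1.0, lr_d=0.5):
    # 1) Fresh sparse solve for this forward pass.
    S = sparse_codes(D, W, s)
    # 2) Cheap continuous update of the dense basis: one gradient step on
    #    0.5 * ||W - S D||_F^2, scaled by the spectral norm of S for stability.
    step = lr_d / (np.linalg.norm(S, 2) ** 2 + 1e-8)
    D = D - step * (S.T @ (S @ D - W))
    # 3) Forward pass with the factorized weight.
    W_hat = S @ D
    err = W_hat @ X - Y
    loss = 0.5 * np.mean(err ** 2)
    # 4) Straight-through: the gradient w.r.t. W_hat is applied to W as is.
    grad_w_hat = err @ X.T / err.size
    W = W - lr_w * grad_w_hat
    return W, D, S, loss

rng = np.random.default_rng(0)
m, n, k, s, N = 16, 8, 6, 2, 64               # toy sizes (assumptions)
X = rng.normal(size=(n, N))
Y = rng.normal(size=(m, n)) @ X               # targets from a random linear map
W = rng.normal(size=(m, n))
D = rng.normal(size=(k, n))
losses = []
for _ in range(20):
    W, D, S, loss = stf_step(W, D, X, Y, s)
    losses.append(loss)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

In the FT-F-STF schedule, a loop of this kind would correspond to the final refinement stage, after ordinary fine-tuning and an initial factorization of the fine-tuned weights.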

Inference Optimization

Block-wise dense matmul D·X then semi-structured sparse S·(D·X) enabling cache-friendly access

Leverages regular sparsity patterns suitable for hardware sparse tensor cores
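The inference order described above, D·X first and then S·(D·X), can be checked on a toy block. The sketch below assumes hypothetical sizes and stores S as a dense array with zeros, whereas a real kernel would skip the zero entries. It confirms that the two-step product matches (S·D)·X and compares multiply-accumulate counts under the assumption that only the s non-zeros per row of S are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, s, batch = 12, 10, 4, 2, 5           # hypothetical block sizes

D = rng.normal(size=(k, n))                   # small dense basis
S = np.zeros((m, k))                          # sparse coefficients, s per row
for i in range(m):
    cols = rng.choice(k, size=s, replace=False)
    S[i, cols] = rng.normal(size=s)
X = rng.normal(size=(n, batch))               # input activations

Z = D @ X                                     # step 1: dense matmul, shape (k, batch)
Y = S @ Z                                     # step 2: sparse coefficients applied to Z
matches = np.allclose(Y, (S @ D) @ X)         # same result as the reconstructed weight

# Multiply-accumulate counts: full dense block vs. factorized two-step product.
dense_macs = m * n * batch
factored_macs = k * n * batch + m * s * batch
print(matches, dense_macs, factored_macs)
```

The operation counts say nothing about wall-clock time, which depends on how well the hardware and kernels exploit the sparsity pattern.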

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

No end-to-end wall-clock speed-up measurements on real hardware are provided.

DSFormer is task-specific and requires a separate compressed model per task.

When Not To Use

When you need a single task-agnostic compressed model for many tasks.

When your target hardware lacks support for semi-structured sparsity and you cannot exploit predictable sparse patterns.

Failure Modes

At extreme compression (e.g., 3.57x) accuracy can drop noticeably for some tasks.

Stacking DSFormer on already very compressed models (ALBERT 12x) can cause >2% drop on difficult datasets (Table 2).

Core Entities

Models

DSFormer, BERTBASE, DistilBERT, TinyBERT-6, DRONE, FWSVD, NxMTransformer, ASP, ALBERT, Q8BERT

Metrics

Accuracy, F1, Exact Match (EM), Spearman, GLUE average

Datasets

GLUE, SQuAD v1.1, SQuAD v2.0

Benchmarks

GLUE dev, SQuAD dev

Context Entities

Models

ALBERT, Q8BERT, TinyBERT, DistilBERT

Metrics

Compression ratio (CR), Accuracy, F1, EM

Datasets

GLUE, SQuAD

Benchmarks

GLUE, SQuAD