117 papers found

Prune 50–60% of GPT-scale weights in one pass, no retraining, with minor accuracy loss

0.80
0.70
0.90
69

SparseGPT can cut model memory and inference compute roughly in half for massive GPT models, enabling cheaper hosting and faster inference without retraining. Joint sparsity+quantization can match lower-bit storage with better accuracy than pure quantization.

Key finding

Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.

Numbers: 50–60% sparsity; removes ≈100B weights from OPT-175B/BLOOM-176B

Lossless 3-bit LLM quantization with dense-and-sparse weights

0.80
0.70
0.80
23

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Key finding

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55

Initialize the student from the teacher and prune it slowly while distilling to keep predictions close and improve small models

0.60
0.60
0.60
9

HomoDistil produces smaller, better-performing BERT derivatives by pruning from the teacher while distilling; this saves storage and lowers fine-tuning costs while preserving quality—useful when you need compact models with higher accuracy than typical distilled alternatives.

Key finding

HomoBERT-base (65M) improves GLUE average over DistilBERT-style baselines.

Numbers: GLUE avg score 83.8 vs DistilBERT 82.1 on dev

Many pre-trained transformers already contain a large "free" sparse subnetwork you can remove with little cost

0.65
0.60
0.70
6

You can cut 30–50% of many pre-trained transformers with one cheap pruning pass and little accuracy loss, saving memory and inference costs without costly retraining.

Key finding

Pre-trained transformers have an 'essential sparsity' range where many weights can be removed with minimal loss.

Numbers: About 30–50% of weights removable with ≤1% downstream drop (evaluated tasks)

Bonsai: prune large language models using only forward passes to cut memory needs and keep accuracy

0.75
0.70
0.85
6

Bonsai makes structured LLM compression feasible on commodity GPUs, cutting memory needs and producing faster models so teams can reduce inference cost and enable on-device fine-tuning without enterprise hardware.

Key finding

Bonsai cuts pruning memory requirements to inference-only levels, enabling pruning on ≈20GB devices instead of 80–160GB.

Numbers: pruning memory ≈20GB vs 80–160GB for gradient methods

Prune LLMs with LoRA gradients to get structured, fast models using far less memory

0.80
0.60
0.70
4

LoRAPrune cuts pruning memory and gives real GPU latency wins while keeping better accuracy than prior structured-pruning methods, enabling practical deployment of much larger LLMs on fewer GPUs.

Key finding

At 50% structured compression, LoRAPrune yields much lower perplexity than a leading baseline (LLM-Pruner) on language modeling benchmarks.

Numbers: WikiText2: 11.60 vs 16.41 (delta -4.81); PTB: 17.39 vs 20.85 (delta -3.46)

SquareHead L2 distillation enables high-sparsity fine-tuning and real CPU/GPU inference speedups

0.70
0.60
0.70
4

Sparsity plus SquareHead can reduce LLM inference latency and cost on CPUs/GPUs (2–8x) while keeping accuracy for many tasks, enabling cheaper deployment on commodity hardware.

Key finding

SquareHead (L2 feature distillation) stabilizes sparse fine-tuning and recovers accuracy where CE and standard KD diverge.

Use activation entropy + channel shuffling to get one-shot N:M sparsity for LLMs with big memory and latency wins

0.70
0.60
0.80
4

E-Sparse cuts LLM GPU memory by ~43% and speeds matrix work 1.24–1.53× on Ampere hardware, letting teams host larger models or reduce instance costs with small accuracy trade-offs.

Key finding

E-Sparse reduces LLaMA-13B WikiText perplexity under 2:4 sparsity to 8.26.

Numbers: LLaMA-13B 2:4 perplexity = 8.26 (FP16 = 5.09)
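
For readers unfamiliar with the 2:4 pattern, here is a minimal sketch of N:M sparsity: in every group of M consecutive weights, only N stay nonzero. It ranks weights by plain magnitude, which is an assumption made for illustration and not E-Sparse's entropy-based metric or its channel shuffling; `prune_n_of_m` is a hypothetical helper name.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m; zero the rest.

    Magnitude ranking is an illustrative stand-in, not the E-Sparse criterion.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |w| inside this group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_n_of_m(weights)
print(sparse)  # exactly two nonzeros survive in each group of four
```

The fixed per-group budget is what lets hardware with 2:4 support skip the zeroed positions, unlike unstructured sparsity at the same overall ratio.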

Train task-focused supervised fine-tuning and preference alignment in parallel, then sparsify and merge adapters to avoid alignment tax.

0.70
0.60
0.60
4

PAFT can preserve both task accuracy and alignment without retraining large models end-to-end; companies can run SFT and alignment in parallel, sparsify adapters, and merge them to ship stronger, aligned models faster.

Key finding

Parallel training (PAFT) plus L1-sparsified SFT improves merged-model scores versus sequential or standalone training on the 6-task Open LLM suite.

Numbers: PAFT (SFTsparse + DPO) avg=0.65243 vs DPO-alone 0.6333 (Mistral-7B)

BESA: differentiable block-wise pruning that learns layer sparsity — prunes 7B–70B models on one A100 in hours

0.70
0.60
0.70
3

BESA makes aggressive pruning of large LLMs practical on a single A100 GPU, preserving model quality and enabling lower-cost deployment or faster inference when paired with quantization.

Key finding

Lower perplexity than prior one-shot methods at 50% unstructured sparsity on LLaMA models.

Numbers: Example: LLaMA2-70B Wikitext2 ppl BESA 4.09 vs SparseGPT 4.25 (Table 1)

Sparsify then quantize — the proven best order; combining them still adds nontrivial error

0.70
0.60
0.80
2

If you compress models with pruning and block-wise quantization, order and method choice change accuracy and thus service quality; using sparsity before quantization (S → Q) is an easy rule to reduce avoidable accuracy loss.

Key finding

Sparsity and max-scaled block-wise quantization are non-orthogonal.
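
A toy sketch of why the order can matter, assuming magnitude pruning and symmetric max-scaled quantization on a single four-value block. The helper names, the 4-bit setting, and the block values are all hypothetical; the paper's analysis is more general than this one case.

```python
def quantize_block(block, bits=4):
    """Max-scaled symmetric quantization: the block's max |x| sets the step."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    scale = max(abs(x) for x in block) / levels
    if scale == 0.0:
        return list(block)
    return [round(x / scale) * scale for x in block]


def prune_block(block, keep):
    """Magnitude pruning: keep the `keep` largest |x|, zero the rest."""
    order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
    kept = set(order[:keep])
    return [x if i in kept else 0.0 for i, x in enumerate(block)]


def sq_error(a, b):
    """Sum of squared differences between two equal-length blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


block = [0.22, 0.27, 1.00, -0.05]  # hypothetical weights

s_then_q = quantize_block(prune_block(block, keep=2))  # S -> Q
q_then_s = prune_block(quantize_block(block), keep=2)  # Q -> S

print("S->Q error:", sq_error(block, s_then_q))
print("Q->S error:", sq_error(block, q_then_s))
```

In this block, 0.22 and 0.27 round to the same quantized level, so pruning after quantization can no longer tell them apart and keeps the smaller one. Pruning first still sees the original magnitudes and keeps 0.27.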

TEAL: thresholding hidden activations to cut memory movement and speed up LLM decoding without extra training

0.70
0.60
0.70
2

TEAL cuts memory movement during single-batch decoding and delivers up to ~1.8× throughput gain without retraining, lowering inference cost for edge or low-latency deployments.

Key finding

TEAL achieves 40–50% model-wide activation sparsity with small accuracy degradation on evaluated Llama-2/3 and Mistral models.

Numbers: Perplexity: LLaMA-3-8B PPL 5.87 → 6.21 (40%), 6.67 (50%) on WikiText
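
Magnitude thresholding of activations can be sketched as follows. This assumes the threshold is picked from a small calibration sample so that roughly a target fraction of entries falls below it; the function names and the calibration values are hypothetical, and this is not TEAL's actual implementation.

```python
def threshold_for_sparsity(calibration, target):
    """Magnitude below which roughly `target` of the calibration values fall."""
    mags = sorted(abs(x) for x in calibration)
    cut = int(len(mags) * target)
    return mags[cut] if cut < len(mags) else float("inf")


def sparsify(activations, threshold):
    """Zero every activation whose magnitude is below the threshold."""
    return [x if abs(x) >= threshold else 0.0 for x in activations]


# Hypothetical calibration sample of hidden activations.
calibration = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -1.0]
t = threshold_for_sparsity(calibration, target=0.4)
sparse = sparsify(calibration, t)
print(t, sparse)
```

Zeroed activations mean the matching weight columns need not be read during decoding, which is where the reduced memory movement would come from.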

Compress PEFT adapters 8–50x with sparse ternary encoding, often preserving or improving accuracy

0.80
0.60
0.80
2

ComPEFT slashes adapter size and transfer time so you can host many more task experts per GPU, reduce bandwidth costs, and cut serving latency without retraining.

Key finding

ComPEFT compresses PEFT updates by 8x–50x without retraining.

Numbers: 8x–50x compression (reported vs 16-bit checkpoints)
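
A rough sketch of what a sparse ternary encoding of an adapter update can look like: keep only the largest-magnitude entries, store their signs, and share one scalar magnitude. The density value and the choice of mean absolute value as the scalar are assumptions for illustration, not ComPEFT's exact recipe, and the function names are hypothetical.

```python
def ternary_compress(delta, density=0.25):
    """Return ({index: sign}, scale) for the top `density` fraction of entries."""
    keep = max(1, int(len(delta) * density))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    top = order[:keep]
    # Assumed scale: mean magnitude of the kept entries.
    scale = sum(abs(delta[i]) for i in top) / keep
    signs = {i: (1 if delta[i] > 0 else -1) for i in top}
    return signs, scale


def ternary_decompress(signs, scale, length):
    """Rebuild a dense update whose entries are -scale, 0, or +scale."""
    out = [0.0] * length
    for i, s in signs.items():
        out[i] = s * scale
    return out


delta = [0.02, -0.5, 0.01, 0.3, -0.04, 0.0, 0.7, -0.1]  # hypothetical update
signs, scale = ternary_compress(delta)
restored = ternary_decompress(signs, scale, len(delta))
print(signs, scale, restored)
```

Storage then reduces to a sparse index set, one sign bit per kept entry, and a single float, which is the kind of layout that makes large compression ratios over 16-bit checkpoints plausible.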

Use each layer's outlier count to set non-uniform sparsity for much better LLM pruning

0.70
0.70
0.80
2

OWL lets you prune large language models up to ~70% while keeping useful quality and delivering real CPU speedups, enabling cheaper, faster inference and easier deployment on constrained hardware.

Key finding

OWL sharply lowers perplexity vs Wanda at 70% sparsity on LLaMA-7B.

Numbers: Wanda 85.77 → OWL w. Wanda 24.55 (∆ −61.22) on WikiText

Quest speeds long-context LLM decoding by loading only the KV cache pages likely relevant to the current query

0.70
0.70
0.80
1

Quest reduces memory bandwidth and decode latency for very long-context LLM calls, lowering GPU cost per request and improving responsiveness for document-heavy applications.

Key finding

Quest achieves large self-attention speedups by loading only top-K pages instead of the full KV cache.

Numbers: 7.03× self-attention speedup at 32K seq, token budget 2048
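
Query-aware page selection can be sketched like this. The scoring rule, an upper bound on the query-key dot product built from per-page elementwise min and max keys, is an assumption used for illustration, as are the function names and the toy pages; the listing above only states that the top-K pages are loaded.

```python
def page_bounds(keys):
    """Elementwise min and max over all key vectors in one page."""
    dims = range(len(keys[0]))
    lo = [min(k[d] for k in keys) for d in dims]
    hi = [max(k[d] for k in keys) for d in dims]
    return lo, hi


def page_score(query, lo, hi):
    """Upper bound on q.k for any key lying inside the page's bounding box."""
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_pages(query, pages, top_k):
    """Indices of the top_k pages by score, returned in cache order."""
    scores = [page_score(query, *page_bounds(p)) for p in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Three hypothetical pages, each holding two 2-d keys.
pages = [
    [[1.0, 0.0], [0.5, 0.2]],
    [[-1.0, 0.0], [-0.8, 0.1]],
    [[0.9, 0.9], [1.0, 0.5]],
]
query = [1.0, 0.5]
chosen = select_pages(query, pages, top_k=2)
print(chosen)
```

Only the per-page bounds have to be read to rank pages, so the full keys and values are fetched for the selected pages alone.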

Flash-LLM: run sparsified LLMs on tensor cores with up to ~3–3.8× real inference speedups and lower GPU cost

0.80
0.65
0.80
1

Flash-LLM cuts inference GPU cost and increases throughput for production LLM serving by enabling practical unstructured sparsity on tensor cores.

Key finding

Flash-LLM speeds kernel SpMM 2–3.6× over Sputnik and 1.4–1.6× over SparTA depending on sparsity.

Numbers: avg 3.6×/1.4× at 70% sparsity; 3.0×/1.4× at 80%; 2.0×/1.6× at 90%

Combine sparse neuron activity with weight pruning to cut RNN inference work up to ~20× while keeping language-model quality

0.60
0.50
0.70
1

If you can deploy on event-driven or neuromorphic hardware, combining sparse activations with weight pruning can cut inference work dramatically without large quality loss, lowering energy and latency for low-power or real-time apps.

Key finding

Activity sparsity and weight sparsity multiply to reduce operations.

Numbers: Effective operations scale ≈ λ_a × λ_w (analytic relation)
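
A worked instance of the multiplicative relation, assuming λ_a is the fraction of active neurons and λ_w the fraction of nonzero weights. The two values below are hypothetical and were chosen only to show how a roughly 20× reduction could arise.

```python
def effective_ops_fraction(lambda_a, lambda_w):
    """Fraction of dense operations left when both sparsities apply."""
    return lambda_a * lambda_w


lambda_a = 0.25  # hypothetical: 25% of neurons active per step
lambda_w = 0.20  # hypothetical: 20% of weights kept after pruning

fraction = effective_ops_fraction(lambda_a, lambda_w)
reduction = 1.0 / fraction
print(f"remaining ops: {fraction:.2%}, reduction: {reduction:.1f}x")
```

Neither factor alone reaches that figure here (4× and 5× respectively); the gain comes from the product.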

Factor transformer weight matrices into a small dense basis and sparse per-row coefficients to get stronger compression than low-rank factorization

0.60
0.70
0.70
1

DSFormer reduces transformer model size substantially (2x–3.6x) while keeping accuracy close to original models and can be stacked with distillation/quantization to cut hosting or edge deployment costs further.

Key finding

DSFormer achieves up to ~40% better compression than low-rank factorizers on evaluated tasks.

Numbers: "up to 40% better compression" (Abstract; Experiments)

SpIEL: memory-efficient sparse fine-tuning that scales PEFT to LLaMA‑2 (7B, 13B) and works with 4‑bit quantization

0.70
0.60
0.70
1

SpIEL lets teams fine-tune large LLMs with much less extra GPU memory by tuning only a sparse set of parameters, enabling on-prem or single-GPU adaptation and cheaper experimentation under quantization.

Key finding

SpIEL-AG improves MMLU on LLaMA2-7B trained on Flan v2 versus LoRA.

Numbers: MMLU 50.7 (SpIEL-AG) vs 49.3 (LoRA); +1.4 pts

Attention is provably n^C-sparse: use α·n^C top entries for stable sparse attention

0.60
0.70
0.70
1

Using a context-aware window k = α·n^C can cut attention compute while keeping accuracy; dynamic windows can be tuned to a given compute budget and often beat tiny fixed windows on long inputs.

Key finding

Keeping Ω(n^C) top entries per query suffices for vanishing approximation error.

Numbers: Error = O(1/n^C) when k = Ω(n^C) (Theorem 6.1)
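
The window rule k = α·n^C can be sketched as top-k attention for a single query. The values α = 1.0 and C = 0.5, the function names, and the sample scores are hypothetical; the theorem cited above concerns the asymptotic rate, not these constants.

```python
import math


def window_size(n, alpha=1.0, c=0.5):
    """k = ceil(alpha * n**c), clamped to the range [1, n]."""
    return max(1, min(n, math.ceil(alpha * n ** c)))


def topk_attention_weights(scores, k):
    """Softmax restricted to the k largest scores; other positions get 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(len(scores))]


# Hypothetical attention scores for one query over 16 keys.
scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5,
          0.2, 0.3, 2.5, -2.0, 0.4, 0.6, 0.7, 0.8]
k = window_size(len(scores))
weights = topk_attention_weights(scores, k)
print(k, [round(w, 3) for w in weights])
```

Because k grows with n, the window widens on longer inputs instead of staying at a fixed small size.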

A principled de-noising dequantization makes stable training possible at 1-bit and sub-1-bit precision

0.70
0.70
0.80
1

This method makes extreme quantization and M:N sparsity reliable, letting you cut model storage and arithmetic cost while preserving accuracy—so you can deploy larger models under tight memory/energy budgets.

Key finding

Explicit dequantization stabilizes ultra-low-bit training where STE diverges.

Numbers: A1.5W1.5: STE failed; our method 0.3297 accuracy (Shakespeare / nanoGPT)

Pruning Transformers for time-series: big FLOP drops but small real speedups; fine-tune and right-size models

0.50
0.40
0.60
1

Pruning cuts model size and theoretical compute but doesn't guarantee runtime speedups; measuring on your hardware and considering smaller architectures first saves cost and deployment time.

Key finding

Most models sustain pruning to about 50% density with little test loss increase.

Numbers: ≈50% density without significant MSE rise (Fig.1, Sec.4.1).

Combine per-group 4-bit quantization with GPU-friendly group sparsity to speed LLM decoding with small accuracy loss.

0.70
0.60
0.80
1

GQSA delivers multi× inference speedups and big memory savings for 7B–30B class LLMs while often preserving or improving zero-shot accuracy, enabling cheaper, faster serving on GPUs and more viable edge deployment.

Key finding

GQSA preserves accuracy better than heavy quantization or 2:4 pruning on evaluated models.

Numbers: avg +5.4% acc vs OmniQuant W2 on LLaMA-2-7B (W4+S50%)

PGB: one-shot, group-and-permute pruning that compresses task-tuned BERT in hours with small accuracy loss

0.70
0.60
0.80
1

PGB cuts model compression time from days to hours while keeping most task accuracy, so teams can produce and deploy smaller BERT models faster and with lower compute cost.

Key finding

PGB preserves most GLUE accuracy at 50% pruning.

Numbers: QNLI 90.3 vs 91.4 baseline; SST-2 92.3 vs 93.2 baseline