190 papers found

Task‑agnostic structured pruning for LLMs that cuts params and memory, then recovers in hours with 50K samples

0.70
0.60
0.70
73

You can cut ~20% of an LLM's parameters to save memory and gain speed without losing most zero-shot ability, then re-tune the pruned model in hours on one GPU with modest public data, which is much cheaper than retraining or distillation.

Key finding

20% parameter pruning can be recovered to near-original zero-shot accuracy with limited tuning

Numbers: Pruned 20% → average accuracy 60.07; baseline 63.25; 94.97% retained
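
The retained-accuracy figure is simply the pruned score divided by the baseline score. A minimal check of that arithmetic, using the two accuracy values quoted in this card (the function name is illustrative, not from the paper):

```python
def retained_fraction(pruned_score: float, baseline_score: float) -> float:
    """Fraction of the baseline score that the pruned model keeps."""
    return pruned_score / baseline_score


# Accuracy values quoted in the card: pruned 60.07, baseline 63.25.
pct = 100 * retained_fraction(60.07, 63.25)
print(f"{pct:.2f}% retained")
```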

Prune 50–60% of GPT-scale weights in one pass, no retraining, with minor accuracy loss

0.80
0.70
0.90
69

SparseGPT can cut model memory and inference compute roughly in half for massive GPT models, enabling cheaper hosting and faster inference without retraining. Joint sparsity+quantization can match lower-bit storage with better accuracy than pure quantization.

Key finding

Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.

Numbers: 50–60% sparsity; removes ≈100B weights from OPT-175B/BLOOM-176B

Wanda: prune LLM weights by weight magnitude × input-activation norm — no retraining, much faster than prior LLM pruning

0.70
0.60
0.80
52

Wanda makes one-shot LLM pruning cheap and practical: you can cut ~50% of parameters without retraining and with minimal calibration data, saving memory and possibly inference cost while keeping most model quality.

Key finding

Wanda achieves far lower perplexity than magnitude pruning on LLaMA-7B at 50% sparsity.

Numbers: Perplexity 7.26 (Wanda) vs 17.29 (magnitude) on WikiText (LLaMA-7B, 50%)
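
The scoring rule named in the title (weight magnitude × input-activation norm) can be sketched in a few lines. This is a toy, pure-Python illustration, not the authors' implementation: the function name, the per-output-row comparison group, and the tiny example matrices are assumptions made for the sketch.

```python
import math


def wanda_mask(weights, activations, sparsity=0.5):
    """Return a keep-mask (True = keep) for a weight matrix.

    weights:     rows of shape [out_features][in_features]
    activations: calibration inputs of shape [tokens][in_features]
    """
    n_in = len(weights[0])
    # L2 norm of each input feature over the calibration tokens.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    mask = []
    for row in weights:
        # Score = |weight| * norm of the matching input feature.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Prune the lowest-scoring weights within each output row (assumed grouping).
        pruned = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        mask.append([j not in pruned for j in range(n_in)])
    return mask


# Hypothetical data: feature 1 carries a small weight but very large activations.
W = [[0.9, -0.1, 0.5, 0.2]]
X = [[0.1, 10.0, 0.1, 1.0], [0.1, 10.0, 0.1, 1.0]]
mask = wanda_mask(W, X)
print(mask)
```

In this toy case the small weight at index 1 is kept because its input feature is large, whereas plain magnitude pruning would have removed it.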

Practical review of quantization, pruning, distillation and low-rank compression for LLMs

0.60
0.30
0.80
37

Compression cuts model memory, cost and inference latency so LLMs can run on fewer GPUs or at lower cloud cost; pick compression that fits your accuracy budget and hardware.

Key finding

Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.

Numbers: WikiText-2 perplexity +0.34; speedup 3.24× (GPTQ, Table 1)

ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

0.60
0.50
0.70
15

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Key finding

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% layers removed)
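
One way to picture "low-impact" layers is to rank each layer by how little it changes its input. The sketch below scores a layer as 1 minus the cosine similarity between the hidden state entering it and the one leaving it, then picks the lowest-scoring layers for removal. The exact metric, the function names, and the toy hidden states are assumptions for illustration; the card does not specify how the paper measures impact.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layers_to_remove(hidden_states, n_remove):
    """hidden_states[i] enters layer i and hidden_states[i + 1] leaves it.

    Returns the indices of the n_remove layers that change their input least.
    """
    influence = [
        1.0 - cosine(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    ranked = sorted(range(len(influence)), key=lambda i: influence[i])
    return sorted(ranked[:n_remove])


# Hypothetical 4-layer model with 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.1], [1.0, 1.0], [1.0, 1.05]]
removed = layers_to_remove(states, 2)
print(removed)
```

In practice the hidden states would be averaged over a calibration set rather than taken from a single vector per layer.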

Practical guide to compressing Transformers: quantization, pruning, distillation and efficient architectures

0.75
0.40
0.85
13

Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.

Key finding

8-bit and 6-bit post-training quantization often works well, but extreme low-bit (4-bit or below) frequently degrades performance.

Numbers: Table 2: ViT-B top-1 84.54 (FP) vs 8-bit PTQ 76.98, 6-bit PTQ 75.26/81.65 depending on method

A practical survey of compression and speed tricks to run large language models on limited hardware

0.80
0.50
0.85
13

Compression and better kernels let teams run large LLMs on fewer GPUs or even on single workstations, cutting hosting costs and enabling edge/embedded use cases without losing core capabilities.

Key finding

Quantizing FP32 weights to 4-bit cuts model size to roughly one-eighth.

Numbers: ≈1/8 model size when FP32→INT4
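
The one-eighth figure follows from the bit widths alone (4 / 32). A small sketch of the arithmetic, assuming a hypothetical 7B-parameter model and ignoring the scales and zero-points that real quantized formats also store, which is why the saving is only approximate in practice:

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Storage for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


n = 7_000_000_000  # hypothetical 7B-parameter model
fp32 = weight_bytes(n, 32)
int4 = weight_bytes(n, 4)
ratio = int4 / fp32
print(f"FP32: {fp32 / 1e9:.1f} GB, INT4: {int4 / 1e9:.1f} GB, ratio: {ratio}")
```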

Cut big LLMs into smaller ones by pruning plus distillation; same or better accuracy with far less retraining data.

0.80
0.60
0.80
10

If you run multiple model sizes, prune a big pretrained model and distill smaller variants to cut token and compute costs dramatically while keeping or improving accuracy.

Key finding

Pruning plus distillation cuts the training tokens needed for each additional model by up to 40× versus training a model of that size from scratch.

Numbers: Up to 40× fewer tokens to derive 8B/4B (Abstract; Table 2,3)

Initialize the student from the teacher and prune it slowly while distilling to keep predictions close and improve small models

0.60
0.60
0.60
9

HomoDistil produces smaller, better-performing BERT derivatives by pruning from the teacher while distilling; this saves storage and lowers fine-tuning costs while preserving quality—useful when you need compact models with higher accuracy than typical distilled alternatives.

Key finding

HomoBERT-base (65M) improves GLUE average over DistilBERT-style baselines.

Numbers: GLUE avg score 83.8 vs DistilBERT 82.1 on dev

Learned token-dropping that prunes up to 80% of context to speed and shrink autoregressive Transformers

0.70
0.60
0.75
8

You can cut inference memory and often double throughput on long prompts by fine-tuning a small module, reducing costs for batched long-context services.

Key finding

The fine-tuned model can remove ~80% of prior tokens with almost no perplexity loss.

Numbers: 80.35% sparsity → −0.085 avg perplexity (context=1000) vs dense

ZipLM: inference-aware structured pruning that gives runtime speedup guarantees across devices

0.80
0.40
0.80
7

ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.

Key finding

ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.

Numbers: ≈ +3 F1 points vs CoFi at same speedup (SQuAD dev)

INT4 (4-bit) gives big latency wins for encoder models with little accuracy loss, but breaks decoder-only generators; optimized INT4 kernels

0.70
0.40
0.80
6

On Ampere GPUs, INT4 computation can sharply reduce latency and cost for encoder-based workloads (search, classification, embedding). But it is risky to use for autoregressive generation (chatbots, text generation) until activation-quantization problems are solved.

Key finding

Encoder models (BERT) keep accuracy under W4A4 QAT+KD.

Numbers: BERT-base MNLI 84.20 (FP32) → 84.31 (W4A4 symmetric)
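
"Symmetric" 4-bit quantization maps values onto signed integers with a single scale and no zero-point. The sketch below shows only that round-to-nearest mapping for a short list of values; it is not the QAT+KD training pipeline the card refers to, and the function names and example values are illustrative assumptions.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


weights = [0.7, -0.3, 0.12, 0.0]  # hypothetical values
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
print(q, restored)
```

In a W4A4 setting the same kind of mapping is applied to activations as well as weights, which is where the card reports trouble for decoder-only generators.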

Bonsai: prune large language models using only forward passes to cut memory needs and keep accuracy

0.75
0.70
0.85
6

Bonsai makes structured LLM compression feasible on commodity GPUs, cutting memory needs and producing faster models so teams can reduce inference cost and enable on-device fine-tuning without enterprise hardware.

Key finding

Bonsai cuts pruning memory requirements to inference-only levels, enabling pruning on ≈20GB devices instead of 80–160GB.

Numbers: pruning memory ≈20GB vs 80–160GB for gradient methods

Prune LLMs with LoRA gradients to get structured, fast models using far less memory

0.80
0.60
0.70
4

LoRAPrune cuts pruning memory and gives real GPU latency wins while keeping better accuracy than prior structured-pruning methods, enabling practical deployment of much larger LLMs on fewer GPUs.

Key finding

At 50% structured compression, LoRAPrune yields much lower perplexity than a leading baseline (LLM-Pruner) on language modeling benchmarks.

Numbers: WikiText2: 11.60 vs 16.41 (delta -4.81); PTB: 17.39 vs 20.85 (delta -3.46)

SquareHead L2 distillation enables high-sparsity fine-tuning and real CPU/GPU inference speedups

0.70
0.60
0.70
4

Sparsity plus SquareHead can reduce LLM inference latency and cost on CPUs/GPUs (2–8x) while keeping accuracy for many tasks, enabling cheaper deployment on commodity hardware.

Key finding

SquareHead (L2 feature distillation) stabilizes sparse fine-tuning and recovers accuracy where CE and standard KD diverge.
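
A rough sketch of what an L2 feature-distillation loss can look like: per layer, the squared error between student and teacher features, divided by the teacher's own feature energy so that layers with large activations do not dominate. The normalization term, the function names, and the toy features are assumptions for this sketch, not a confirmed statement of the paper's exact loss.

```python
def normalized_mse(student, teacher, eps=1e-6):
    """Squared error between feature vectors, scaled by the teacher's energy."""
    n = len(teacher)
    mse = sum((s - t) ** 2 for s, t in zip(student, teacher)) / n
    teacher_energy = sum(t ** 2 for t in teacher) / n
    return mse / (teacher_energy + eps)


def feature_distillation_loss(student_layers, teacher_layers):
    """Sum the normalized per-layer errors over all layers."""
    return sum(normalized_mse(s, t) for s, t in zip(student_layers, teacher_layers))


# Hypothetical two-layer features: layer 0 matches, layer 1 is off by 10%.
teacher = [[1.0, 2.0], [10.0, 20.0]]
student = [[1.0, 2.0], [11.0, 22.0]]
loss = feature_distillation_loss(student, teacher)
print(loss)
```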

Prune LLMs with LoRA-aware dependency graphs, progressive pruning, and dynamic recovery to cut footprint with limited GPUs

0.60
0.70
0.70
4

LoRAShear aims to cut LLM parameter footprint with a workflow that fits a single A100 GPU, enabling smaller teams to try structured pruning without large clusters while keeping practical accuracy on some benchmarks.

Key finding

The authors report that pruning 20% of parameters causes only ~1.0% performance regression relative to the full LLaMA v1.

Numbers: 20% prune → 1.0% regression (paper claim)

Use activation entropy + channel shuffling to get one-shot N:M sparsity for LLMs with big memory and latency wins

0.70
0.60
0.80
4

E-Sparse cuts LLM GPU memory by ~43% and speeds matrix work 1.24–1.53× on Ampere hardware, letting teams host larger models or reduce instance costs with small accuracy trade-offs.

Key finding

E-Sparse reduces LLaMA-13B WikiText perplexity under 2:4 sparsity to 8.26.

Numbers: LLaMA-13B 2:4 perplexity = 8.26 (FP16 = 5.09)
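
In 2:4 (N:M) sparsity, every group of four consecutive weights keeps exactly two nonzero entries, which is the pattern Ampere sparse kernels accelerate. The sketch below builds such a mask using plain weight magnitude as the importance score. That is a deliberate simplification: E-Sparse's entropy-based metric and channel shuffling are not reproduced here, and the function name and example row are assumptions.

```python
def nm_mask(row, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m consecutive weights."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        mask.extend(i in keep for i in range(m))
    return mask


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]  # hypothetical weights
mask = nm_mask(row)
pruned = [w if keep else 0.0 for w, keep in zip(row, mask)]
print(pruned)
```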

Compression can preserve or break LLM trust: 4-bit quantization often keeps or even improves ethics/fairness, pruning and 3-bit quantization

0.70
0.60
0.80
4

Compression can save cost and enable deployment on consumer GPUs, but it can also change model safety in ways that accuracy tests miss. Pick compression methods and bit-rates with trust tests, not just MMLU.

Key finding

4-bit post-training quantization usually preserves trustworthiness within small margins.

Numbers: ≤5-point drop across 8 trust metrics (LLaMA2-13B Chat, 4-bit)

Combine pruning, distillation and post-training quantization to run a ViT-style segmenter on a 4GB Jetson Nano with small accuracy loss

0.45
0.40
0.65
3

You can run a practical ViT-style segmenter on a $100–$150 Jetson Nano by combining distillation and fp16 quantization, giving near-teacher accuracy while keeping model size and RAM within real device limits.

Key finding

Distillation substantially boosts MobileViT segmentation accuracy.

Numbers: MIoU from 0.5365 to 0.6056 (+0.069)

Cut LLM size by pruning neurons used outside your domain, keep performance after short fine-tuning

0.40
0.50
0.60
3

Contextual pruning can cut domain-model size quickly (often ~10%) and enable cheaper on-prem or lower-latency deployments while keeping task performance after short fine-tuning.

Key finding

Mild contextual pruning (linear+activation threshold 1e-3, embed prune <=0) reduced usable model size by ~10% while preserving or improving perplexity after brief fine-tuning.

Numbers: Phi-1.5 Medical: relative size 90.134%, perplexity 4.64 → 4.579 → 2.722 (post-prune→fine-tune) (Table 3)

Survey of pruning, quantization, distillation and low-cost methods for compressing modern LLMs

0.70
0.30
0.90
3

Compressing LLMs cuts hosting and inference costs and enables deployment on cheaper hardware; low-cost, post-training methods make this feasible without retraining large models.

Key finding

Low-cost, post-training methods now enable compression of very large LLMs without full retraining.

Numbers: SparseGPT prunes a 175B model in ~3 hours on one A100

BESA: differentiable block-wise pruning that learns layer sparsity — prunes 7B–70B models on one A100 in hours

0.70
0.60
0.70
3

BESA makes aggressive pruning of large LLMs practical on a single A100 GPU, preserving model quality and enabling lower-cost deployment or faster inference when paired with quantization.

Key finding

Lower perplexity than prior one-shot methods at 50% unstructured sparsity on LLaMA models.

Numbers: Example: LLaMA2-70B Wikitext2 ppl BESA 4.09 vs SparseGPT 4.25 (Table 1)

Drop MoE layers or blocks + quantize experts to cut memory and run time with small accuracy loss

0.70
0.40
0.80
3

Coarse structural compression plus 4-bit quantization can cut inference cost and memory enough to run large MoE models on cheaper GPUs while losing only a small fraction of task accuracy.

Key finding

Block Drop + 4-bit quantization produces major runtime and memory reductions while keeping most accuracy.

Numbers: 6.05× speedup; memory 20.0GB; >92% performance (Mixtral-8×7B)

Moderate WANDA pruning (10–20%) increases jailbreak resistance of 7B LLMs without fine-tuning

0.60
0.50
0.70
3

Pruning attention weights at modest sparsity (10–20%) is a low-cost safety lever: it can raise refusal rates to harmful prompts and shrink model size without extra fine-tuning or big performance loss.

Key finding

Moderate attention-layer WANDA pruning increases refusal rates to jailbreak prompts.

Numbers: LLaMA-2: average +8.5% refusal rate across five categories