97 papers found

Practical review of quantization, pruning, distillation and low-rank compression for LLMs

0.60
0.30
0.80
37

Compression cuts model memory, cost and inference latency so LLMs can run on fewer GPUs or at lower cloud cost; pick compression that fits your accuracy budget and hardware.

Key finding

Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.

Numbers: Wikitext-2 perplexity +0.34; speedup 3.24× (GPTQ, Table 1)
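To make "weight-only quantization to 3 bits" concrete, here is a minimal sketch of symmetric round-to-nearest (RTN) quantization on a single weight row. This is the simple baseline that methods such as GPTQ improve on, not GPTQ itself. The function names and the sample row are hypothetical.

```python
def quantize_rtn(weights, bits=3):
    """Symmetric round-to-nearest quantization of one weight row.

    Illustrative baseline only; GPTQ additionally compensates rounding
    error using second-order information, which is not shown here.
    """
    qmax = 2 ** (bits - 1) - 1                 # 3 for signed 3-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed integer range [-(qmax + 1), qmax].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weight row, chosen only for illustration.
row = [0.12, -0.40, 0.33, 0.05, -0.21, 0.60, -0.58, 0.02]
codes, scale = quantize_rtn(row, bits=3)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

With a max-abs scale, the per-weight rounding error is bounded by half the scale, which is why a single outlier in a row can hurt every other weight in it.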

Practical guide to compressing Transformers: quantization, pruning, distillation and efficient architectures

0.75
0.40
0.85
13

Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.

Key finding

8-bit and 6-bit post-training quantization often works well, but extreme low-bit (4-bit or below) frequently degrades performance.

Numbers: Table 2: ViT-B top-1 84.54 (FP) vs 8-bit PTQ 76.98, 6-bit PTQ 75.26/81.65 depending on method

A practical survey of compression and speed tricks to run large language models on limited hardware

0.80
0.50
0.85
13

Compression and better kernels let teams run large LLMs on fewer GPUs or even on single workstations, cutting hosting costs and enabling edge/embedded use cases without losing core capabilities.

Key finding

Quantizing FP32 weights to 4-bit cuts model size to roughly one-eighth.

Numbers: ≈1/8 model size when FP32→INT4
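The 1/8 figure follows directly from the bit widths (4 / 32). A small sketch of the arithmetic, assuming a hypothetical 7B-parameter model and, for the overhead estimate, a hypothetical 16-bit scale stored once per group of 128 weights:

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Weight storage only (no activations or KV cache), in GiB."""
    return num_params * bits_per_weight / 8 / 2 ** 30


def effective_bits(bits, group_size, scale_bits):
    """Bits per weight once a per-group scale is amortized over the group."""
    return bits + scale_bits / group_size


params = 7_000_000_000                 # hypothetical model size
fp32_gib = weight_memory_gib(params, 32)
int4_gib = weight_memory_gib(params, 4)
ratio = int4_gib / fp32_gib            # 4 / 32

# Assumed storage format: one 16-bit scale per 128 weights.
int4_grouped_bits = effective_bits(4, group_size=128, scale_bits=16)
```

Per-group metadata is why real checkpoints land slightly above the ideal 1/8, which fits the "≈" in the reported number.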

Cut big LLMs into smaller ones by pruning plus distillation; same or better accuracy with far less retraining data.

0.80
0.60
0.80
10

If you run multiple model sizes, prune a big pretrained model and distill smaller variants to cut token and compute costs dramatically while keeping or improving accuracy.

Key finding

Pruning plus distillation cuts the training tokens needed for each additional model by up to 40× versus training a model of that size from scratch.

Numbers: Up to 40× fewer tokens to derive 8B/4B (Abstract; Table 2,3)

Initialize the student from the teacher and prune it slowly while distilling to keep predictions close and improve small models

0.60
0.60
0.60
9

HomoDistil produces smaller, better-performing BERT derivatives by pruning from the teacher while distilling; this saves storage and lowers fine-tuning costs while preserving quality—useful when you need compact models with higher accuracy than typical distilled alternatives.

Key finding

HomoBERT-base (65M) improves GLUE average over DistilBERT-style baselines.

Numbers: GLUE avg score 83.8 vs DistilBERT 82.1 on dev

Combine Transformer + knowledge distillation to shrink models while keeping high GLUE accuracy (reported 98.32% Acc)

0.40
0.30
0.60
5

You can shrink model footprint and inference cost while keeping high accuracy by distilling a Transformer into a smaller model, lowering hardware and energy bills for production NLP services.

Key finding

TKD-NLP reports top GLUE numbers among tested models.

Numbers: Acc 98.32%; F1 97.14% on GLUE

Use ChatGPT to teach a smaller model to score answers and explain why

0.60
0.60
0.70
5

AERA turns expensive LLM reasoning into a deployable, smaller model that scores answers and explains decisions, lowering inference costs and improving explainability for education products.

Key finding

Distilled LongT5 (AERA) improves scoring agreement over ChatGPT on evaluated subsets.

Numbers: Overall QWK +11% vs ChatGPT (paper abstract; Table 1)

Combine pruning, distillation and post-training quantization to run a ViT-style segmenter on a 4GB Jetson Nano with small accuracy loss

0.45
0.40
0.65
3

You can run a practical ViT-style segmenter on a $100–$150 Jetson Nano by combining distillation and fp16 quantization, giving near-teacher accuracy while keeping model size and RAM within real device limits.

Key finding

Distillation substantially boosts MobileViT segmentation accuracy.

Numbers: MIoU from 0.5365 to 0.6056 (+0.069)

Train a small LLM for next-item recommendation that matches large LLMs while using ~13% of their parameters and running 6–8× faster.

0.70
0.50
0.80
3

You can shrink LLM-based recommenders to ~13% of original inference size and cut training/inference time by ~6–8× while keeping or slightly improving ranking quality, which reduces hardware cost and increases serving throughput.

Key finding

Many transformer decoder layers are redundant for sequential recommendation.

Distill the planner, not the solver: small models can learn decomposition cheaply and generalize

0.60
0.60
0.75
2

You can offload planning to a cheap local model and keep expensive models for final solving, cutting inference cost while keeping accuracy on reasoning tasks.

Key finding

Distilling only the decomposer preserves or improves two-stage reasoning performance versus a single-stage approach on evaluated benchmarks.

Numbers: GSM8K EM: static two-stage ~65.13 vs single-stage ~20.32 (Table 1)

DistilDP: use a DP-finetuned teacher to generate private synthetic text and distill a compact student without applying DP twice

0.60
0.60
0.60
2

DistilDP lets you produce a smaller, private language model with better utility than privately fine-tuning the small model directly, reducing deployment cost while respecting strong DP budgets.

Key finding

DistilDP substantially reduces perplexity on Big Patent versus private fine-tuning baselines.

Numbers: Big Patent: DistilDP PPL 32.43 vs DP-SGD student 41.8 (−9.37 PPL)

A practical guide to distilling big language models: methods, robustness tests, and domain apps

0.70
0.50
0.80
2

Distillation cuts LLM inference cost and memory while keeping most capabilities; picking the right KD style (white‑box, black‑box, CoT, retrieval‑augmented) matters for accuracy, robustness, and deployment budget.

Key finding

Hint‑based (feature) distillation often transfers richer information and yields higher task accuracy than logits‑only KD.

Numbers: TinyBERT: ~97% GLUE vs BERT; MiniLM: >99% SQuAD/GLUE with 50% Transformer size
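The difference between logits-only KD and hint-based (feature) KD is easiest to see in the loss terms. The sketch below uses generic textbook forms: a temperature-softened KL term for logits and a mean-squared-error term for intermediate features. It is not the exact objective of TinyBERT or MiniLM, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def hint_loss(student_hidden, teacher_hidden):
    """MSE between intermediate features (assumes equal width;
    a learned projection is typically needed when widths differ)."""
    n = len(student_hidden)
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / n


same = logit_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = logit_kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
feature_gap = hint_loss([0.0, 0.0], [1.0, 1.0])
```

The logit term only sees the output distribution, while the hint term constrains internal representations layer by layer, which is one plausible reason feature distillation can transfer more information.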

Scale distillation by token confidence to train ternary-weight generative LMs with <1.0 PPL hit

0.70
0.60
0.80
2

TSLD lets you quantize decoder LMs to 2-bit ternary weights with near full-precision quality and little extra training cost, reducing model size and inference memory while preserving reasoning accuracy.

Key finding

TSLD keeps PPL degradation under 1.0 vs full-precision on evaluated models with ternary weights.

Numbers: OPT-6.7B PTB: FP16 PPL 10.21 → TSLD PPL 11.00 (+0.79)
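For readers unfamiliar with ternary weights, the sketch below maps a weight row to codes in {-1, 0, +1} plus one shared scale. It uses the threshold heuristic from Ternary Weight Networks (0.7 × mean absolute weight) as an assumption. The listing does not say which quantizer TSLD uses; TSLD's contribution is the token-scaled distillation loss, which is not shown here.

```python
def ternarize(weights, threshold_factor=0.7):
    """Ternary quantization: each weight becomes -alpha, 0, or +alpha.

    Threshold rule is the TWN-style heuristic, assumed for illustration.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_factor * mean_abs
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # alpha is the mean magnitude of the weights that were not zeroed.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return codes, alpha


# Hypothetical weight row, chosen only for illustration.
row = [0.9, -0.8, 0.05, -0.02, 0.7, 0.01]
codes, alpha = ternarize(row)
restored = [c * alpha for c in codes]
```

Three levels need about 1.58 bits of information, so ternary codes are commonly stored in 2 bits, matching the "2-bit ternary" wording above.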

Make transformer teachers teach CNN students better by aligning receptive fields and adding prompts

0.60
0.60
0.50
1

If you run face recognition on mobile or edge devices, distilling a high-performing Transformer into a CNN can boost verification accuracy substantially while keeping hardware-friendly inference.

Key finding

Cross-architecture KD with URFM+APT substantially improves large-scale verification.

Numbers: IJB-C TPR@FPR=1e-4: 94.4 (Ours) vs 89.13 (student baseline) +5.27

Factor transformer weight matrices into a small dense basis and sparse per-row coefficients to get stronger compression than low-rank factorization

0.60
0.70
0.70
1

DSFormer reduces transformer model size substantially (2x–3.6x) while keeping accuracy close to original models and can be stacked with distillation/quantization to cut hosting or edge deployment costs further.

Key finding

DSFormer achieves up to ~40% better compression than low-rank factorizers on evaluated tasks.

Numbers: "up to 40% better compression" (Abstract; Experiments)

Distill explicit reward models and use pessimism to stop DPO’s degenerate alignment

0.60
0.60
0.50
1

If you fine-tune assistants from pairwise preferences, distilling explicit reward models (and using small ensembles) reduces brittle failures from biased or sparse preference labels while keeping offline training simple.

Key finding

DPO can converge to degenerate optima that place mass off-training and drive preferred-response likelihoods near zero.

Practical recipes to shrink large LLMs 5–20× and serve them with major latency wins

0.85
0.45
0.80
1

You can run models of near-foundation-model quality in production by distilling then pruning; this cuts serving cost and latency so ranking and generative features scale to real traffic.

Key finding

You can reduce a 100B+ foundation model to a compressed SLM for online serving with modest quality loss.

Numbers: model size reduced >20× (Abstract)

Distillation can 'hack' an imperfect teacher — online data and prompt diversity stop it

0.60
0.60
0.50
1

If you distill models from imperfect teachers, fixed offline distillation can degrade real-world quality; using online or more diverse data keeps smaller models reliable.

Key finding

Teacher hacking appears when distilling on a fixed offline dataset and training for many epochs.

Numbers: Observed U-shaped proxy–golden curve after long runs (50 epochs in experiments).

Use smoothed soft-label distillation during finetuning to reduce LLM hallucinations

0.70
0.50
0.45
1

Smoothing labels with KD can reduce made-up facts in summaries and answers, improving trustworthiness in high-stakes apps while keeping core model accuracy.

Key finding

KD reduces faithfulness hallucination on summarization benchmarks.

Numbers: Llama-2-7B ROUGE-L 28.0 → 28.8; Factual Consistency 86.3% → 87.7%

Distill long-context transformers: cut inference cost ~45–58% while keeping ~90–99% of task accuracy

0.70
0.40
0.80
1

Converting and then distilling long-context transformers cuts inference cost and latency substantially while keeping most accuracy, letting teams serve longer documents cheaper and on smaller hardware.

Key finding

Distilled efficient-attention students retain nearly all accuracy on short-context tasks.

Numbers: Up to 98.6% of teacher performance preserved (short-context GLUE/SQuAD/CoNLL-2003).

Teach vision-language models to reason about user-pointed image regions using an LLM-distilled 1M corpus

0.60
0.60
0.50
1

LSKD lets products accept user-pointed regions (tap/click) instead of long referring text, improving region-level answers and reducing UX friction for multimodal apps while using existing VL architectures.

Key finding

Large localized corpus (machine-generated) improves region-based zero-shot accuracy.

Numbers: VCR Q→AR: 28.0 → 33.4 (+5.4%); Sherlock: 19.5 → 29.7 (+10.2%)

Use self-distillation plus asymmetric sub-4-bit quantization to get practical 2–3 bit LLMs

0.60
0.60
0.85
1

BitDistiller makes deploying 2–3 bit LLMs practical: it keeps much of reasoning/code accuracy while slashing quantization time and GPU cost, enabling cheaper on-prem or edge inference.

Key finding

BitDistiller yields better language modeling and QA accuracy than prior PTQ and QAT on LLaMA-2-7B.

Numbers: 2-bit g128: MMLU 29.25 vs LLM-QAT 23.62 (Table 1)
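The "g128" in the result refers to group-wise quantization with 128 weights per group. Below is a minimal sketch of generic asymmetric (min/max, zero-offset) quantization applied per group. It illustrates the storage scheme only and is not BitDistiller's clipping or self-distillation procedure; all names are hypothetical.

```python
def quantize_asymmetric(group, bits=2):
    """Asymmetric quantization: codes in [0, 2^bits - 1] span [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in group]
    return codes, scale, lo


def dequantize_asymmetric(codes, scale, lo):
    """Reconstruct approximate weights from codes, scale, and offset."""
    return [lo + c * scale for c in codes]


def quantize_groupwise(weights, bits=2, group_size=128):
    """Quantize and dequantize each contiguous group independently."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, lo = quantize_asymmetric(group, bits)
        restored.extend(dequantize_asymmetric(codes, scale, lo))
    return restored


# Hypothetical group of four weights, chosen only for illustration.
group = [-0.1, 0.0, 0.5, 0.8]
codes, scale, lo = quantize_asymmetric(group, bits=2)
approx = dequantize_asymmetric(codes, scale, lo)
roundtrip = quantize_groupwise(group * 2, bits=2, group_size=4)
```

An asymmetric range spends all four 2-bit levels on the values actually present in the group, which matters more as the bit width shrinks and weight distributions are not centered on zero.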

Pruning, distillation, and quantization make a small-data African language model much cheaper with small accuracy trade-offs

0.60
0.20
0.70
1

You can make small-data multilingual models practical on constrained hardware: pruning and quantization cut size and latency substantially while keeping most accuracy, enabling on-device NER for African languages.

Key finding

Pruning can cut parameters by ≈60% while keeping usable accuracy.

Numbers: ≈60% model size reduction; average F1 still competitive at 60% sparsity (see §3.3, Table 8)
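As a rough illustration of what 60% sparsity means, the sketch below applies unstructured magnitude pruning to a weight list, zeroing the smallest 60% of entries by absolute value. The listing does not state which pruning criterion the paper uses, so magnitude pruning is an assumption here, and the names are hypothetical.

```python
def magnitude_prune(weights, sparsity=0.6):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    n_prune = round(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical weights, chosen only for illustration.
layer = [0.05, -0.9, 0.3, -0.02, 0.7, 0.01, -0.4, 0.08, 0.6, -0.15]
sparse_layer = magnitude_prune(layer, sparsity=0.6)
achieved_sparsity = sparse_layer.count(0.0) / len(sparse_layer)
```

Zeroed weights only shrink the stored model or speed up inference when the format or kernel exploits sparsity, so a 60% parameter cut does not automatically mean 60% lower latency.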

Teacher Intervention: use the teacher's signals to make ultra-low-bit QAT converge much faster

0.60
0.50
0.70
0

TI cuts fine-tuning compute and time for ultra-low-bit deployment, letting teams ship memory- and compute-cheaper models faster while retaining accuracy.

Key finding

Blocking error propagation with TI flattens the loss surface and enables stable QAT.

Numbers: Hessian eigenvalue magnitudes reduced with ≈10× fewer iterations