305 papers found

Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

0.80
0.80
0.90
485

QLoRA drastically lowers the hardware cost and complexity of finetuning large LLMs. Teams can build custom chatbots and models on a single consumer or professional GPU, which speeds development, lowers cloud spend, and keeps data in-house.

Key finding

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB

Prune 50–60% of GPT-scale weights in one pass, no retraining, with minor accuracy loss

0.80
0.70
0.90
69

SparseGPT can cut model memory and inference compute roughly in half for massive GPT models, enabling cheaper hosting and faster inference without retraining. Joint sparsity+quantization can match lower-bit storage with better accuracy than pure quantization.

Key finding

Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.

Numbers: 50–60% sparsity; removes ≈100B weights from OPT-175B/BLOOM-176B
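
To make "unstructured sparsity" concrete, here is a minimal sketch of one-shot pruning to a target sparsity. It uses plain magnitude pruning, a simpler stand-in that is not SparseGPT's own method; the function name `magnitude_prune` and the toy weights are illustrative assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a flat weight list."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

# Hypothetical toy weights; a real layer holds millions of values.
toy_weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.4, -0.1, 0.6, 0.03]
pruned = magnitude_prune(toy_weights, sparsity=0.5)
achieved = pruned.count(0.0) / len(pruned)
```

Any weight may be zeroed regardless of its position, which is what distinguishes unstructured sparsity from removing whole rows, heads, or layers.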

IR-QLoRA: raise accuracy of 2–4 bit LoRA-finetuned LLMs by maximizing information in quantized weights

0.80
0.60
0.75
42

IR-QLoRA cuts model size to 2–4 bits while restoring much of the lost accuracy, enabling cheaper inference and on-device deployment with only tiny extra finetune time and small storage overhead.

Key finding

4-bit LLaMA-7B finetuned with IR-QLoRA on Alpaca reaches 40.8% MMLU vs QLoRA 38.4% and QA-LoRA 39.4%

Numbers: MMLU avg 40.8% (IR-QLoRA) vs 38.4% (QLoRA), +2.4pp

Practical, end-to-end guide to fine-tuning LLMs: pipelines, PEFT, RAG, alignment and deployment

0.70
0.35
0.70
39

Fine-tuning and RAG let you customise LLM behavior and accuracy while controlling cost; PEFT and quantisation let you ship tailored models without enterprise-scale GPU fleets.

Key finding

QLoRA compresses model parameters and enables 4-bit fine-tuning while retaining near-16-bit performance.

Numbers: Reduces to ~5.2 bits/parameter (from 96 bits); ~18x memory reduction
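
The bits-per-parameter figures can be checked against the 65B memory numbers in the QLoRA entry with simple arithmetic. The sketch below assumes a 65B-parameter model and decimal gigabytes (10^9 bytes); `footprint_gb` is an illustrative helper, not something from the guide.

```python
PARAMS = 65e9  # assumed: the 65B-parameter model from the QLoRA entry

def footprint_gb(bits_per_param, n_params=PARAMS):
    """Memory in GB (10^9 bytes) for a given per-parameter bit budget."""
    return n_params * bits_per_param / 8 / 1e9

full_gb = footprint_gb(96)    # 96 bits/parameter before compression
qlora_gb = footprint_gb(5.2)  # ~5.2 bits/parameter after compression
ratio = 96 / 5.2              # memory reduction factor
```

Under these assumptions the two budgets come to about 780 GB and about 42 GB, in line with the ">780 GB" and "<48 GB" figures, and the ratio is roughly 18x.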

Practical review of quantization, pruning, distillation and low-rank compression for LLMs

0.60
0.30
0.80
37

Compression cuts model memory, cost and inference latency so LLMs can run on fewer GPUs or at lower cloud cost; pick compression that fits your accuracy budget and hardware.

Key finding

Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.

Numbers: Wikitext-2 perplexity +0.34; speedup 3.24× (GPTQ, Table 1)

Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference.

0.75
0.50
0.80
28

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Key finding

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).
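
The idea of "updating only quantization scales" can be illustrated with a toy example: the integer weights stay frozen and a gradient step touches only the scale. This is a sketch under assumptions (one output channel, a single scalar scale, squared-error loss, hand-picked values), not PEQA's actual training setup.

```python
# Frozen integer weights for one output channel and its trainable scale.
q_row = [3, -2, 7, -5]  # hypothetical 4-bit integers, never updated
scale = 0.10            # per-channel scale: the only trainable value

def forward(x, q, s):
    """y = s * sum(q_i * x_i): dequantize on the fly with a single scale."""
    return s * sum(qi * xi for qi, xi in zip(q, x))

def scale_grad(x, q, s, target):
    """Gradient of 0.5 * (y - target)^2 with respect to the scale s."""
    acc = sum(qi * xi for qi, xi in zip(q, x))
    return (s * acc - target) * acc

x, target, lr = [1.0, 0.5, -1.0, 2.0], -2.0, 0.001
loss_before = 0.5 * (forward(x, q_row, scale) - target) ** 2
for _ in range(50):
    scale -= lr * scale_grad(x, q_row, scale, target)
loss_after = 0.5 * (forward(x, q_row, scale) - target) ** 2
```

Because `q_row` never changes, the model stays in low-bit form, and a task-specific checkpoint only needs to store the scales.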

Lossless 3-bit LLM quantization with dense-and-sparse weights

0.80
0.70
0.80
23

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Key finding

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55

Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

0.70
0.60
0.80
23

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Key finding

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers: 7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Fine-tune LLMs directly in low-bit (INT4/INT3/INT2) and deploy the merged quantized model without accuracy loss

0.70
0.40
0.80
21

QA-LoRA lets teams fine-tune large models on far fewer GPUs, produce merged low-bit models, and deploy faster, cheaper INT4 inference without the accuracy loss from post-quantization.

Key finding

QA-LoRA improves few-shot MMLU accuracy vs QLoRA+GPTQ on LLaMA-7B (Alpaca, 4-bit) in the reported experiments.

Numbers: 5-shot avg: QA-LoRA 39.4% vs QLoRA+GPTQ 36.0% (Table 1)

Cluster activation channels by range, reorder them, then quantize—cuts LLM activation memory up to ~80% while keeping accuracy near FP16

0.70
0.70
0.80
20

RPTQ reduces memory footprint on large language models by up to ~80%, enabling cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers without retraining models.

Key finding

RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.

Numbers: OPT-175b: W4A8 perplexity loss <0.5; W4A4 loss <3 (evaluated datasets)
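
The "cluster by range, reorder, then quantize" recipe can be sketched on toy data. The version below sorts channels by range width and splits them into two groups, a deliberate simplification of RPTQ's clustering; all names and values are illustrative assumptions.

```python
def quantize(values, lo, hi, bits=4):
    """Uniform asymmetric quantize-dequantize of values onto [lo, hi]."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels or 1.0
    return [lo + round((v - lo) / step) * step for v in values]

# Hypothetical activations: one list per channel.
channels = [
    [-0.1, 0.2, 0.05],    # narrow range
    [-40.0, 35.0, 10.0],  # wide range
    [0.3, -0.25, 0.1],    # narrow range
    [50.0, -45.0, 20.0],  # wide range
]

# Reorder channel indices by range width so similar channels sit together.
order = sorted(range(len(channels)),
               key=lambda c: max(channels[c]) - min(channels[c]))
clusters = [order[:2], order[2:]]  # two clusters: narrow, then wide

def mse_grouped(groups):
    """Mean squared error when each group shares one (lo, hi) range."""
    err, n = 0.0, 0
    for group in groups:
        vals = [v for c in group for v in channels[c]]
        deq = quantize(vals, min(vals), max(vals))
        err += sum((a - b) ** 2 for a, b in zip(vals, deq))
        n += len(vals)
    return err / n

per_tensor_mse = mse_grouped([list(range(len(channels)))])
clustered_mse = mse_grouped(clusters)
```

With a single shared range, the wide channels set the step size and the narrow channels lose nearly all resolution; grouping channels with similar ranges avoids that.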

Joint quantization + low-rank init (LoftQ) closes the gap between quantized LLM backbones and full fine-tuning, especially at 2-bit

0.70
0.60
0.70
18

LoftQ reduces model storage and training memory while recovering much of full-fine-tuning quality, enabling practical low-bit deployments with low-cost fine-tuning using LoRA adapters.

Key finding

LoftQ closes the initialization gap and outperforms QLoRA on GLUE MNLI (DeBERTaV3, 2-bit uniform).

Numbers: MNLI-m (matched): LoftQ 88.0 vs QLoRA 79.9 (2-bit, rank 32, Table 2)

Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

0.70
0.60
0.80
15

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Key finding

Fine-tuning on model-generated (data-free) samples with logits distillation outperforms fine-tuning on real data for zero-shot tasks.

Numbers: Generated-data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)
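
Logits distillation trains the quantized student to match the full-precision teacher's output distribution rather than hard labels. A minimal sketch follows, assuming a cross-entropy form of the loss over softmax outputs; the exact loss used in LLM-QAT is not stated in this entry, and the logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits):
    """Cross-entropy of the student against the teacher's soft distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]          # hypothetical teacher logits
close_student = [1.9, 0.6, -1.1]    # student that nearly matches
far_student = [-1.0, 0.5, 2.0]      # student that ranks tokens backwards
loss_close = distill_loss(teacher, close_student)
loss_far = distill_loss(teacher, far_student)
```

The loss is smallest when the student reproduces the teacher's distribution, so no ground-truth labels are needed, which is what makes training on generated text possible.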

ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

0.60
0.50
0.70
15

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Key finding

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% layers removed)

OmniQuant: learnable clipping and equivalent transforms give PTQ QAT-like quality for very low-bit LLM quantization

0.80
0.60
0.80
13

OmniQuant lets teams quantize large models to very low-bit formats with PTQ-level data and time budgets, cutting weight storage and often doubling throughput while keeping runtime identical to standard quantized models.

Key finding

OmniQuant turns catastrophic W2A16 degradation into usable models.

Numbers: LLaMA-13B W2A16 perplexity 13.21 vs GPTQ 3832 (paper text)

Practical guide to compressing Transformers: quantization, pruning, distillation and efficient architectures

0.75
0.40
0.85
13

Compression makes large Transformers affordable to run and store: use post-training quantization and structured pruning for immediate cost and latency gains without full retraining.

Key finding

8-bit and 6-bit post-training quantization often works well, but extreme low-bit (4-bit or below) frequently degrades performance.

Numbers: Table 2: ViT-B top-1 84.54 (FP) vs 8-bit PTQ 76.98, 6-bit PTQ 75.26/81.65 depending on method

A practical survey of compression and speed tricks to run large language models on limited hardware

0.80
0.50
0.85
13

Compression and better kernels let teams run large LLMs on fewer GPUs or even on single workstations, cutting hosting costs and enabling edge/embedded use cases without losing core capabilities.

Key finding

Quantizing FP32 weights to 4-bit cuts model size roughly to one-eighth.

Numbers: ≈1/8 model size when FP32→INT4
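
The one-eighth figure follows from storing two 4-bit codes per byte instead of one 4-byte float per weight. The sketch below assumes symmetric absmax quantization with a single per-tensor scale, and it ignores the small overhead of storing that scale; the helper names are illustrative.

```python
import struct

def quantize_int4(weights):
    """Symmetric absmax quantization to integers in [-7, 7] plus one scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [max(-7, min(7, round(w / scale))) for w in weights], scale

def pack_int4(codes):
    """Pack two signed 4-bit codes per byte (low nibble first)."""
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed codes from two's-complement nibbles."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out

weights = [0.7, -0.35, 0.1, -0.7, 0.2, 0.0, -0.1, 0.5]  # hypothetical values
codes, scale = quantize_int4(weights)
packed = pack_int4(codes)
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes each
```

Eight weights take 32 bytes as FP32 and 4 bytes when packed, and `unpack_int4(packed)` returns the original codes. Real formats add per-group scales, so the saving is slightly less than 8x in practice.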

How low-bit quantization changes LLaMA3 and a LLaVA MLLM

0.60
0.30
0.70
11

4-bit quantization gives big memory and cost savings with small accuracy loss; ultra-low bits (≤2) are risky for multimodal products and need more work.

Key finding

4-bit post-training quantization keeps quality close to full precision on many tasks

Numbers: ≈2% average drop vs. FP16 on evaluated benchmarks

Compress KV cache to sub-4-bit with <0.1 PPL loss and enable million‑to‑10M token inference

0.70
0.70
0.80
9

Cut KV cache memory 3–7× and preserve accuracy so you can serve much longer contexts on existing GPUs, reducing infrastructure cost or enabling new long-document features.

Key finding

3-bit KV cache with 1% sparse outliers keeps perplexity near fp16 on Wikitext-2

Numbers: LLaMA-7B PPL 5.75 vs fp16 5.68 (+0.07)
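
Keeping a small fraction of outliers in full precision stops a few large values from stretching the quantization grid for everything else. The sketch below assumes a plain uniform 3-bit quantizer and picks outliers by magnitude, a simplification of the approach summarized here; the data and function names are hypothetical.

```python
import math

def quant_dequant(vals, bits=3):
    """Uniform quantize-dequantize over the values' own min/max range."""
    lo, hi = min(vals), max(vals)
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((v - lo) / step) * step for v in vals]

def dense_sparse(vals, outlier_frac=0.01, bits=3):
    """Keep the largest-magnitude fraction exactly; quantize the rest."""
    k = max(1, int(len(vals) * outlier_frac))
    outliers = set(sorted(range(len(vals)), key=lambda i: -abs(vals[i]))[:k])
    dense_idx = [i for i in range(len(vals)) if i not in outliers]
    deq = quant_dequant([vals[i] for i in dense_idx], bits)
    out = list(vals)  # outliers stay at full precision
    for i, v in zip(dense_idx, deq):
        out[i] = v
    return out, outliers

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy cache entries: small values plus two large outliers (hypothetical data).
cache = [math.sin(i) for i in range(200)]
cache[17], cache[120] = 25.0, -30.0

plain_err = mse(cache, quant_dequant(cache))
mixed, kept = dense_sparse(cache)
mixed_err = mse(cache, mixed)
```

With 200 entries, 1% means two values are stored sparsely at full precision; the remaining 198 share a much tighter 3-bit grid, so the reconstruction error drops sharply.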

Keep a few sensitive weight columns in high precision, quantize the rest to reach ~3 bits with near-4-bit quality and tiny overhead

0.80
0.65
0.82
8

OWQ reduces model storage and keeps accuracy with only tiny runtime and storage overheads, enabling deployment of very large LLMs on fewer GPUs and cheaper hardware.

Key finding

Keeping a small set of sensitive columns in fp16 yields large quality gains over uniform 3-bit quantization.

Numbers: OPT-6.7B WikiText-2: OPTQ 3-bit PPL 12.88 → OWQ 3.01-bit PPL 11.21

Use FP8 activations and FP4 weights to keep LLM quality while cutting memory and using H100 FP support

0.70
0.60
0.70
8

Switching activations to FP8 and weights to FP4 can cut memory and exploit H100 FP8 hardware while keeping model quality—good for deploying large LLMs on constrained inference servers.

Key finding

FP8 activations beat INT8 activations in perplexity across models, with larger wins for larger models.

Numbers: LLaMA-7b W8A8: PPL 10.63 (INT) → 10.38 (FP); drop 0.25

Compress LLaMA-2 7B to 2.1GB (70% fewer params) with 25% faster inference and ~2–3% accuracy drop

0.60
0.70
0.80
8

CompactifAI can cut model storage and runtime costs, enabling on-prem or cheaper-cloud LLM deployment with modest accuracy trade-offs for many tasks.

Key finding

Memory reduced from 27.1 GB to 2.1 GB (93% reduction) on LLaMA‑2 7B using tensorization plus quantization.

Numbers: 27.1 GB → 2.1 GB (93% reduction)

Microscaling (MX): block-level scales let you run and train models at sub-8-bit with minimal accuracy loss

0.80
0.70
0.80
8

Microscaling cuts memory and compute by moving to narrow, block-scaled formats while keeping model quality close to FP32, enabling cheaper inference and denser training without reengineering training recipes.

Key finding

MXINT8 closely matches FP32 for direct-cast inference across many models.

Numbers: GPT3 ARC easy: FP32 0.744 → MXINT8 0.740 (∆ −0.004)
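
Block-level scaling can be sketched as follows: each small block of values shares one scale, and each element is stored as a narrow integer. The block size of 32, the power-of-two scale, and the simplified integer encoding are assumptions made for illustration; this is not a bit-exact implementation of the MX formats.

```python
import math

BLOCK = 32  # assumed block size

def mx_quantize_block(block, bits=8):
    """Share one power-of-two scale across a block; store one int per element."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in block)
    # Smallest power-of-two scale that keeps every element within [-qmax, qmax].
    exp = math.ceil(math.log2(amax / qmax)) if amax > 0 else 0
    scale = 2.0 ** exp
    return exp, [round(v / scale) for v in block]

def mx_dequantize_block(exp, ints):
    return [i * 2.0 ** exp for i in ints]

def mx_roundtrip(values):
    """Quantize and dequantize a flat list block by block."""
    out = []
    for start in range(0, len(values), BLOCK):
        exp, ints = mx_quantize_block(values[start:start + BLOCK])
        out.extend(mx_dequantize_block(exp, ints))
    return out

# Two blocks with very different magnitudes (hypothetical data).
data = ([0.001 * math.sin(i) for i in range(32)]
        + [100.0 * math.cos(i) for i in range(32)])
restored = mx_roundtrip(data)
max_rel_err_small = max(abs(a - b)
                        for a, b in zip(data[:32], restored[:32])) / 0.001
```

Because each block gets its own scale, the block of tiny values keeps its resolution even though the neighboring block is five orders of magnitude larger; a single per-tensor scale would round the small block to zero.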

ZipLM: inference-aware structured pruning that gives runtime speedup guarantees across devices

0.80
0.40
0.80
7

ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.

Key finding

ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.

Numbers: ≈ +3 F1 points vs CoFi at same speedup (SQuAD dev)

INT4 (4-bit) gives big latency wins for encoder models with little accuracy loss, but breaks decoder-only generators; optimized INT4 kernels

0.70
0.40
0.80
6

On Ampere GPUs, INT4 computation can sharply reduce latency and cost for encoder-based workloads (search, classification, embedding). But it is risky to use for autoregressive generation (chatbots, text generation) until activation-quantization problems are solved.

Key finding

Encoder models (BERT) keep accuracy under W4A4 QAT+KD.

Numbers: BERT-base MNLI 84.20 (FP32) → 84.31 (W4A4 symmetric)