1,042 papers found

Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

0.80
0.80
0.90
485

QLoRA drastically lowers the hardware cost and complexity of finetuning large LLMs, letting teams build custom chatbots and models on a single consumer or professional GPU, which speeds development, lowers cloud spend, and protects data privacy.

Key finding

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB

MLA + DeepSeekMoE: a 236B MoE LLM with 21B active params, 128K context, 42.5% lower training cost

0.70
0.70
0.80
97

DeepSeek-V2 shows you can run a very large-capacity model but only activate ~21B params per token, cutting training GPU-hours and inference memory. That lowers operational cost and lets you serve longer contexts or larger batches on the same hardware.

Key finding

DeepSeek-V2 activates far fewer parameters per token while keeping strong accuracy.

Numbers: 236B total / 21B activated params

DeepSeek: scaling recipes and a 2T‑token bilingual pretraining run that yields 7B and 67B models competitive on code, math, and chat

0.70
0.60
0.70
82

The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.

Key finding

Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.

Numbers: near‑optimal region defined as ≤0.25% above min loss; fitted across 1e17–2e19 FLOPs
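
The power-law relations mentioned above can be illustrated with a least-squares fit in log-log space. This is a minimal sketch, assuming relations of the form y = a * C^b. The helper names `fit_power_law` and `predict` are hypothetical, and the sweep data and coefficients below are synthetic placeholders, not values from the paper.

```python
import numpy as np

def fit_power_law(compute, values):
    """Fit values ~ a * compute**b by linear regression on log-transformed data."""
    slope, intercept = np.polyfit(np.log(compute), np.log(values), 1)
    return float(np.exp(intercept)), float(slope)

def predict(compute, a, b):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** b

# Synthetic (made-up) sweep results spanning 1e17 to 2e19 FLOPs.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19, 2e19])
best_batch = 0.5 * compute ** 0.30   # hypothetical: optimal batch size grows with compute
best_lr = 0.4 * compute ** -0.12     # hypothetical: optimal learning rate falls with compute

a_batch, b_batch = fit_power_law(compute, best_batch)
a_lr, b_lr = fit_power_law(compute, best_lr)
```

With real sweep data, `predict` could then be used to extrapolate a starting batch size and learning rate for a larger budget, subject to the usual caution about extrapolating beyond the fitted range.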

Task‑agnostic structured pruning for LLMs that cuts params and memory, then recovers in hours with 50K samples

0.70
0.60
0.70
73

You can cut ~20% of an LLM's parameters to save memory and speed up inference without losing most zero‑shot ability, then re-tune it in hours on one GPU with modest public data, which is much cheaper than retraining or distillation.

Key finding

20% parameter pruning can be recovered to near-original zero-shot accuracy with limited tuning

Numbers: Pruned 20% → average accuracy 60.07; baseline 63.25; 94.97% retained

Prune 50–60% of GPT-scale weights in one pass, no retraining, with minor accuracy loss

0.80
0.70
0.90
69

SparseGPT can cut model memory and inference compute roughly in half for massive GPT models, enabling cheaper hosting and faster inference without retraining. Joint sparsity+quantization can match lower-bit storage with better accuracy than pure quantization.

Key finding

Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.

Numbers: 50–60% sparsity; removes ≈100B weights from OPT-175B/BLOOM-176B

Wanda: prune LLM weights by weight magnitude × input-activation norm — no retraining, much faster than prior LLM pruning

0.70
0.60
0.80
52

Wanda makes one-shot LLM pruning cheap and practical: you can cut ~50% of parameters without retraining and with minimal calibration data, saving memory and possibly inference cost while keeping most model quality.

Key finding

Wanda reaches far lower perplexity than magnitude pruning on LLaMA-7B at 50% sparsity

Numbers: Perplexity 7.26 (Wanda) vs 17.29 (magnitude) on WikiText (LLaMA-7B, 50%)
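
The pruning score named in the title (weight magnitude × input-activation norm) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes the score is compared within each output row and that the activation norm is an L2 norm over calibration tokens; the function name `wanda_prune` and the random data are hypothetical.

```python
import numpy as np

def wanda_prune(weight, calib_inputs, sparsity=0.5):
    """Zero out the lowest-scoring weights in each output row.

    weight:       (out_features, in_features) layer weights
    calib_inputs: (n_tokens, in_features) calibration activations fed to the layer
    """
    # Per-input-channel L2 norm of the activations (assumed norm choice).
    act_norm = np.linalg.norm(calib_inputs, axis=0)
    # Score = |weight| * activation norm, broadcast across output rows.
    score = np.abs(weight) * act_norm
    n_prune = int(weight.shape[1] * sparsity)
    # Indices of the n_prune lowest scores in each row.
    prune_idx = np.argsort(score, axis=1)[:, :n_prune]
    mask = np.ones(weight.shape, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return weight * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # toy layer, random values
X = rng.normal(size=(32, 8))   # toy calibration batch, random values
W_pruned, mask = wanda_prune(W, X, sparsity=0.5)
```

No gradient step or weight update follows the masking, which is what makes this kind of one-shot pruning cheap compared with methods that retrain.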

IR-QLoRA: raise accuracy of 2–4 bit LoRA-finetuned LLMs by maximizing information in quantized weights

0.80
0.60
0.75
42

IR-QLoRA cuts model size to 2–4 bits while restoring much of the lost accuracy, enabling cheaper inference and on-device deployment with only tiny extra finetune time and small storage overhead.

Key finding

4-bit LLaMA-7B finetuned with IR-QLoRA on Alpaca reaches 40.8% MMLU vs QLoRA 38.4% and QA-LoRA 39.4%

Numbers: MMLU avg 40.8% (IR-QLoRA) vs 38.4% (QLoRA), +2.4pp

Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

0.30
0.80
0.70
40

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Key finding

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K, 16bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB
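
The KV-cache figures above follow from simple arithmetic over the attention layers. The sketch below is a back-of-the-envelope estimate. The layer counts, KV-head counts, and head dimension are assumptions taken from commonly cited model configurations, not from this summary, and "256K" is read as 262,144 tokens with sizes expressed in GiB.

```python
def kv_cache_gib(attn_layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Size of the key/value cache in GiB for one sequence."""
    # Factor 2 accounts for storing both keys and values.
    total_bytes = 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30

TOKENS = 256 * 1024  # "256K" context, assumed to mean 262,144 tokens

# (attention layers, KV heads, head dim): assumed configurations.
configs = {
    "Llama-2 7B": (32, 32, 128),  # full multi-head attention in every layer
    "Mixtral": (32, 8, 128),      # grouped-query attention, 8 KV heads
    "Jamba": (4, 8, 128),         # only 4 attention layers; the rest are Mamba
}

sizes = {name: kv_cache_gib(*cfg, TOKENS) for name, cfg in configs.items()}
```

Under these assumptions the estimate gives 128, 32, and 4 GiB respectively, which suggests the saving comes from two separate factors: fewer KV heads per layer, and far fewer layers that keep a KV cache at all.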

Practical, end-to-end guide to fine-tuning LLMs: pipelines, PEFT, RAG, alignment and deployment

0.70
0.35
0.70
39

Fine-tuning and RAG let you customise LLM behavior and accuracy while controlling cost; PEFT and quantisation let you ship tailored models without enterprise-scale GPU fleets.

Key finding

QLoRA compresses model parameters and enables 4-bit fine-tuning while retaining near-16-bit performance.

Numbers: Reduces to ~5.2 bits/parameter (from 96 bits); ~18x memory reduction
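
The per-parameter figures above can be turned into a rough memory estimate. This sketch applies them to a 65B-parameter model, the size cited for QLoRA elsewhere in this listing. It treats 96 and 5.2 bits as average per-parameter costs and ignores activations and other overheads, so it is an approximation rather than a measurement; `model_memory_gb` is a hypothetical helper.

```python
def model_memory_gb(n_params, bits_per_param):
    """Memory in GB (10^9 bytes) for n_params at an average bits-per-parameter cost."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 65e9  # 65B-parameter model

full_ft_gb = model_memory_gb(N_PARAMS, 96)   # about 780 GB
qlora_gb = model_memory_gb(N_PARAMS, 5.2)    # about 42 GB, under a 48 GB budget
reduction = full_ft_gb / qlora_gb            # about 18x
```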

Practical review of quantization, pruning, distillation and low-rank compression for LLMs

0.60
0.30
0.80
37

Compression cuts model memory, cost and inference latency so LLMs can run on fewer GPUs or at lower cloud cost; pick compression that fits your accuracy budget and hardware.

Key finding

Weight-only post-training quantization (e.g., GPTQ) can reduce weight precision to 3 bits with small accuracy loss and large runtime gains.

Numbers: Wikitext-2 perp +0.34; speedup 3.24× (GPTQ, Table 1)

Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference

0.75
0.50
0.80
28

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Key finding

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).

One-stage domain adaptation: turn varied medical corpora into instruction–response pairs and train in a single pass

0.60
0.60
0.50
24

One-stage adaptation simplifies pipelines and reduces costly two-stage tuning while delivering strong domain performance—so teams can build competitive medical models faster with less stage-specific hyperparameter work.

Key finding

One-stage training outperforms conventional two-stage adaptation across medical datasets

Numbers: 5.3%–23% relative gains on six datasets (one-stage vs two-stage)

Near-lossless 3-bit LLM quantization with dense-and-sparse weights

0.80
0.70
0.80
23

SqueezeLLM cuts model storage and single‑request latency by ~2× while keeping near‑FP16 quality, enabling cheaper and faster on‑prem or cloud inference for generative LLMs.

Key finding

3‑bit dense SqueezeLLM on LLaMA‑7B achieves perplexity 7.75 on C4 versus FP16 7.08 and GPTQ 9.55.

Numbers: LLaMA‑7B (3‑bit): SqueezeLLM PPL 7.75, FP16 7.08, GPTQ 9.55

Atom: 4-bit weight+activation quantization that boosts LLM serving throughput up to 7.7× with minimal accuracy loss

0.70
0.60
0.80
23

Atom can multiply token throughput per GPU and shrink KV-cache memory, lowering cloud GPU costs or increasing capacity without major task-accuracy loss.

Key finding

Atom increases end-to-end serving throughput up to 7.73× vs FP16 and 2.53× vs INT8 under similar latency targets.

Numbers: 7.73× vs FP16; 2.53× vs INT8 (paper Figure 10 / §5.3.2)

Fine-tune LLMs directly in low-bit (INT4/INT3/INT2) and deploy the merged quantized model without accuracy loss

0.70
0.40
0.80
21

QA-LoRA lets teams fine-tune large models on far fewer GPUs, produce merged low-bit models, and deploy faster, cheaper INT4 inference without the accuracy loss from post-quantization.

Key finding

QA-LoRA improves few-shot MMLU accuracy vs QLoRA+GPTQ on LLaMA-7B (Alpaca, 4-bit) in the reported experiments.

Numbers: 5-shot avg: QA-LoRA 39.4% vs QLoRA+GPTQ 36.0% (Table 1)

A better visual tokenizer lets language models match or beat diffusion models on ImageNet and video tasks

0.60
0.70
0.60
21

A better visual tokenizer can make language-model pipelines produce higher-quality images/videos with fewer inference steps and offer a new compressed token format that speeds downstream generation and saves bandwidth.

Key finding

On ImageNet 512×512 class-conditional generation, MLM + MAGVIT-v2 achieved FID 1.91 with guidance versus diffusion baseline VDM++ FID 2.65.

Numbers: FID 1.91 vs 2.65 (512×512); 28% relative improvement

Cluster activation channels by range, reorder them, then quantize—cuts LLM activation memory up to ~80% while keeping accuracy near FP16

0.70
0.70
0.80
20

RPTQ reduces memory footprint on large language models by up to ~80%, enabling cheaper hosting, fewer GPUs per deployment, and longer context lengths for customers without retraining models.

Key finding

RPTQ enables low-bit activation quantization with small accuracy loss on large OPT models.

Numbers: OPT-175B: W4A8 perplexity loss <0.5; W4A4 loss <3 (evaluated datasets)

Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

0.70
0.50
0.80
20

Combine instruction tuning with MoE to cut runtime compute and cost: MoE models can match or beat dense baselines while using far fewer per-token FLOPs, reducing inference cost without sacrificing accuracy on many English tasks.

Key finding

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

Joint quantization + low-rank init (LoftQ) closes the gap between quantized LLM backbones and full fine-tuning, especially at 2-bit

0.70
0.60
0.70
18

LoftQ reduces model storage and training memory while recovering much of full-fine-tuning quality, enabling practical low-bit deployments with low-cost fine-tuning using LoRA adapters.

Key finding

LoftQ closes the initialization gap and outperforms QLoRA on GLUE MNLI (DeBERTaV3, 2-bit uniform).

Numbers: MNLI matched-m: LoftQ 88.0 vs QLoRA 79.9 (2-bit, rank32, Table 2)

Train quantized LLMs without original data and quantize KV cache to reach practical 4-bit weights

0.70
0.60
0.80
15

LLM-QAT lets you reduce model memory and improve throughput by quantizing weights and KV cache to 4 bits while keeping quality close to full precision, which can lower hosting cost and enable longer contexts on the same hardware.

Key finding

Fine-tuning on model-generated (data-free) samples with logit distillation outperforms fine-tuning on real data for zero-shot tasks.

Numbers: Generated-data (hybrid sampling) avg zero-shot 63.1 vs C4 61.5 (Table 3)

ShortGPT: remove low-impact layers to cut ~25% size while keeping ≈90% of performance

0.60
0.50
0.70
15

ShortGPT offers a low-effort way to cut model size and inference cost by removing low-impact layers; it preserves most classification-style performance and stacks with quantization to save more compute.

Key finding

Removing 10 out of 40 layers (25%) from LLaMA2-13B reduced MMLU from 55.0 to 52.2.

Numbers: MMLU 55.0 -> 52.2 (25% layers removed)
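
One way to rank layers by impact is to measure how much each layer changes its hidden states. The sketch below scores a layer as one minus the mean cosine similarity between its input and output, then removes the lowest-scoring layers. That scoring rule is an assumption about how "low-impact" might be measured, not a confirmed description of the paper's method; the function names and toy data are hypothetical.

```python
import numpy as np

def block_influence(h_in, h_out):
    """1 - mean cosine similarity between a layer's input and output, each (n_tokens, d)."""
    dot = np.sum(h_in * h_out, axis=1)
    norms = np.linalg.norm(h_in, axis=1) * np.linalg.norm(h_out, axis=1)
    return float(1.0 - np.mean(dot / norms))

def layers_to_remove(hidden_states, n_remove):
    """hidden_states[i] is the input to layer i and hidden_states[i + 1] its output."""
    scores = [block_influence(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    order = np.argsort(scores)  # lowest influence first
    return sorted(int(i) for i in order[:n_remove]), scores

# Toy hidden states for a 3-layer model (random data, for illustration only).
rng = np.random.default_rng(0)
h0 = rng.normal(size=(16, 8))
h1 = h0 * 1.5                               # layer 0 only rescales: influence near 0
h2 = rng.normal(size=(16, 8))               # layer 1 changes direction substantially
h3 = h2 + 0.01 * rng.normal(size=(16, 8))   # layer 2 is close to identity
removed, scores = layers_to_remove([h0, h1, h2, h3], n_remove=2)
```

In this toy setup layers 0 and 2 are selected for removal because they barely rotate the hidden states, while layer 1 is kept.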

ReWOO separates planning from fetching evidence to cut repeated prompt tokens and run smaller models

0.70
0.60
0.80
15

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Key finding

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)

Monarch Mixer: replace attention and MLPs with sub-quadratic GEMM-friendly layers to speed long-context models

0.50
0.70
0.70
14

If you run models with long contexts or want lower parameter cost, M2 can cut compute or model size and improve throughput on many GPUs while keeping accuracy; expect implementation and kernel work before production parity on all hardware.

Key finding

M2-BERT matches BERT-base GLUE while cutting parameters.

Numbers: GLUE 79.9 vs 79.6; −27% params (M2 80M vs BERT 110M)

OmniQuant: learnable clipping and equivalent transforms give PTQ QAT-like quality for very low-bit LLM quantization

0.80
0.60
0.80
13

OmniQuant lets teams quantize large models to very low-bit formats with PTQ-level data and time budgets, cutting weight storage and often doubling throughput while keeping runtime identical to standard quantized models.

Key finding

OmniQuant turns catastrophic W2A16 degradation into usable models.

Numbers: LLaMA-13B W2A16 perplexity 13.21 vs GPTQ 3832 (paper text)