143 papers found

Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

0.80
0.80
0.90
485

QLoRA drastically lowers hardware cost and complexity for finetuning large LLMs, enabling teams to build custom chatbots and models on single consumer or pro GPUs and therefore speed development, lower cloud spend, and protect data privacy.

Key finding

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB
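The headline numbers can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes 12 bytes per parameter for ordinary 16-bit finetuning (2 for weights, 2 for gradients, 8 for Adam optimizer states) and 0.5 bytes per parameter for 4-bit base weights. These byte counts are illustrative assumptions, not figures taken from the paper, and activations, adapters and quantization constants are ignored.

```python
# Back-of-envelope memory estimate; the bytes-per-parameter values are assumptions.
PARAMS = 65e9  # a 65B-parameter model


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Return memory in decimal gigabytes for a given per-parameter cost."""
    return params * bytes_per_param / 1e9


# Assumed 16-bit finetuning cost: 2 B weights + 2 B gradients + 8 B Adam states.
full_finetune_gb = weights_gb(PARAMS, 2 + 2 + 8)

# 4-bit frozen base weights only; adapters and activations come on top of this.
four_bit_weights_gb = weights_gb(PARAMS, 0.5)

print(f"full finetune: about {full_finetune_gb:.0f} GB")
print(f"4-bit weights: about {four_bit_weights_gb:.1f} GB")
```

Under these assumptions the full-finetune estimate lands at the same order as the reported >780 GB, and the 4-bit base weights leave headroom under 48 GB for adapters and activations.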

DeepSeek: scaling recipes and a 2T‑token bilingual pretraining run that yields 7B and 67B models competitive on code, math, and chat

0.70
0.60
0.70
82

The paper gives practical scaling recipes and hyperparameter fits so teams can plan compute, model size, and data investments more predictably; it shows a 67B open model can match or beat larger baselines on code/math when paired with curated bilingual data and alignment.

Key finding

Optimal batch size grows and optimal learning rate falls with compute; fitted power‑law relations give near‑optimal hyperparameters across budgets.

Numbers: near‑optimal region defined as ≤0.25% above min loss; fitted across 1e17–2e19 FLOPs
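A power law of the form B = a * C^b is a straight line in log-log space, so fitting it reduces to linear regression. The sketch below uses synthetic sweep data with made-up coefficients (`true_coeff`, `true_exp` are hypothetical, not the paper's fitted values) purely to show the fitting step. The same approach would apply to learning rate, where the fitted exponent would come out negative.

```python
import numpy as np

# Synthetic sweep: compute budgets (FLOPs) and the best batch size found at each.
# The generating coefficients below are hypothetical, not the paper's fit.
rng = np.random.default_rng(0)
compute = np.logspace(17, np.log10(2e19), 8)
true_coeff, true_exp = 0.3, 0.33
noise = np.exp(rng.normal(0.0, 0.02, compute.size))
best_batch = true_coeff * compute**true_exp * noise

# Fit log(B) = log(a) + b * log(C) with ordinary least squares.
slope, intercept = np.polyfit(np.log(compute), np.log(best_batch), deg=1)
a, b = np.exp(intercept), slope


def predict_batch(c: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return a * c**b


print(f"fitted exponent b = {b:.3f}")
```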

IR-QLoRA: raise accuracy of 2–4 bit LoRA-finetuned LLMs by maximizing information in quantized weights

0.80
0.60
0.75
42

IR-QLoRA cuts model size to 2–4 bits while restoring much of the lost accuracy, enabling cheaper inference and on-device deployment with only tiny extra finetune time and small storage overhead.

Key finding

4-bit LLaMA-7B finetuned with IR-QLoRA on Alpaca reaches 40.8% MMLU vs QLoRA 38.4% and QA-LoRA 39.4%

Numbers: MMLU avg 40.8% (IR-QLoRA) vs 38.4% (QLoRA), +2.4pp

Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference.

0.75
0.50
0.80
28

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Key finding

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).
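The idea of tuning only quantization scales can be illustrated with a minimal NumPy sketch. It assumes per-row asymmetric uniform quantization, and the function names and the 5% scale change are hypothetical stand-ins for a learned update. It is not the paper's implementation.

```python
import numpy as np


def quantize_rows(w, bits=4):
    """Asymmetric uniform quantization with one scale and zero-point per row."""
    qmax = 2**bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero


def dequantize(q, scale, zero):
    """Reconstruct floating-point weights from integers, scales and zero-points."""
    return scale * (q.astype(np.float32) - zero)


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64)).astype(np.float32)
q, scale, zero = quantize_rows(w)

# Task adaptation touches only the scales; the integer matrix q stays frozen.
task_scale = scale * 1.05  # stand-in for a learned, task-specific update
w_task = dequantize(q, task_scale, zero)

trainable, frozen = task_scale.size, q.size
print(f"trainable scales: {trainable}, frozen integer weights: {frozen}")
```

Because only `task_scale` differs between tasks, switching tasks means swapping one small vector per layer while the low-bit integer weights are shared.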

Fine-tune LLMs directly in low-bit (INT4/INT3/INT2) and deploy the merged quantized model without accuracy loss

0.70
0.40
0.80
21

QA-LoRA lets teams fine-tune large models on far fewer GPUs, produce merged low-bit models, and deploy faster, cheaper INT4 inference without the accuracy loss from post-quantization.

Key finding

QA-LoRA improves few-shot MMLU accuracy vs QLoRA+GPTQ on LLaMA-7B (Alpaca, 4-bit) in the reported experiments.

Numbers: 5-shot avg: QA-LoRA 39.4% vs QLoRA+GPTQ 36.0% (Table 1)

Fine-tune a vision+LLM for medical VQA and report writing using LoRA and one projection layer

0.50
0.60
0.70
11

You can adapt large vision+LLM models to medical VQA and report generation cheaply by tuning small adapter layers; use GPT-4 for scalable semantic QA evaluation instead of brittle lexical metrics.

Key finding

PEFT keeps trainable footprint tiny: only projection + LoRA updated.

Numbers: 56.63M trainable vs 7B full LLM

FS-LLM: an open toolbox to run, benchmark and speed up federated fine‑tuning of LLMs

0.60
0.50
0.70
9

FS-LLM lets organizations co‑train LLMs across private data while cutting bandwidth and memory needs using PEFT and resource operators; this reduces cost and preserves IP when the full model must stay closed.

Key finding

LoRA gives the strongest PEFT results across domains in FL.

Numbers: Fed LLaMA-7B: LoRA 13.29% vs P-tuning 9.71% and prompt 9.63% on Fed-CodeAlpaca (Pass@1)
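Federated fine-tuning with LoRA typically exchanges only adapter tensors, not the full model. The sketch below shows a plain weighted average (FedAvg-style) over client adapters. The function name and dictionary layout are hypothetical and are not FS-LLM's API.

```python
import numpy as np


def fedavg_adapters(client_adapters, client_sizes):
    """Average adapter tensors across clients, weighted by local dataset size.

    client_adapters: list of dicts mapping parameter name -> ndarray
    client_sizes: list of local example counts, one per client
    """
    total = float(sum(client_sizes))
    return {
        name: sum(
            (n / total) * adapters[name]
            for adapters, n in zip(client_adapters, client_sizes)
        )
        for name in client_adapters[0]
    }


clients = [
    {"lora_A": np.ones((2, 4)), "lora_B": np.ones((4, 2))},
    {"lora_A": 3 * np.ones((2, 4)), "lora_B": 3 * np.ones((4, 2))},
]
merged = fedavg_adapters(clients, client_sizes=[1, 3])
print(merged["lora_A"][0, 0])
```

Note that averaging the two LoRA factors separately is not the same as averaging their product, which is one reason aggregation strategy is a design choice in federated PEFT.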

Compress LLaMA-2 7B to 2.1GB (70% fewer params) with 25% faster inference and ~2–3% accuracy drop

0.60
0.70
0.80
8

CompactifAI can cut model storage and runtime costs, enabling on-prem or cheaper-cloud LLM deployment with modest accuracy trade-offs for many tasks.

Key finding

Memory reduced from 27.1 GB to 2.1 GB (93% reduction) on LLaMA‑2 7B using tensorization plus quantization.

Numbers: 27.1 GB → 2.1 GB (93% reduction)

Microscaling (MX): block-level scales let you run and train models at sub-8-bit with minimal accuracy loss

0.80
0.70
0.80
8

Microscaling cuts memory and compute by moving to narrow, block-scaled formats while keeping model quality close to FP32, enabling cheaper inference and denser training without reengineering training recipes.

Key finding

MXINT8 closely matches FP32 for direct-cast inference across many models.

Numbers: GPT3 ARC easy: FP32 0.744 → MXINT8 0.740 (∆ −0.004)
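Block-level scaling can be sketched in a few lines: each block of values shares one power-of-two scale, and the elements are stored as narrow integers. The version below is a simplified illustration that assumes a block size of 32 and int8 elements. It does not reproduce the exact MX format definitions.

```python
import numpy as np

BLOCK = 32  # number of elements sharing one scale (assumed block size)


def mx_quantize(x):
    """Quantize a 1-D array to int8 elements with one power-of-two scale per block."""
    blocks = x.reshape(-1, BLOCK)
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # avoid log2(0) for all-zero blocks
    # Smallest power-of-two scale such that max_abs / scale fits in [-127, 127].
    exp = np.ceil(np.log2(max_abs / 127.0))
    q = np.clip(np.round(blocks / 2.0**exp), -127, 127).astype(np.int8)
    return q, exp.astype(np.int8)


def mx_dequantize(q, exp):
    """Rebuild approximate values from int8 elements and per-block exponents."""
    return (q.astype(np.float32) * 2.0 ** exp.astype(np.float32)).reshape(-1)


rng = np.random.default_rng(0)
x = rng.normal(size=256)
q, exp = mx_quantize(x)
x_hat = mx_dequantize(q, exp)
max_err = np.max(np.abs(x - x_hat))
print(f"max abs error: {max_err:.5f}")
```

Keeping the scale local to a small block limits how much one outlier can degrade the precision of its neighbours.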

VeRA shares frozen random matrices and learns tiny scaling vectors to cut finetuning params 10–100× with similar performance

0.80
0.70
0.90
8

VeRA slashes the bytes required per adapted model (10–100× less) so firms can store many personalized or task-specific adapters on the same GPU. That reduces serving costs, speeds model swap-in, and lowers storage and network bandwidth for model variants.

Key finding

On GLUE (RoBERTa-large), VeRA matches LoRA average dev performance while using ≈13× fewer trainable parameters.

Numbers: LoRA 0.8M params avg 87.8 vs VeRA 0.061M params avg 87.8
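The parameter saving follows from what is trained: the random projection matrices stay frozen and only two scaling vectors are learned. The NumPy sketch below illustrates the weight update under that reading. The dimensions, initial values and variable names are illustrative assumptions.

```python
import numpy as np

d_in, d_out, r = 1024, 1024, 8  # illustrative layer sizes and rank
rng = np.random.default_rng(0)

# Frozen random projections; these can be shared across layers.
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)
B = rng.normal(size=(d_out, r)) / np.sqrt(r)

# Trainable scaling vectors (initial values here are assumptions).
d_vec = np.full(r, 0.1)
b_vec = np.zeros(d_out)  # zero init makes the initial update exactly zero


def delta_w(b_vec, d_vec):
    """Compute diag(b) @ B @ diag(d) @ A using broadcasting."""
    return (b_vec[:, None] * B) @ (d_vec[:, None] * A)


vera_params = d_vec.size + b_vec.size  # r + d_out
lora_params = r * (d_in + d_out)       # LoRA trains both low-rank factors
print(f"scaling-vector params: {vera_params}, LoRA params: {lora_params}")
```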

Benchmarking zeroth-order (no-backprop) optimizers to cut LLM fine-tuning memory and explore practical trade-offs

0.60
0.60
0.70
7

ZO methods let fine-tuning of multi-billion-parameter LLMs run with much lower peak memory, often on a single high-memory GPU, reducing cloud costs and enabling training in constrained environments. But expect accuracy or compute trade-offs on hard tasks.

Key finding

ZO optimizers cut peak memory by a large margin vs standard FO optimizers.

Numbers: ZO-SGD 64 GB vs FO-SGD 148 GB (peak) on OPT-13B/MultiRC
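The memory saving comes from estimating gradients with forward evaluations only, so no activations need to be kept for a backward pass. The sketch below shows a two-point random-direction estimator on a toy quadratic loss. The loss, step size and perturbation scale are illustrative assumptions, not settings from the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=10)


def loss(theta):
    """Toy quadratic loss standing in for a model's training loss."""
    return 0.5 * np.sum((theta - target) ** 2)


def zo_sgd_step(theta, lr=0.05, eps=1e-3):
    """One zeroth-order SGD step using two forward evaluations and no backprop."""
    z = rng.normal(size=theta.shape)  # random perturbation direction
    # Central finite difference along z estimates the directional derivative.
    g_scalar = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)
    return theta - lr * g_scalar * z


theta = np.zeros(10)
initial = loss(theta)
for _ in range(500):
    theta = zo_sgd_step(theta)
final = loss(theta)
print(f"loss went from {initial:.3f} to {final:.3e}")
```

The estimate is noisy, and its variance grows with the number of parameters, which is consistent with the accuracy and compute trade-offs noted above.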

SWIFT: an open-source, one-stop framework to fine-tune, evaluate, quantize and deploy over 550 LLMs and 200+ MLLMs

0.70
0.40
0.70
5

SWIFT unifies fine-tuning, RL-style alignment, quantization and deployment for text and multimodal models. That reduces engineering overhead, accelerates experiments on agents and lets teams run many models and tuners without building custom glue code.

Key finding

SWIFT already supports a very large model and dataset surface.

Numbers: 550+ LLMs, 200+ MLLMs, ~150+ datasets (paper claims)

AdaLink: non-intrusive input adapters that match full fine-tuning on many multimodal tasks

0.70
0.60
0.80
5

AdaLink cuts adaptation cost and serving complexity by tuning tiny adapters instead of full models, letting teams deploy many task-specific behaviors without copying huge models.

Key finding

AdaLink reaches near full fine-tuning on COCO captioning with instruction-tuned base.

Numbers: CIDEr: AdaLink 146.3 vs FT 147.0 (−0.65)

Cut transformer training losses from variation: module-aware scales, oscillation regularizer, and multi-crop KD for reliable 2–4 bit QAT

0.70
0.60
0.70
4

This paper gives practical steps to train transformers at 2–4 bits with smaller accuracy loss and faster runs, lowering compute cost and hardware area compared to mixed-precision designs.

Key finding

Attention modules (MHSA) are far more sensitive to low-bit quantization than FFNs; value matrices are especially critical.

Numbers: DeiT-T W3A3: All quantized Top-1 68.22% → All except MHSA 71.28% (+3.06)

PeFAD: parameter-efficient federated anomaly detection using pre-trained language models

0.70
0.60
0.70
4

PeFAD lets organizations detect anomalies across distributed sensors without sharing raw data, lowering privacy risk and network cost while improving detection accuracy on real datasets.

Key finding

PeFAD outperforms federated baselines on four real datasets.

Numbers: F1 gains vs federated baselines: 3.83%–28.74% (evaluated datasets)

Train task-focused supervised fine-tuning and preference alignment in parallel, then sparsify and merge adapters to avoid alignment tax.

0.70
0.60
0.60
4

PAFT can preserve both task accuracy and alignment without retraining large models end-to-end; companies can run SFT and alignment in parallel, sparsify adapters, and merge them to ship stronger, aligned models faster.

Key finding

Parallel training (PAFT) plus L1-sparsified SFT improves merged-model scores versus sequential or standalone training on the 6-task Open LLM suite.

Numbers: PAFT (SFTsparse + DPO) avg=0.65243 vs DPO-alone 0.6333 (Mistral-7B)

Share tiny LoRA adapters so heterogeneous clients learn together with far less compute and bandwidth

0.70
0.60
0.80
4

FedLoRA lets federated systems mix different client models while cutting device compute and network usage, enabling FL on diverse hardware without public data.

Key finding

FedLoRA improves average test accuracy over state-of-the-art MHPFL methods on CIFAR-10/100.

Numbers: +1.35% accuracy (best reported on evaluated benchmarks)

MindLLM: 1.3B and 3B bilingual LLMs trained from scratch that match larger open models on several benchmarks

0.50
0.40
0.70
3

You can get near state-of-the-art performance on some English, Chinese and domain tasks with 1–3B models, cutting training and deployment cost while keeping the ability to adapt to law or finance via targeted fine‑tuning.

Key finding

MindLLM-1.3B outperforms GPT-Neo-1.3B on English MMLU in few-shot evaluation.

Numbers: MMLU 26.6 vs 24.1 (Table 5 / Table 7)

Match expensive re-training by re-warming/decaying the LR plus replay to update LLMs efficiently

0.70
0.45
0.80
3

You can update large LLMs on fresh data at far lower compute cost than full re-training while keeping model quality similar, cutting operational cost and turnaround time for model updates.

Key finding

Re-warming then re-decaying the learning rate is required to adapt well to new pre-training data.
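A re-warm and re-decay schedule can be sketched as a linear warmup followed by a cosine decay, restarted whenever a new dataset arrives. The function name, the warmup from zero and the example values below are assumptions chosen for illustration.

```python
import math


def rewarm_redecay_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Call this with step reset to 0 at the start of each new pre-training dataset.
schedule = [rewarm_redecay_lr(s, 1000, 10, 3e-4, 3e-5) for s in range(1001)]
print(schedule[0], schedule[10], schedule[1000])
```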

Cut LoRA adapters down to r×r trainable matrices via SVD — 10–1000x less storage while matching accuracy

0.80
0.60
0.90
3

LoRA-XS lets teams store and deploy many task- or user-specific adapters at tiny cost; this lowers cloud storage and checkpointing expense and enables personalization at scale without extra inference latency.

Key finding

Large parameter savings vs LoRA while keeping accuracy

Numbers: RoBERTa-large: LoRA 800K → LoRA-XS 60K; GLUE avg 87.82 → 88.69
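The storage saving comes from freezing the two outer factors, taken from a truncated SVD of the pretrained weight, and training only a small r×r matrix between them. The sketch below shows that structure in NumPy. The dimensions and the zero initialization of `R` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 128, 8  # illustrative sizes
W = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained weight matrix

# Truncated SVD of the pretrained weight provides the frozen outer factors.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]  # frozen, shape (d_out, r), equals U_r @ diag(S_r)
B = Vt[:r, :]         # frozen, shape (r, d_in)

# The only trainable tensor; zero init (an assumption) leaves W unchanged at start.
R = np.zeros((r, r))


def adapted_weight(R):
    """Pretrained weight plus the low-rank update A @ R @ B."""
    return W + A @ R @ B


xs_params = R.size                # r * r
lora_params = r * (d_in + d_out)  # standard LoRA trains both factors
print(f"r x r adapter params: {xs_params}, LoRA params: {lora_params}")
```

The trainable count depends only on the rank, not on the layer width, which is why per-adapter storage shrinks so sharply.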

One-shot clustering of MLP subunits that preserves NTK to speed up fine-tuning of dense and MoE models

0.70
0.60
0.70
2

MLP Fusion reduces GPU memory and fine-tuning time while preserving training dynamics and near-original accuracy, making low-cost SFT and smaller deployed models feasible for companies running many custom fine-tunes or deploying large MoE models.

Key finding

MLP Fusion yields the lowest NTK approximation error among tested one-shot methods.

Numbers: NTK error on SST2 (RoBERTa first layer): 2826.6 ±155.1 vs SVD 4423.4

Split each weight matrix into a fixed quantized part plus a trainable low-rank part to finetune LLMs with sub-3-bit storage.

0.70
0.60
0.80
2

LQ-LoRA cuts model storage and finetuning memory, letting teams run or adapt multi-billion-parameter models on fewer/cheaper GPUs and lower hosting costs.

Key finding

LQ-LoRA matches or improves on QLoRA/GPTQ-LoRA at similar average bits/param.

Numbers: e.g., 2.75-bit LQ-LoRA (effective 2.85 bits for 70B) gives C4 PPL 6.35 vs uncompressed 6.50 on LLaMA-2-70B
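One way to obtain a "quantized plus low-rank" split is to alternate between quantizing the residual and fitting a truncated SVD to what quantization missed. The sketch below does this with a simple per-tensor uniform quantizer as a stand-in. The quantizer, iteration count and function names are assumptions, not the paper's procedure.

```python
import numpy as np


def uniform_quantize(w, bits=3):
    """Per-tensor uniform quantize-dequantize (a simple stand-in quantizer)."""
    levels = 2**bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return np.round((w - lo) / scale) * scale + lo


def low_rank(m, r):
    """Best rank-r approximation of m via truncated SVD."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]


def lq_decompose(w, r=8, bits=3, iters=5):
    """Alternate quantization and low-rank fitting; keep the best split seen."""
    best = None
    l = np.zeros_like(w)
    for _ in range(iters):
        q = uniform_quantize(w - l, bits)  # fixed quantized part
        l = low_rank(w - q, r)             # low-rank part absorbs the residual
        err = np.linalg.norm(w - q - l)
        if best is None or err < best[2]:
            best = (q, l, err)
    return best


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Q, L, err_lq = lq_decompose(W)
err_q_only = np.linalg.norm(W - uniform_quantize(W))
print(f"quantize only: {err_q_only:.3f}, quantized + low-rank: {err_lq:.3f}")
```

The alternation is not guaranteed to improve at every iteration, so the sketch keeps the best split it has seen.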

Train one quantized LoRA that supports many ranks and fine-tunes Falcon-40b on a single 32GB GPU

0.60
0.60
0.80
2

QDyLoRA cuts hardware and iteration cost by producing adapters for many ranks in one quantized fine-tune, letting teams tune large models on smaller GPUs and pick low-rank deployments without retraining.

Key finding

A single QDyLoRA fine-tune produces adapters usable at ranks 1–64 and fits Falcon-40b on one 32GB V100 GPU.

Numbers: Fine-tuned Falcon-40b for ranks 1–64 on a single 32GB V100 (reported in text).
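Supporting many ranks from one fine-tune relies on nested adapters: a rank-b adapter is the first b rows and columns of the maximum-rank one. The sketch below shows only the truncation step with random stand-in matrices. The names and sizes are hypothetical, and the training loop, which would sample a rank per step, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, max_rank = 128, 128, 64  # illustrative sizes

# Stand-ins for adapter factors trained at the maximum rank.
A = rng.normal(size=(max_rank, d_in))   # down projection
B = rng.normal(size=(d_out, max_rank))  # up projection


def adapter_at_rank(rank):
    """Truncate the nested factors to obtain a lower-rank adapter update."""
    return B[:, :rank] @ A[:rank, :]


small = adapter_at_rank(4)
full = adapter_at_rank(max_rank)
print(small.shape, np.linalg.matrix_rank(small))
```

At deployment time a team could pick the smallest rank that meets its quality target without running another fine-tune.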

AutoFLIP: Federated hybrid pruning guided by client loss exploration

0.70
0.60
0.80
2

AutoFLIP cuts client compute and bandwidth by tens of percent while often improving accuracy on heterogeneous data, enabling cheaper, faster federated deployments on edge devices.

Key finding

Large accuracy gain on a hard non‑IID task (CIFAR‑100, ResNet).

Numbers: AutoFLIP 0.987 vs FedAvg 0.918 (+0.069) on CIFAR‑100 ResNet