Finetune 65B LLMs in 2/3/4-bit on a single consumer GPU by combining LoRA and modern quantizers

Overview

Decision Snapshot: Ready For Pilot

Method is implemented and open-sourced with broad experiments on public benchmarks and clear memory/runtime gains; inference fusion and kernel optimization remain engineering work.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ModuLoRA lets teams finetune very large LLMs on commodity GPUs, cutting infrastructure cost and cycle time while preserving task performance.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

ModuLoRA is a memory-efficient finetuning method that attaches high-precision low-rank adapters (LoRA) to weights stored in low-bit quantized form. It uses a quantizer-agnostic backward pass that re-materializes dequantized weights on the fly, so only one dequantized layer exists in memory at a time. Paired with modern quantizers (OPTQ for 3-bit, QuIP# for 2-bit) it enables finetuning of LLaMA and other open models in 2/3/4-bit precision on consumer GPUs (e.g., 65B on a single 24GB or 48GB GPU), with competitive task performance and much lower memory use than full-precision baselines.
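The re-materialization idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the LLMTools implementation: `QuantizedLoRALayer`, its affine `scale`/`zero` dequantizer, and the `live`/`peak` counters are hypothetical names introduced here, and the toy dequantizer stands in for a real quantizer such as OPTQ or QuIP#.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class QuantizedLoRALayer:
    """Toy layer: frozen integer codes plus a trainable low-rank adapter (A, B).

    Assumption: a simple affine dequantizer (code * scale + zero) stands in
    for a real quantizer; the counters exist only to make memory use visible.
    """
    live = 0  # dequantized weight matrices currently materialized
    peak = 0  # largest value `live` has reached

    def __init__(self, codes, scale, zero, A, B):
        self.codes, self.scale, self.zero = codes, scale, zero
        self.A, self.B = A, B  # A: r x d_in, B: d_out x r, kept in high precision

    def forward(self, x):
        cls = QuantizedLoRALayer
        # Materialize high-precision weights for this layer only.
        W = [[c * self.scale + self.zero for c in row] for row in self.codes]
        cls.live += 1
        cls.peak = max(cls.peak, cls.live)
        base = matvec(W, x)
        # Drop the dequantized copy before moving to the next layer.
        del W
        cls.live -= 1
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + l for b, l in zip(base, low_rank)]


def run_stack(layers, x):
    """Apply layers in sequence; each dequantizes, computes, then discards."""
    for layer in layers:
        x = layer.forward(x)
    return x


codes = [[0, 1], [2, 3]]  # 2-bit codes for a 2x2 weight matrix
layers = [
    QuantizedLoRALayer(codes, 0.5, -0.5, A=[[1.0, 0.0]], B=[[1.0], [0.0]])
    for _ in range(3)
]
out = run_stack(layers, [1.0, 2.0])
print(out, "peak dequantized layers:", QuantizedLoRALayer.peak)
```

However deep the stack, the peak counter stays at one, which is the property that keeps high-precision memory bounded by a single layer.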

Problem Statement

Finetuning large LLMs normally requires storing full-precision weights in memory, which prevents tuning very large models on consumer GPUs. The paper asks: can we finetune large models using low-bit quantized weights while still training high-quality adapters and keeping memory small?

Main Contribution

ModuLoRA: a quantizer-agnostic method that finetunes LoRA adapters while keeping base weights in low-bit quantized form

An implementation (LLMTools) with CUDA kernels for mixed-precision materialization enabling 2/3/4-bit finetuning on consumer GPUs

Key Findings

Run 65B finetuning on a single 24GB GPU in 2-bit precision

Numbers: 65B finetune in 2-bit on one RTX 3090 24GB (paper claim)

Practical Use: You can finetune very large LLaMA models on a single 24GB card using ModuLoRA + QuIP#, removing the need for multi-GPU rigs for many experiments

Evidence Ref: Abstract, Introduction, Conclusion

Huge memory reduction vs full-precision LoRA

Numbers: 65B finetune memory: 21.8 GB (2-bit) vs 360.4 GB (full precision)

Practical Use: Expect >16x lower GPU memory use vs full-precision; this enables local experimentation and faster iteration

Evidence Ref: Table 7 (Memory requirements)
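A back-of-the-envelope check of the memory figures above, as a rough sketch. The nominal parameter count of 65e9 and the 1 GB = 1e9 bytes convention are assumptions, and `weight_memory_gb` is a hypothetical helper that counts weight storage only.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Bytes needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 65e9  # nominal size of a "65B" model (assumption)

for bits in (2, 3, 4, 16):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):6.2f} GB")

# Totals reported in the summary (Table 7): whole finetuning run, not just weights.
reported_2bit_gb = 21.8
reported_full_gb = 360.4
ratio = reported_full_gb / reported_2bit_gb
print(f"reported reduction: {ratio:.1f}x")
```

The weights-only estimate for 2-bit storage comes in below the reported 21.8 GB total. The remainder presumably covers adapters, activations, optimizer state and quantization metadata, though the summary does not break this down.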

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	97.2% ±0.8 (LLMTools 3-bit)	98.6% ±1.0 (Bits&Bytes 8-bit)	-1.4 pp	Text classification (5 genres held-out)	Table 1: 65B accuracy 97.2 ±0.8 (3-bit) vs 98.6 ±1.0 (8-bit)	Table 1
Accuracy	91.85% ±0.3 (LLMTools 2-bit)	91.55% ±0.1 (Bits&Bytes 8-bit LLM.int8())	+0.3 pp	MNLI matched test	Table 2: 65B 2-bit 91.85 ±0.3; 8-bit baseline ~91.55 ±0.1	Table 2

What To Try In 7 Days

Install LLMTools and reproduce a 7B or 13B ModuLoRA finetune on a single 24GB GPU

Compare memory and step time vs your current LoRA/QLoRA pipeline on a small benchmark

If using 30B–65B models, test OPTQ for 3-bit and QuIP# for 2-bit to see memory savings vs your 8-bit setup
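For the step-time comparison suggested above, a small timing harness may be enough to start with. This is a sketch using only the standard library; `time_steps` and `dummy_step` are hypothetical names, and the real step function would wrap one training step of whichever pipeline is being measured.

```python
import statistics
import time


def time_steps(step_fn, n_warmup=2, n_steps=10):
    """Time repeated calls to step_fn and return summary statistics in seconds.

    step_fn is a hypothetical zero-argument callable that runs one training
    step of whichever pipeline is being measured. For GPU work it should block
    until the step has finished (in PyTorch, by calling
    torch.cuda.synchronize()), otherwise only launch overhead is timed.
    """
    for _ in range(n_warmup):  # exclude one-off costs such as kernel compilation
        step_fn()
    times = []
    for _ in range(n_steps):
        start = time.perf_counter()
        step_fn()
        times.append(time.perf_counter() - start)
    return {
        "n": n_steps,
        "mean_s": statistics.mean(times),
        "median_s": statistics.median(times),
        "max_s": max(times),
    }


def dummy_step():
    """Placeholder workload so the sketch runs without a GPU."""
    sum(i * i for i in range(10_000))


print(time_steps(dummy_step))
```

For the memory side of the comparison in PyTorch, peak GPU usage can be read with `torch.cuda.max_memory_allocated()` after calling `torch.cuda.reset_peak_memory_stats()` at the start of the run.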

Optimization Features

Infra Optimization

Enables finetuning 30B on 24GB and 65B on 24–48GB GPUs, unlocking data-parallel training on single-d

Model Optimization

Post-training quantization to 2/3/4 bits (QuIP#, OPTQ)

LoRA

System Optimization

Custom CUDA kernels for mixed-precision matrix-vector multiplication and materialization

Materialize weights in float16 for efficiency

Training Optimization

Quantizer-agnostic backward pass that re-dequantizes per-layer during backprop

Row or weight materialization to limit high-precision memory to one row/layer
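One way to picture a quantizer-agnostic backward pass for a single layer is sketched below. It is an illustration under assumptions, not the paper's CUDA code: the function names are hypothetical, the affine `dequantize` stands in for any quantizer that exposes a dequantization routine, and the layer computes y = W x + B (A x) with W frozen.

```python
def dequantize(codes, scale, zero):
    """Toy affine dequantizer; any quantizer's own routine could be swapped in."""
    return [[c * scale + zero for c in row] for row in codes]


def matvec(M, x):
    """Compute M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matvec_t(M, y):
    """Compute M^T y without building the transpose."""
    return [sum(M[i][j] * y[i] for i in range(len(M))) for j in range(len(M[0]))]


def forward(codes, scale, zero, A, B, x):
    W = dequantize(codes, scale, zero)  # materialized here, dropped on return
    h = matvec(A, x)
    y = [w + l for w, l in zip(matvec(W, x), matvec(B, h))]
    return y, (x, h)  # only activations are saved, never the dequantized W


def backward(codes, scale, zero, A, B, saved, dy):
    x, h = saved
    W = dequantize(codes, scale, zero)  # re-materialized just for this step
    Bt_dy = matvec_t(B, dy)
    # Gradient flowing to the previous layer uses both the base and the adapter.
    dx = [a + b for a, b in zip(matvec_t(W, dy), matvec_t(A, Bt_dy))]
    # Only the adapter matrices receive parameter gradients; W stays frozen.
    dB = [[dyi * hj for hj in h] for dyi in dy]
    dA = [[g * xj for xj in x] for g in Bt_dy]
    return dx, dA, dB


codes, scale, zero = [[0, 1], [2, 3]], 0.5, -0.5
A, B = [[1.0, 0.0]], [[0.5], [-1.0]]
y, saved = forward(codes, scale, zero, A, B, [1.0, 2.0])
dx, dA, dB = backward(codes, scale, zero, A, B, saved, dy=[1.0, 1.0])
print("y:", y, "dx:", dx, "dA:", dA, "dB:", dB)
```

Because the backward step calls the dequantizer again instead of caching its output, the quantizer can be treated as a black box and the stored weights never leave their low-bit form.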

Inference Optimization

No trivial adapter fusion with quantized base; inference requires mixed-precision implementation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/kuleshov-group/llmtools

https://github.com/kuleshov-group/MODULoRA-Experiment

Data URLs

SAMSum (public), MNLI (public), Alpaca (public), BBH (public), C4 (public)

Risks & Boundaries

Limitations

Adapters cannot be trivially fused into quantized weights at inference, adding runtime complexity

Relies on quality of external quantizers; poor quantizer choice can hurt finetuning

When Not To Use

If you need adapter-weight fusion for highly optimized inference pipelines

If your production system cannot run mixed-precision kernels or custom CUDA code

Failure Modes

Poor quantizer (round-to-nearest) reduces downstream quality vs advanced quantizers like OPTQ/QuIP#

Slower inference due to non-optimized CUDA kernels compared to some baselines
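To give a feel for why a naive quantizer struggles at very low bit widths, the sketch below applies plain round-to-nearest to a handful of made-up weights. The weight values and helper names are illustrative assumptions, and the example measures weight reconstruction error only; it says nothing about downstream task quality or about how OPTQ or QuIP# behave.

```python
def rtn_quantize(weights, bits):
    """Uniform round-to-nearest onto 2**bits levels spanning [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate weight values."""
    return [c * scale + lo for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up weights for illustration only.
weights = [-0.9, -0.31, -0.05, 0.02, 0.11, 0.27, 0.48, 0.9]

errors = {}
for bits in (4, 3, 2):
    codes, scale, lo = rtn_quantize(weights, bits)
    errors[bits] = mse(weights, dequantize(codes, scale, lo))
    print(f"{bits}-bit RTN: step={scale:.3f}  mse={errors[bits]:.5f}")
```

With only four levels available at 2 bits, each weight can land far from its nearest grid point, which is the regime where the choice of quantizer matters most.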

Core Entities

Models

LLaMA (7B/13B/30B/65B), OPT (7B/13B/30B), BLOOM

Metrics

Accuracy, ROUGE-1/2/L, Perplexity, Exact match (for BBH)

Datasets

SAMSum, MNLI, Alpaca, Code-Alpaca, BBH, C4, Wiki2

Benchmarks

SAMSum (summarization), MNLI (natural language inference), BBH (instruction following)
