Finetune 65B LLMs in 2/3/4-bit on a single consumer GPU by combining LoRA and modern quantizers

September 28, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov

Links

Abstract / PDF

Why It Matters For Business

ModuLoRA lets teams finetune very large LLMs on commodity GPUs, cutting infrastructure cost and cycle time while preserving task performance.

Summary TLDR

ModuLoRA is a memory-efficient finetuning method that attaches high-precision low-rank adapters (LoRA) to weights stored in low-bit quantized form. It uses a quantizer-agnostic backward pass that re-materializes dequantized weights on the fly, so only one dequantized layer exists in memory at a time. Paired with modern quantizers (OPTQ for 3-bit, QuIP# for 2-bit), it enables finetuning of LLaMA and other open models in 2/3/4-bit precision on consumer GPUs (e.g., 65B on a single 24GB or 48GB GPU), with competitive task performance and much lower memory use than full-precision baselines.
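The re-materialization idea can be sketched on a toy layer. This is a minimal illustration, not the LLMTools implementation: the class name `QuantizedLoRALinear`, the uniform affine quantizer (`scale`, `zero`), and the pure-Python list "tensors" are all assumptions chosen so the example runs without a GPU. The point it shows is that the dequantized weight is a temporary in both passes, and that gradients flow only to the adapter matrices A and B.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class QuantizedLoRALinear:
    """Hypothetical layer: y = x @ dequant(W_q).T + (x @ A.T) @ B.T.

    W_q stays frozen in low-bit form; only A and B receive gradients.
    """

    def __init__(self, codes, scale, zero, lora_a, lora_b):
        self.codes = codes    # integer codes, shape (out, in)
        self.scale = scale    # assumed uniform affine quantizer parameters
        self.zero = zero
        self.lora_a = lora_a  # (r, in), high precision
        self.lora_b = lora_b  # (out, r), high precision

    def dequantize(self) -> Matrix:
        # Stand-in for any quantizer's dequantization routine.
        return [[self.scale * (c - self.zero) for c in row] for row in self.codes]

    def forward(self, x: Matrix) -> Matrix:
        w = self.dequantize()  # temporary; not kept for the backward pass
        self._x = x            # only the input and the small x @ A.T are saved
        self._xa = matmul(x, transpose(self.lora_a))
        return add(matmul(x, transpose(w)), matmul(self._xa, transpose(self.lora_b)))

    def backward(self, grad_y: Matrix) -> Tuple[Matrix, Matrix, Matrix]:
        w = self.dequantize()  # re-materialized on the fly
        grad_b = matmul(transpose(grad_y), self._xa)   # (out, r)
        grad_xa = matmul(grad_y, self.lora_b)          # (n, r)
        grad_a = matmul(transpose(grad_xa), self._x)   # (r, in)
        grad_x = add(matmul(grad_y, w), matmul(grad_xa, self.lora_a))
        return grad_x, grad_a, grad_b


layer = QuantizedLoRALinear(
    codes=[[0, 3], [2, 1]], scale=0.5, zero=1,
    lora_a=[[1.0, 2.0]], lora_b=[[0.0], [1.0]],
)
y = layer.forward([[1.0, 1.0]])
grad_x, grad_a, grad_b = layer.backward([[1.0, 1.0]])
print(y, grad_x, grad_a, grad_b)
```

Because `dequantize` is the only place the quantizer appears, swapping in a different quantizer would not change the forward or backward logic, which is the sense in which the approach is quantizer-agnostic.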

Problem Statement

Finetuning large LLMs normally requires storing full-precision weights in memory, which prevents tuning very large models on consumer GPUs. The paper asks: can we finetune large models using low-bit quantized weights while still training high-quality adapters and keeping memory small?
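A back-of-the-envelope estimate makes the constraint concrete. The sketch below counts base-weight storage only, assuming a nominal 65e9 parameters and decimal gigabytes; it ignores activations, adapters, optimizer state, and quantization metadata, so it will not match measured totals.

```python
def weight_gigabytes(n_params: float, bits_per_weight: int) -> float:
    """Storage for the base weights only, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_65B = 65e9  # assumption: nominal parameter count of a "65B" model

for bits in (16, 8, 4, 3, 2):
    gb = weight_gigabytes(N_PARAMS_65B, bits)
    fits = "fits" if gb < 24 else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:.2f} GB ({fits} in 24 GB, weights alone)")
```

Under these assumptions, 16-bit weights alone need 130 GB, while 2-bit weights need 16.25 GB, which leaves some headroom on a 24GB card for adapters and activations.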

Main Contribution

ModuLoRA: a quantizer-agnostic method that finetunes LoRA adapters while keeping base weights in low-bit quantized form

An implementation (LLMTools) with CUDA kernels for mixed-precision materialization enabling 2/3/4-bit finetuning on consumer GPUs

Empirical evidence that 2/3/4-bit finetuning with modern quantizers matches or exceeds higher-precision baselines on classification, NLI, summarization, and instruction-following

Key Findings

Run 65B finetuning on a single 24GB GPU in 2-bit precision

Numbers: 65B finetune in 2-bit on one RTX 3090 24GB (paper claim)

Huge memory reduction vs full-precision LoRA

Numbers: 65B finetune memory: 21.8 GB (2-bit) vs 360.4 GB (full precision)

Competitive task performance with low-bit finetuning

Numbers: Text classification: 65B 3-bit accuracy 97.2% ±0.8 vs 8-bit 98.6% ±1.0

Matches or exceeds baselines on MNLI and instruction tasks

Numbers: MNLI-m: 65B 2-bit 91.85% ±0.3 (LLMTools), comparable to full/8-bit baselines

New state-of-the-art ROUGE on SAMSum with quantized 65B

Numbers: LLaMA-65B 4-bit ModuLoRA reported top ROUGE on SAMSum (paper claim)

Faster finetuning iteration time

Numbers: LLMTools (2-bit) 7B step: 0.61s vs LoRA full precision 1.50s (~59% less time per step, about 2.5x faster)

Results

Accuracy (text classification, LLaMA 65B)

Value: 97.2% ±0.8 (LLMTools 3-bit)

Baseline: 98.6% ±1.0 (Bits&Bytes 8-bit)

Accuracy (MNLI-m, LLaMA 65B)

Value: 91.85% ±0.3 (LLMTools 2-bit)

Baseline: 91.55% ±0.1 (Bits&Bytes 8-bit, LLM.int8())

SAMSum ROUGE-1 (LLaMA 65B)

Value: Top ROUGE reported for LLaMA-65B 4-bit (LLMTools)

Baseline: GPT-3 / QLoRA / 8-bit LoRA baselines

Finetuning step time (LLaMA 7B)

Value: 0.61 s/iteration (LLMTools 2-bit)

Baseline: 1.50 s/iteration (LoRA full precision)

GPU memory to finetune (LLaMA 65B)

Value: 21.8 GB (LLMTools 2-bit)

Baseline: 360.4 GB (Full-precision LoRA)
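The reported memory and step-time figures can be turned into ratios with a few lines of arithmetic. The numbers below are copied from the results above; the helper name `improvement` is only illustrative.

```python
from typing import Tuple


def improvement(baseline: float, value: float) -> Tuple[float, float]:
    """Return (ratio baseline/value, fractional reduction relative to baseline)."""
    return baseline / value, 1 - value / baseline


# Figures as reported in the results above.
mem_ratio, mem_saved = improvement(baseline=360.4, value=21.8)    # GB, LLaMA 65B
time_ratio, time_saved = improvement(baseline=1.50, value=0.61)   # s/iter, LLaMA 7B

print(f"memory: {mem_ratio:.1f}x smaller ({mem_saved:.0%} less)")
print(f"step time: {time_ratio:.2f}x faster ({time_saved:.0%} less time per step)")
```

Note that the two comparisons use different model sizes (65B for memory, 7B for step time), so they should not be combined into a single figure.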

Who Should Care

What To Try In 7 Days

Install LLMTools and reproduce a 7B or 13B ModuLoRA finetune on a single 24GB GPU

Compare memory and step time vs your current LoRA/QLoRA pipeline on a small benchmark

If using 30B–65B models, test OPTQ for 3-bit and QuIP# for 2-bit to see memory savings vs your 8-bit setup
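For the step-time comparison, a small timing harness is enough to start. This is a generic sketch: `time_steps` and `dummy_step` are hypothetical names, and `step_fn` stands for whatever single training step your pipeline exposes. It does not call LLMTools or any finetuning library.

```python
import statistics
import time
from typing import Callable, Dict


def time_steps(step_fn: Callable[[], None], warmup: int = 3, iters: int = 10) -> Dict[str, float]:
    """Time repeated calls to step_fn after a few untimed warmup calls."""
    for _ in range(warmup):
        step_fn()  # warmup: caches, lazy initialization, kernel compilation
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step_fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


calls = []


def dummy_step() -> None:
    # Placeholder workload. In a real run this would be one forward/backward/
    # optimizer step. GPU work is asynchronous, so with PyTorch the step should
    # end with torch.cuda.synchronize() before the timer stops.
    calls.append(sum(i * i for i in range(1000)))


stats = time_steps(dummy_step, warmup=2, iters=5)
print(stats)
```

Run the same harness with identical batch size and sequence length for each pipeline. For peak GPU memory in PyTorch, `torch.cuda.max_memory_allocated()` read after the timed loop gives a comparable number across runs.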

Optimization Features

Infra Optimization

  • Enables finetuning 30B on 24GB and 65B on 24–48GB GPUs, unlocking data-parallel training on single-d

Model Optimization

  • Post-training quantization to 2/3/4 bits (QuIP#, OPTQ)
  • LoRA
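To make "low-bit storage" concrete, here is a minimal sketch of the simplest possible quantizer, uniform round-to-nearest, together with 2-bit packing. It is explicitly not OPTQ or QuIP#, which use more sophisticated procedures; the function names and the packing layout (four codes per byte, lowest bits first) are assumptions for illustration.

```python
from typing import List, Tuple


def quantize_rtn(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniform round-to-nearest quantization onto 2**bits levels over [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize(codes: List[int], scale: float, lo: float) -> List[float]:
    return [lo + scale * c for c in codes]


def pack_2bit(codes: List[int]) -> bytes:
    """Pack four 2-bit codes into each byte, zero-padding the tail."""
    padded = codes + [0] * (-len(codes) % 4)
    return bytes(
        padded[i] | (padded[i + 1] << 2) | (padded[i + 2] << 4) | (padded[i + 3] << 6)
        for i in range(0, len(padded), 4)
    )


weights = [-1.0, -0.4, 0.3, 1.0, 0.05, -0.9, 0.7, 0.2]
codes, scale, lo = quantize_rtn(weights, bits=2)
packed = pack_2bit(codes)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, len(packed), "bytes for", len(weights), "weights")
print("max abs error:", max_error, "half step:", scale / 2)
```

Eight weights fit in two bytes, and the reconstruction error of each weight is bounded by half a quantization step. Better quantizers reduce the effect of that error on model outputs rather than changing the storage cost.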

System Optimization

  • Custom CUDA kernels for mixed-precision matrix-vector multiplication and materialization
  • Materialize weights in float16 for efficiency

Training Optimization

  • Quantizer-agnostic backward pass that re-dequantizes per-layer during backprop
  • Row or weight materialization to limit high-precision memory to one row/layer

Inference Optimization

  • No trivial adapter fusion with quantized base; inference requires mixed-precision implementation
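One way to see why fusion is not trivial: the sum of a dequantized weight and a LoRA update generally falls between quantization levels, and re-quantizing it can round the update away. The toy 2-bit grid and the update values below are hypothetical, chosen only to show the effect.

```python
SCALE, LO, LEVELS = 0.5, -1.0, 3  # hypothetical 2-bit grid: -1.0, -0.5, 0.0, 0.5


def quantize(w: float) -> int:
    """Round to the nearest grid point and clamp to the valid code range."""
    return min(LEVELS, max(0, round((w - LO) / SCALE)))


def dequantize(code: int) -> float:
    return LO + SCALE * code


base_codes = [0, 1, 2, 3]
lora_delta = [0.1, -0.2, 0.15, -0.05]  # hypothetical entries of the product B @ A

# Naive fusion: add the adapter update in high precision, then re-quantize.
merged = [dequantize(c) + d for c, d in zip(base_codes, lora_delta)]
fused_codes = [quantize(w) for w in merged]

print("merged (high precision):", merged)
print("re-quantized codes:", fused_codes, "original codes:", base_codes)
```

In this example every update is smaller than half a step, so the re-quantized codes equal the original ones and the finetuning signal is lost. Keeping the adapter in high precision next to the quantized base avoids this, at the cost of a mixed-precision inference path.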

Reproducibility

Data Urls

  • SAMSum (public)
  • MNLI (public)
  • Alpaca (public)
  • BBH (public)
  • C4 (public)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Adapters cannot be trivially fused into quantized weights at inference, adding runtime complexity
  • Relies on quality of external quantizers; poor quantizer choice can hurt finetuning
  • Not directly applicable to trillion-parameter models that exceed consumer GPU memory even at 1 bit

When Not To Use

  • If you need adapter-weight fusion for highly optimized inference pipelines
  • If your production system cannot run mixed-precision kernels or custom CUDA code
  • When working with models so large (well beyond 100B parameters, toward 1T) that they cannot fit in GPU memory even under extreme quantization

Failure Modes

  • A poor quantizer (e.g., round-to-nearest) reduces downstream quality compared with advanced quantizers such as OPTQ or QuIP#
  • Slower inference due to non-optimized CUDA kernels compared to some baselines
  • If dequantization is expensive on your hardware, training step time improvements may vanish

Core Entities

Models

  • LLaMA (7B/13B/30B/65B)
  • OPT (6.7B/13B/30B)
  • BLOOM

Metrics

  • Accuracy
  • ROUGE-1/2/L
  • Perplexity
  • Exact match (for BBH)

Datasets

  • SAMSum
  • MNLI
  • Alpaca
  • Code-Alpaca
  • BBH
  • C4
  • Wiki2

Benchmarks

  • SAMSum (summarization)
  • MNLI (natural language inference)
  • BBH (instruction following)