Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

May 23, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

Strong empirical evidence across tasks and scales, open-source code and model releases, and multiple evaluations support high readiness and cost savings. Some limits remain (evaluation biases, limited RLHF comparison, and full-scale 16-bit match at 65B not exhaustively proven).

Citations: 485

Evidence Strength: 0.90

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 90%

Production readiness: 80%

Novelty: 80%

Authors

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer


Why It Matters For Business

QLoRA sharply lowers the hardware cost and complexity of finetuning large LLMs. Teams can build custom chatbots and models on a single consumer or professional GPU, which speeds development, lowers cloud spend, and helps protect data privacy.

Who Should Care

Summary TLDR

QLoRA is a finetuning method that stores a frozen base model in 4-bit (using a new NF4 format), backpropagates through it into LoRA adapters, and uses double quantization plus paged optimizers to fit 33B models on 24GB and 65B models on 48GB GPUs. The authors release the Guanaco family of models and show near-ChatGPT performance on the Vicuna benchmark while matching 16-bit finetuning on standard tasks.
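The double quantization saving can be checked with a short calculation. This is a minimal sketch: the block sizes (64 weights per first-level block, 256 constants per second-level block) and the 32-bit and 8-bit constant widths are assumptions recalled from the QLoRA paper rather than values stated in this summary, and the function names are hypothetical.

```python
def overhead_single(block_size=64, const_bits=32):
    """Bits per parameter when each block stores one full-precision scale."""
    return const_bits / block_size


def overhead_double(block_size=64, quant_const_bits=8,
                    second_block_size=256, second_const_bits=32):
    """Bits per parameter when the per-block scales are themselves quantized.

    Each weight block keeps a small quantized scale, and every group of
    `second_block_size` scales shares one full-precision second-level scale.
    """
    first_level = quant_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


single = overhead_single()          # 0.5 bits per parameter
double = overhead_double()          # about 0.127 bits per parameter
saved_bits = single - double        # about 0.373 bits per parameter

# Saving for a 65B-parameter model, in gigabytes (1 GB = 1e9 bytes).
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9

print(f"single: {single:.3f} bits/param")
print(f"double: {double:.3f} bits/param")
print(f"saved on 65B: {saved_gb_65b:.2f} GB")
```

Under these assumptions the saving is roughly 3 GB on a 65B model, which matters when the target is a 48GB card.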

Problem Statement

Finetuning very large pretrained language models requires huge GPU memory (e.g., >780GB for a 65B model in 16-bit), putting large-model finetuning out of reach for most teams. Prior quantization methods worked for inference but broke training.
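The >780 GB figure is consistent with a simple bytes-per-parameter count. The breakdown below (16-bit weights, 16-bit gradients, and two 32-bit Adam states per parameter) is an assumption used for illustration, not a breakdown given in this summary, and it ignores activation memory. The function names are hypothetical.

```python
def full_finetune_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Rough memory for full finetuning: weights + gradients + optimizer state.

    Assumes 16-bit weights and gradients plus two 32-bit Adam moments
    (2 + 2 + 8 = 12 bytes per parameter). Activations are not counted.
    """
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9


def frozen_weights_gb(n_params, bits=4):
    """Memory for frozen base weights stored at `bits` bits per parameter."""
    return n_params * bits / 8 / 1e9


full_65b = full_finetune_gb(65e9)      # 780.0
frozen_65b = frozen_weights_gb(65e9)   # 32.5

print(f"16-bit full finetuning, 65B: {full_65b:.1f} GB")
print(f"4-bit frozen base weights, 65B: {frozen_65b:.1f} GB")
```

The 4-bit base weights alone leave headroom on a 48GB GPU for quantization constants, adapters, their optimizer state, and activations.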

Main Contribution

QLoRA: backpropagate through a frozen 4-bit quantized base model into Low-Rank Adapters (LoRA) so only adapters need full gradients

NF4: a 4-bit NormalFloat data type optimized for normally distributed weights
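The idea behind a quantile-based 4-bit data type can be sketched in a few lines. This is a simplified illustration, not the NF4 table shipped in bitsandbytes: it places 16 levels at evenly spaced quantiles of a standard normal and rescales them to [-1, 1], whereas the published NF4 construction is asymmetric so that zero is exactly representable. The `tail` offset and all function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4, tail=0.02):
    """Return 2**bits levels at evenly spaced normal quantiles, scaled to [-1, 1]."""
    n_levels = 2 ** bits
    normal = NormalDist()
    probs = [tail + (1 - 2 * tail) * i / (n_levels - 1) for i in range(n_levels)]
    quantiles = [normal.inv_cdf(p) for p in probs]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(weights, levels):
    """Absmax-scale one block and map each weight to its nearest level index."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from level indices and the block scale."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
weights = [0.31, -0.12, 0.05, -0.47, 0.22, 0.0, -0.03, 0.40]
codes, absmax = quantize_block(weights, levels)
restored = dequantize_block(codes, absmax, levels)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.2f} -> code {c:2d} -> {r:+.3f}")
```

Because the levels are denser near zero, small weights (the common case for normally distributed values) are reconstructed more precisely than with evenly spaced levels.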

Key Findings

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB

Practical Use: You can finetune 65B open models on a single 48GB GPU instead of requiring multi-server memory.

Evidence Ref: Abstract / Section 1
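Part of this saving comes from training only the low-rank adapters while the base weights stay frozen. The sketch below shows a LoRA-style forward pass in plain Python and counts trainable parameters. It is an illustration under assumptions: the matrix sizes, rank, and names (`down`, `up`, `lora_forward`) are hypothetical, and in QLoRA the frozen matrix would first be dequantized from 4-bit storage before the multiply.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, down, up, scale=1.0):
    """Compute x @ W (frozen) plus the scaled low-rank update x @ down @ up."""
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, down), up)
    return [
        [b + scale * u for b, u in zip(b_row, u_row)]
        for b_row, u_row in zip(base, update)
    ]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: a d_in x rank and a rank x d_out matrix."""
    return rank * (d_in + d_out)


rng = random.Random(0)
d_in, d_out, rank = 6, 4, 2
x = [[rng.gauss(0, 1) for _ in range(d_in)]]
w_frozen = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
down = [[rng.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
up = [[0.0] * d_out for _ in range(rank)]  # zero init: adapter starts as a no-op

y = lora_forward(x, w_frozen, down, up)
print("matches frozen output at init:", y == matmul(x, w_frozen))

# Illustrative sizes only: an 8192 x 8192 projection with rank 64.
full = 8192 * 8192
adapter = lora_param_count(8192, 8192, 64)
print(f"adapter params: {adapter:,} vs full matrix: {full:,}")
```

With these illustrative sizes the adapter holds 64 times fewer parameters than the matrix it modifies, so gradients and optimizer state shrink accordingly.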

Guanaco 65B reaches near ChatGPT quality on the Vicuna benchmark

Numbers: 99.3% of ChatGPT performance on Vicuna (Table 6)

Practical Use: Open-source models finetuned with QLoRA can be competitive with commercial chatbots for many conversational uses.

Evidence Ref: Table 6, Section 5.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GPU memory needed to finetune 65B model | <48 GB | >780 GB (16-bit full finetuning) | >732 GB reduction | - | QLoRA reduces average memory requirements from >780GB to <48GB (Abstract, Section 1) | Abstract / Section 1
Vicuna score relative to ChatGPT (GPT-3.5), evaluated by GPT-4 | 99.3% | ChatGPT (100%) | -0.7 percentage points | Vicuna prompts (80) | Guanaco 65B achieves mean 99.3% of ChatGPT score (Table 6) | Table 6

What To Try In 7 Days

Run QLoRA finetuning of a 7B LLaMA model on your instruction dataset using NF4 + Double Quantization and LoRA adapters

Integrate the bitsandbytes QLoRA kernels and test NF4 vs FP4 quantization on a small validation set

Set up GPT-4 based pairwise evaluation and an Elo tournament to cheaply compare finetuned models
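For the Elo tournament step, the rating update itself is small. The sketch below assumes you already have a list of judged outcomes as `(winner, loser)` pairs, for example from GPT-4 pairwise comparisons; the model names and match list are made up for illustration, and the starting rating of 1000 and K-factor of 32 are assumed defaults rather than values given in this summary.

```python
import random


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_elo_tournament(matches, start=1000.0, k=32.0, seed=0):
    """Replay (winner, loser) pairs in a shuffled order and return final ratings."""
    ratings = {}
    order = list(matches)
    random.Random(seed).shuffle(order)  # Elo depends on match order
    for winner, loser in order:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain
    return ratings


# Hypothetical judged outcomes: (winner, loser).
matches = (
    [("model_a", "model_b")] * 3
    + [("model_b", "model_a")]
    + [("model_a", "model_c")] * 2
    + [("model_b", "model_c")] * 2
)

ratings = run_elo_tournament(matches)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because the result depends on the order in which matches are replayed, it is reasonable to repeat the tournament over several seeds and report the mean rating per model.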

Optimization Features

Token Efficiency: unchanged
Infra Optimization: Single-GPU finetuning for 33B on 24GB, 65B on 48GB
Model Optimization: 4-bit quantization, NF4, Double Quantization, LoRA
System Optimization: NVIDIA unified memory paging, dequantize-to-bf16 for computation
Training Optimization: Paged Optimizers, Adapter-only gradients, Group-by-length batching
Inference Optimization: 4-bit inference quantization
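Group-by-length batching, listed under the training optimizations above, reduces wasted padding by putting sequences of similar length in the same batch. The sketch below is a simplified version that sorts the whole dataset by length and then shuffles the batches; library implementations may keep more randomness by grouping within larger random chunks. The function names and example lengths are hypothetical.

```python
import random


def sequential_batches(n_items, batch_size):
    """Baseline: batch examples in their original order."""
    return [list(range(i, min(i + batch_size, n_items)))
            for i in range(0, n_items, batch_size)]


def group_by_length_batches(lengths, batch_size, seed=0):
    """Batch examples of similar length together, then shuffle batch order."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


def padded_tokens(lengths, batches):
    """Total tokens processed when each batch is padded to its longest sequence."""
    return sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)


lengths = [12, 480, 15, 510, 9, 495, 20, 500]  # token counts per example
naive = padded_tokens(lengths, sequential_batches(len(lengths), 2))
grouped = padded_tokens(lengths, group_by_length_batches(lengths, 2))

print(f"sequential batches: {naive} padded tokens")   # 3970
print(f"grouped by length:  {grouped} padded tokens")  # 2074
```

In this toy example the grouped batches process about half as many padded tokens, which translates into less memory and compute per step.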

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

OASST1 (https://github.com/LAION-AI/Open-Instruction-Generalist or referenced OpenAssistant repo)

FLAN v2 (referenced)

Risks & Boundaries

Limitations

The authors did not exhaustively show that QLoRA matches full 16-bit finetuning at 33B/65B across all tasks, due to resource limits

Evaluation relies heavily on Vicuna and OA benchmarks; results may not generalize to other benchmarks (BigBench, RAFT, HELM)

When Not To Use

When your research requires updating all model weights end to end at native 16-bit precision, rather than training adapters on top of a frozen base model

If you need formal guarantees about safety or bias beyond the limited evaluations reported

Failure Modes

Models still hallucinate or give confident but incorrect factual answers (observed in qualitative examples)

Mathematical reasoning can fail on some problems and provide incorrect steps

Core Entities

Models

LoRA, Guanaco, LLaMA

Metrics

Elo, Accuracy, Perplexity, RougeL

Datasets

OASST1, Alpaca, FLAN v2, HH-RLHF, Self-Instruct, Unnatural Instructions, Chip2, Longform

Benchmarks

Vicuna, MMLU, OA

Context Entities

Models

GPT-4, ChatGPT, Vicuna, Alpaca, Open Assistant, Bard

Metrics

Elo, RougeL, Fleiss κ, Kendall Tau

Datasets

FLAN v2, GLUE, Super-NaturalInstructions

Benchmarks

MMLU, Vicuna