Finetune a 65B LLM on a single 48GB GPU by training 4-bit models with adapters

May 23, 2023 · 9 min

Overview

Production Readiness

0.8

Novelty Score

0.8

Cost Impact Score

0.9

Citation Count

485

Authors

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

Why It Matters For Business

QLoRA sharply lowers the hardware cost and complexity of finetuning large LLMs. Teams can build custom chatbots and models on a single consumer or professional GPU, which speeds development, lowers cloud spend, and helps protect data privacy.

Summary TLDR

QLoRA is a finetuning method that stores a frozen base model in 4-bit (using a new NF4 format), backpropagates through it into LoRA adapters, and uses double quantization plus paged optimizers to fit 33B models on 24GB and 65B models on 48GB GPUs. The authors release the Guanaco family of models and show near-ChatGPT performance on the Vicuna benchmark while matching 16-bit finetuning on standard tasks.

Problem Statement

Finetuning very large pretrained language models requires enormous GPU memory (for example, more than 780 GB for a 65B model in 16-bit), which puts it out of reach for most teams. Prior quantization methods worked for inference but broke down during training.
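
The 780 GB figure is consistent with a common back-of-the-envelope accounting, sketched below. The 12-bytes-per-parameter breakdown (16-bit weights, 16-bit gradients, two 32-bit Adam states) is an assumption made for illustration; this summary does not state it, and the estimate ignores activations.

```python
# Rough memory estimate for full 16-bit finetuning with Adam.
# Assumed breakdown (not stated in the summary): 2 bytes weights + 2 bytes
# gradients + 8 bytes for two 32-bit Adam moments = 12 bytes per parameter.

def full_finetune_gb(n_params: float, bytes_per_param: int = 12) -> float:
    """Estimated memory in GB (1 GB = 1e9 bytes), excluding activations."""
    return n_params * bytes_per_param / 1e9


def frozen_4bit_weights_gb(n_params: float) -> float:
    """Storage for frozen 4-bit weights only (0.5 bytes per parameter)."""
    return n_params * 0.5 / 1e9


if __name__ == "__main__":
    n = 65e9  # parameter count of a 65B model
    print(f"16-bit full finetuning: ~{full_finetune_gb(n):.0f} GB")
    print(f"4-bit frozen base weights: ~{frozen_4bit_weights_gb(n):.1f} GB")
```

Under these assumptions, the frozen 4-bit weights alone take a small fraction of the full-finetuning budget, which leaves the remaining room on a 48 GB card for adapters, quantization constants, and activations.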

Main Contribution

QLoRA: backpropagate through a frozen 4-bit quantized base model into Low-Rank Adapters (LoRA) so only adapters need full gradients

NF4: a 4-bit NormalFloat data type optimized for normally distributed weights

Double Quantization: quantize quantization constants to reduce memory for quantization metadata

Paged Optimizers: use unified memory paging to avoid optimizer state OOM spikes

Large-scale study and open release of Guanaco models and code, with human and GPT-4 evaluations
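
The idea behind a quantile-based 4-bit data type can be sketched in a few lines. This is a simplified stand-in, not the NF4 codebook itself: it places 16 levels at evenly spaced quantiles of a standard normal and rescales them to [-1, 1], whereas the real NF4 construction differs in detail (for example, it represents zero exactly). The function names and the block size of 64 are assumptions for illustration.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].

    Simplified illustration only; this is NOT the exact NF4 codebook.
    """
    k = 2 ** bits
    nd = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at p=0 and p=1.
    qs = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantize_block(weights, codebook):
    """Absmax-scale one block to [-1, 1], then snap each value to the nearest code."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    idx = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
        for w in weights
    ]
    return idx, absmax


def dequantize_block(idx, absmax, codebook):
    """Map 4-bit indices back to approximate weights."""
    return [codebook[i] * absmax for i in idx]


if __name__ == "__main__":
    import random

    random.seed(0)
    cb = normal_quantile_codebook()
    block = [random.gauss(0.0, 0.02) for _ in range(64)]  # hypothetical weight block
    idx, absmax = quantize_block(block, cb)
    restored = dequantize_block(idx, absmax, cb)
    err = max(abs(a - b) for a, b in zip(block, restored))
    print(f"codes used: {len(set(idx))}, max abs error: {err:.5f}")
```

Each block stores 64 four-bit indices plus one scaling constant (`absmax`). Those per-block constants are the metadata that Double Quantization compresses further.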

Key Findings

QLoRA reduces the memory needed to finetune a 65B model from more than 780 GB to under 48 GB

Numbers: >780 GB -> <48 GB

Guanaco 65B reaches near ChatGPT quality on the Vicuna benchmark

Numbers: 99.3% of ChatGPT performance on Vicuna (Table 6)

NF4 with Double Quantization gives measurably better language-model quality than other 4-bit formats

Numbers: Mean perplexity on Pile Common Crawl: Int4 = 34.34, Float4 (E3M0) = 29.48, NFloat4 + DQ = 27.41 (Table 2)

4-bit QLoRA with NF4 matches 16-bit full finetuning and 16-bit LoRA across benchmarks

Numbers: Mean 5-shot MMLU accuracy: 53.1 (NFloat4 + DQ) vs 53.0 (BFloat16) (Table 4)

High-quality small datasets can beat much larger but lower-quality datasets for instruction finetuning

Numbers: OASST1 (≈9k examples) outperformed subsampled larger datasets like FLAN v2 on chatbot metrics (Sections 1 and 5)

Results

GPU memory needed to finetune 65B model

Value: <48 GB

Baseline: >780 GB (16-bit full finetuning)

Vicuna score relative to ChatGPT (GPT-3.5) evaluated by GPT-4

Value: 99.3%

Baseline: ChatGPT (100%)

Elo rating (tournament judged by humans/GPT-4)

Value: 1022 ± 1 (Guanaco 65B)

Baseline: 1348 ± 1 (GPT-4)

Mean perplexity (Pile Common Crawl) by data type

Value: 27.41 (NFloat4 + DQ)

Baseline: 34.34 (Int4) and 29.48 (Float4 E3M0)

Mean 5-shot MMLU accuracy

Value: 53.1 (NFloat4 + DQ)

Baseline: 53.0 (BFloat16)

What To Try In 7 Days

Run QLoRA finetuning of a 7B LLaMA model on your instruction dataset using NF4 + Double Quantization and LoRA adapters

Integrate the bitsandbytes 4-bit kernels and compare NF4 against FP4 quantization on a small validation set

Set up GPT-4-based pairwise evaluation and an Elo tournament to cheaply compare finetuned models
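
The Elo tournament step can be prototyped before any judge is wired in. The sketch below uses the standard Elo update; the starting rating of 1000, K = 32, and the averaging over shuffled match orders are assumptions for illustration, and the `(model_a, model_b, score_a)` tuples are hypothetical stand-ins for verdicts from a GPT-4 or human judge.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(matches, k=32.0, start=1000.0, seed=0):
    """Compute ratings from one random ordering of the matches.

    matches: list of (model_a, model_b, score_a), where score_a is
    1 for a win by A, 0.5 for a tie, and 0 for a loss.
    """
    order = list(matches)
    random.Random(seed).shuffle(order)  # Elo results depend on match order
    ratings = {}
    for a, b, score_a in order:
        r_a = ratings.setdefault(a, start)
        r_b = ratings.setdefault(b, start)
        e_a = expected_score(r_a, r_b)
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return ratings


def mean_elo(matches, n_orderings=100, **kwargs):
    """Average ratings over several random match orderings."""
    totals = {}
    for seed in range(n_orderings):
        for model, rating in elo_ratings(matches, seed=seed, **kwargs).items():
            totals[model] = totals.get(model, 0.0) + rating
    return {model: total / n_orderings for model, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical verdicts: "tuned" beats "base" three times and ties once.
    verdicts = [("tuned", "base", 1), ("base", "tuned", 0),
                ("tuned", "base", 1), ("tuned", "base", 0.5)]
    for model, rating in sorted(mean_elo(verdicts).items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.1f}")
```

Because automated judges show order bias, one reasonable design choice is to submit each pair to the judge in both orders and record both verdicts as separate matches.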

Optimization Features

Token Efficiency

  • unchanged

Infra Optimization

  • Single-GPU finetuning for 33B on 24GB, 65B on 48GB

Model Optimization

  • 4-bit quantization
  • NF4
  • Double Quantization
  • LoRA
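
The memory effect of Double Quantization can be checked with simple arithmetic. The settings below (64 weights per first-level block, 32-bit constants re-quantized to 8 bits, 256 constants per second-level block) are taken to be the paper's configuration; this summary does not state them, so treat them as assumptions.

```python
def constant_overhead_bits(block=64, const_bits=32):
    """Bits of quantization-constant storage per weight, without Double Quantization."""
    return const_bits / block


def double_quant_overhead_bits(block=64, block2=256, q_bits=8, const_bits=32):
    """Per-weight overhead when the first-level constants are themselves quantized.

    Each weight block keeps one q_bits constant; every block2 of those
    constants share one const_bits second-level constant.
    """
    return q_bits / block + const_bits / (block * block2)


if __name__ == "__main__":
    before = constant_overhead_bits()
    after = double_quant_overhead_bits()
    saved_gb = (before - after) * 65e9 / 8 / 1e9  # bits -> bytes -> GB for a 65B model
    print(f"{before:.3f} -> {after:.3f} bits/param, ~{saved_gb:.1f} GB saved at 65B")
```

The per-weight saving looks small, but multiplied by tens of billions of parameters it can decide whether a model fits on a single card.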

System Optimization

  • NVIDIA unified memory paging
  • dequantize-to-bf16 for computation

Training Optimization

  • Paged Optimizers
  • Adapter-only gradients
  • Group-by-length batching
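
Adapter-only gradients follow from how a LoRA layer is composed: the frozen base weight contributes to the forward pass, and only the two small adapter matrices are trainable. A minimal pure-Python sketch, assuming the usual LoRA form y = W x + (alpha / r) · B A x; the class name, rank, and alpha are hypothetical, and a plain float matrix stands in for the dequantized 4-bit weights.

```python
import random


def matvec(m, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""

    def __init__(self, w_frozen, r=4, alpha=8.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(w_frozen), len(w_frozen[0])
        self.w = w_frozen  # stands in for dequantized 4-bit weights; never updated
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.b = [[0.0] * r for _ in range(d_out)]  # zero init: adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.w, x)
        update = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * u for y, u in zip(base, update)]

    def trainable_params(self):
        """Count of adapter parameters (A is r x d_in, B is d_out x r)."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


if __name__ == "__main__":
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
    layer = LoRALinear(w, r=2)
    print(layer.forward([1.0, 2.0, 3.0]))  # equals W x while B is all zeros
    print(layer.trainable_params())
```

For realistic layer sizes the adapter parameter count r · (d_in + d_out) is far smaller than d_in · d_out, so gradient and optimizer-state memory shrink accordingly.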

Inference Optimization

  • 4-bit inference quantization

Reproducibility

Data Urls

  • OASST1 (https://github.com/LAION-AI/Open-Instruction-Generalist or referenced OpenAssistant repo)
  • FLAN v2 (referenced)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • The authors did not exhaustively establish that QLoRA matches full 16-bit finetuning at 33B/65B across all tasks, due to resource limits
  • Evaluation relies heavily on Vicuna and OA benchmarks; results may not generalize to other benchmarks (BigBench, RAFT, HELM)
  • Responsible-AI checks are limited; bias evaluation is partial (CrowS only) and behavior under adversarial prompts needs more study
  • Paged optimizer runtime impacts are not fully characterized across all batch/sequence settings

When Not To Use

  • When your research requires updating all model parameters at native 16-bit precision rather than training adapters
  • If you need formal guarantees about safety or bias beyond the limited evaluations reported
  • If your infrastructure cannot support unified memory paging or BF16 computation

Failure Modes

  • Models still hallucinate or give confident but incorrect factual answers (observed in qualitative examples)
  • Mathematical reasoning can fail on some problems and provide incorrect steps
  • Adapters sometimes cause inconsistent refusals or leak 'secret' tokens under adversarial prompts
  • Automated evaluation (GPT-4) shows order bias and imperfect agreement with humans

Core Entities

Models

  • LoRA
  • Guanaco
  • LLaMA

Metrics

  • Elo
  • Accuracy
  • Perplexity
  • RougeL

Datasets

  • OASST1
  • Alpaca
  • FLAN v2
  • HH-RLHF
  • Self-Instruct
  • Unnatural Instructions
  • Chip2
  • Longform

Benchmarks

  • Vicuna
  • MMLU
  • OA

Context Entities

Models

  • GPT-4
  • ChatGPT
  • Vicuna
  • Alpaca
  • Open Assistant
  • Bard

Metrics

  • Elo
  • RougeL
  • Fleiss κ
  • Kendall Tau

Datasets

  • FLAN v2
  • GLUE
  • Super-NaturalInstructions

Benchmarks

  • MMLU
  • Vicuna