Train one quantized LoRA that supports many ranks and fine-tunes Falcon-40b on a single 32GB GPU

February 16, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Hossein Rajabzadeh, Mojtaba Valipour, Tianshu Zhu, Marzieh Tahaei, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

Links

Abstract / PDF

Why It Matters For Business

QDyLoRA cuts hardware and iteration cost by producing adapters for many ranks in one quantized fine-tune, letting teams tune large models on smaller GPUs and pick low-rank deployments without retraining.

Summary TLDR

QDyLoRA combines 4-bit quantization (NF4 + double quantization) with Dynamic LoRA (rank-dynamic adapters) so a single fine-tune produces adapters usable at many LoRA ranks. The method lets you train large models (e.g., Falcon-40b) on a single 32GB V100 GPU and often finds a much lower optimal rank that matches or beats fixed-rank QLoRA on the evaluated benchmarks (MMLU, Web-GLM, GSM8k, TriviaQA). Main trade-offs: quantized training still lags full-precision tuning, and a limited training budget biases updates toward lower ranks.
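
The rank-dynamic idea can be illustrated with a minimal sketch, assuming a DyLoRA-style scheme in which a rank is sampled at each training step and only the first components of the adapter matrices are used. Everything here is a hypothetical illustration, not the authors' code: the helper names (`matvec`, `lora_delta`, `forward`, `sample_rank`), the uniform sampling, and the plain `scale` factor are assumptions, and the frozen base weight `W` is kept in ordinary floats rather than 4-bit storage.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_delta(A, B, x, rank, scale=1.0):
    """Adapter output using only the first `rank` components.

    A has shape (max_rank, d_in); B has shape (d_out, max_rank).
    """
    h = matvec(A[:rank], x)                       # down-projection, length `rank`
    out = matvec([row[:rank] for row in B], h)    # up-projection, length d_out
    return [scale * o for o in out]

def forward(W, A, B, x, rank, scale=1.0):
    """Frozen base output plus the rank-truncated adapter output."""
    base = matvec(W, x)
    delta = lora_delta(A, B, x, rank, scale)
    return [b + d for b, d in zip(base, delta)]

def sample_rank(max_rank, rng):
    """Assumed sampling rule: draw a rank uniformly from 1..max_rank."""
    return rng.randint(1, max_rank)
```

Under these assumptions, the same trained `A` and `B` can be truncated to any rank up to the maximum at inference time, which is what lets one run serve many ranks.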

Problem Statement

Fine-tuning LLMs needs a lot of GPU memory. QLoRA reduces memory via 4-bit quantization but requires a fixed LoRA rank, so searching over ranks means retraining many times. Practitioners need a single fine-tune that: (1) fits limited GPU memory, (2) covers multiple LoRA ranks, and (3) finds an effective rank without expensive retraining.

Main Contribution

QDyLoRA: combine Dynamic LoRA (multi-rank adapters) with QLoRA-style 4-bit double quantization so one fine-tune produces adapters usable at ranks 1–64.

Show that a single QDyLoRA run can fine-tune Falcon-40b on one 32GB V100 GPU and then be evaluated across ranks without extra training.

Empirical comparisons across MMLU, Web-GLM, GSM8k, and TriviaQA show QDyLoRA matches or outperforms QLoRA at many ranks, especially lower ranks.

Key Findings

A single QDyLoRA fine-tune produces adapters usable at ranks 1–64 and fits Falcon-40b on one 32GB V100 GPU.

Numbers: Fine-tuned Falcon-40b for ranks 1–64 on a single 32GB V100 (reported in text).

QDyLoRA often finds a lower rank with equal-or-better task accuracy than QLoRA.

Numbers: Falcon-40b on FLAN-v2: QLoRA 58.3 vs QDyLoRA 60.2 (absolute +1.9 points, Table 1).

QDyLoRA gives large gains at low ranks on some benchmarks compared to QLoRA.

Numbers: Web-GLM rank=1: QLoRA 19.9 → QDyLoRA 43.3 (+23.4); GSM8k rank=1: QLoRA 8.9 → QDyLoRA 21.4 (+12.5) (Table 2).

DyLoRA without quantization can OOM on larger models, but QDyLoRA avoids OOM.

Numbers: DyLoRA shows OOM with LLaMA2-13b while QDyLoRA runs and reports results (Table 3).

Results

Accuracy

Value: 60.2 (QDyLoRA)

Baseline: 58.3 (QLoRA)

Web-GLM score (Falcon-40b)

Value: 43.3 at rank=1 (QDyLoRA)

Baseline: 19.9 at rank=1 (QLoRA)

GSM8k exact match (Falcon-40b)

Value: 30.6 at rank=8 (QDyLoRA)

Baseline: 15.1 at rank=8 (QLoRA)

Ability to avoid OOM

Value: QDyLoRA runs when DyLoRA OOMs

Baseline: DyLoRA OOM on LLaMA2-13b

Who Should Care

What To Try In 7 Days

Run one QDyLoRA fine-tune of your target model to produce adapters across ranks, then pick the best rank by validation.

If you have a 32GB GPU, try fine-tuning Falcon-40b with QDyLoRA instead of retraining multiple fixed-rank LoRAs.

Compare low-rank (e.g., 1–8) inference quality vs latency to find cheaper deployment points.
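
Choosing a rank after a single multi-rank run can be sketched as follows. This assumes you have already evaluated the adapter at each candidate rank on a validation set; the function name `pick_rank`, the `tolerance` parameter, and the scores in `example_scores` are hypothetical and are not results from the paper.

```python
def pick_rank(scores, tolerance=0.0):
    """Return the smallest rank whose score is within `tolerance` of the best.

    `scores` maps rank -> validation score (higher is better).
    """
    best = max(scores.values())
    eligible = [rank for rank, score in scores.items() if score >= best - tolerance]
    return min(eligible)

# Made-up validation scores for illustration only.
example_scores = {1: 41.0, 2: 42.5, 4: 43.0, 8: 43.2, 16: 43.1, 32: 42.9, 64: 42.8}

strict_choice = pick_rank(example_scores)                  # best-scoring rank
cheap_choice = pick_rank(example_scores, tolerance=0.5)    # trade a little quality for a lower rank
```

A non-zero `tolerance` is one simple way to prefer a cheaper, lower-rank deployment when the quality difference is small.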

Optimization Features

Infra Optimization

  • enables fine-tuning Falcon-40b on one 32GB V100 GPU

Model Optimization

  • 4-bit NF4 quantization
  • double quantization
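
A rough back-of-the-envelope sketch shows why 4-bit storage with double quantization makes a 40B-parameter model plausible on a 32GB GPU. It assumes the block sizes described for QLoRA (64 weights per block, 8-bit block constants, and one 32-bit constant per 256 blocks) and a round parameter count of 40e9. It counts base weights only, not activations, adapter weights, or optimizer state, and the function names are hypothetical.

```python
def bits_per_param(block=64, double_quant=True, block2=256):
    """Storage cost per weight: 4 bits plus the quantization-constant overhead."""
    if double_quant:
        # 8-bit constant per block, plus a 32-bit constant per `block2` blocks.
        overhead = 8 / block + 32 / (block * block2)
    else:
        # One 32-bit constant per block.
        overhead = 32 / block
    return 4 + overhead

def weight_gib(n_params, bits):
    """Convert a parameter count and bits-per-parameter into GiB."""
    return n_params * bits / 8 / 2**30

n_params = 40e9  # assumed round figure for a 40B model
with_dq = weight_gib(n_params, bits_per_param(double_quant=True))
without_dq = weight_gib(n_params, bits_per_param(double_quant=False))
half_precision = weight_gib(n_params, 16)
```

Under these assumptions the base weights take roughly 19 GiB with double quantization, roughly 21 GiB without it, and roughly 75 GiB in 16-bit precision.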

System Optimization

  • uses paged optimizers to fit large models on smaller GPUs

Training Optimization

  • LoRA
  • single-run coverage of ranks 1–64

Inference Optimization

  • LoRA
  • dequantize only needed chunks to compute forward
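
Dequantizing only the chunk needed for the current computation can be illustrated with a simplified sketch. It uses uniform signed 4-bit codes with one absmax scale per block as a stand-in for NF4 (NF4 uses non-uniform levels), and all function names are hypothetical.

```python
def quantize_block(values):
    """Map floats to integer codes in [-7, 7] plus one absmax scale for the block."""
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero for all-zero blocks
    codes = [round(v / absmax * 7) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Recover approximate floats from the codes and the block scale."""
    return [c / 7 * absmax for c in codes]

def quantized_matvec(q_rows, x):
    """Multiply by a quantized matrix, dequantizing one row block at a time."""
    out = []
    for codes, absmax in q_rows:
        row = dequantize_block(codes, absmax)     # only this block is held in float
        out.append(sum(w * xi for w, xi in zip(row, x)))
    return out
```

In this sketch the full matrix never exists in floating point at once; each block is expanded, used, and discarded.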

Reproducibility

Data Urls

  • MMLU, GSM8k, Web-GLM, TriviaQA (public benchmarks cited in paper)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Quantized 4-bit fine-tuning does not reach full-precision performance (authors note).
  • Limited training budget biases updates toward lower ranks (authors explain semi-sorted behavior).
  • Paper does not publish code; reproducing exact setup depends on matching QLoRA hyperparameters and paged optimizers.
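
The bias toward lower ranks can be made concrete with a small sketch. It assumes, hypothetically, that each training step samples a rank uniformly from 1 to the maximum rank and updates every adapter component up to the sampled rank; the paper's exact sampling and update rule may differ.

```python
from fractions import Fraction

def update_probability(component, max_rank):
    """Chance that a given adapter component is touched in one step.

    Component i is updated whenever the sampled rank is at least i.
    """
    return Fraction(max_rank - component + 1, max_rank)

first = update_probability(1, 64)    # updated on every step
middle = update_probability(33, 64)  # updated on half of the steps
last = update_probability(64, 64)    # updated only when rank 64 is sampled
```

With few training steps, the highest components see very few updates under this assumption, which is consistent with low ranks being better tuned.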

When Not To Use

  • When you need top-tier full-precision accuracy and cannot accept quantization loss.
  • If you can afford to train many fixed-rank models and prefer separate tuned models per rank.

Failure Modes

  • Performance gap vs full-precision fine-tuning on some tasks.
  • With a larger training budget, where higher ranks become worthwhile, QDyLoRA may need to be reconfigured so that the high ranks are not under-tuned.
  • Implementation mismatch in quantization or paged optimizer may cause OOM or degraded results.

Core Entities

Models

  • LLaMA-7b
  • LLaMA-13b
  • LLaMA2-13b
  • Falcon-40b

Metrics

  • Accuracy
  • exact match (GSM8k, TriviaQA)
  • BLEU (Web-GLM)

Datasets

  • MMLU
  • Alpaca
  • OASST1
  • Self-Instruct
  • FLAN-v2
  • Web-GLM
  • GSM8k
  • TriviaQA

Benchmarks

  • MMLU
  • Web-GLM
  • GSM8k
  • TriviaQA