Split each weight matrix into a fixed quantized part plus a trainable low-rank part to finetune LLMs with sub-3-bit storage.

November 20, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

Links

Abstract / PDF

Why It Matters For Business

LQ-LoRA cuts model storage and finetuning memory, letting teams run or adapt multi-billion-parameter models on fewer or cheaper GPUs and at lower hosting cost.

Summary TLDR

LQ-LoRA decomposes each pretrained weight matrix into a fixed, heavily quantized component and a small trainable low-rank component. The paper adds two practical pieces: (1) an ILP-based mixed-precision allocator that picks per-matrix quantization settings under a bit budget, and (2) an optional Fisher-weighted decomposition that uses calibration data. Empirically, on RoBERTa and LLaMA-2 (7B, 70B), LQ-LoRA outperforms QLoRA and GPTQ-LoRA at similar bit budgets and can compress LLaMA-2-70B to ~2.85 effective bits with modest task degradation, which makes single-GPU finetuning feasible.

Problem Statement

Large pretrained LMs are costly to finetune and to store. Existing combos of LoRA plus post-training quantization degrade when quantization error is large. The paper asks: can we factor each weight into a quantized part plus a trainable low-rank part, and choose per-layer quantization to hit a memory budget while preserving downstream performance?
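One hedged way to write this question down, using the W ≈ Q + L1 L2 notation from this summary. The Frobenius norm, the set \mathcal{Q}_c of matrices representable under a quantization config c, and the symbols err_i, bits_i, and B are illustrative notation and may not match the paper's.

```latex
% Per-matrix decomposition for a fixed quantization config c and rank r
\min_{Q,\,L_1,\,L_2} \; \lVert W - (Q + L_1 L_2) \rVert_F^2
\quad \text{s.t.} \quad Q \in \mathcal{Q}_c,\;\; L_1 \in \mathbb{R}^{d \times r},\;\; L_2 \in \mathbb{R}^{r \times k}

% Budgeted choice of one config c_i per matrix W_i, i = 1..N
\min_{c_1,\dots,c_N} \; \sum_{i=1}^{N} \mathrm{err}_i(c_i)
\quad \text{s.t.} \quad \sum_{i=1}^{N} \mathrm{bits}_i(c_i) \le B
```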

Main Contribution

A simple iterative decomposition that writes W ≈ Q + L1 L2 where Q is quantized and fixed and L1,L2 are low-rank and finetuned.

An integer linear program (ILP) that picks mixed quantization configs per matrix to meet an overall bits/parameter target.

A data-aware variant using a diagonal Fisher approximation to weight the decomposition for better initialization.

Empirical results on RoBERTa and LLaMA-2 (7B, 70B) showing gains over QLoRA and GPTQ-LoRA and enabling aggressive sub-3-bit compression.
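The iterative decomposition in the first contribution can be sketched as alternating between a truncated SVD of the residual W - Q and re-quantizing W - L1 L2. This is a minimal NumPy sketch under stated assumptions: `quantize_uniform` is a simple per-block uniform quantizer standing in for the NormalFloat scheme, and the function names, block size, and iteration count are hypothetical. Because the procedure has no convergence guarantee, the sketch keeps the best iterate seen.

```python
import numpy as np

def quantize_uniform(x, bits=3, block=16):
    """Stand-in quantizer (assumption, not NormalFloat): per-block absmax
    scaling onto a symmetric integer grid, returned in dequantized form.
    Requires x.size to be divisible by `block`."""
    flat = x.reshape(-1, block)
    levels = 2 ** (bits - 1) - 1                 # 3 bits -> grid {-3, ..., 3}
    scale = np.abs(flat).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(flat / scale), -levels, levels)
    return (q * scale).reshape(x.shape)

def lq_decompose(W, rank=8, bits=3, iters=10):
    """Alternate low-rank fitting and quantization so that W ~ Q + L1 @ L2.
    Returns (error, Q, L1, L2) for the best iterate."""
    Q = np.zeros_like(W)
    best = (np.inf, None, None, None)
    for _ in range(iters):
        # Low-rank step: best rank-r fit to what Q does not explain.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root
        L2 = root[:, None] * Vt[:rank]
        # Quantization step: quantize what the low-rank part does not explain.
        Q = quantize_uniform(W - L1 @ L2, bits)
        err = np.linalg.norm(W - (Q + L1 @ L2))
        if err < best[0]:
            best = (err, Q, L1, L2)
    return best

# Toy matrix with strong low-rank structure plus small noise.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64)) + 0.1 * rng.normal(size=(64, 64))
err, Q, L1, L2 = lq_decompose(W, rank=8, bits=3, iters=5)
plain_err = np.linalg.norm(W - quantize_uniform(W, bits=3))
print(f"quantize only: {plain_err:.3f}   quantized + low-rank: {err:.3f}")
```

On this toy matrix the low-rank factors absorb the large-magnitude structure, so the quantizer only has to cover a small residual. During finetuning, Q would stay fixed and only L1 and L2 would be updated.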

Key Findings

LQ-LoRA matches or improves on QLoRA/GPTQ-LoRA at similar average bits/param.

Numbers: e.g., 2.75-bit LQ-LoRA (2.85 effective bits for 70B) gives C4 PPL 6.35 vs. 6.50 uncompressed on LLaMA-2-70B

LQ-LoRA enables aggressive compression to sub-3 bits with modest degradation.

Numbers: 2.75-bit config → 2.85 effective bits (70B); model fits in ~27GB GPU memory

Fisher-weighted decomposition improves initialization and downstream results, especially for the 7B model.

Numbers: Fisher LQ-LoRA (2.75 bits) reaches C4 PPL 6.35 vs. 6.42 unweighted in the 7B/70B comparisons; gains are larger at 7B

The ILP finds non-uniform per-matrix bit allocations that reduce reconstruction error versus uniform quantization.

Numbers: Mixed-config space |C|≈35; ILP precomputation parallelized across 4 A100s takes a few hours

Performance degrades rapidly below ~2.5 bits.

Numbers: At ~2.5 bits, GLUE and C4 metrics drop substantially (see the QLoRA ILP rows and low-bit rows in the paper's tables)
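The mixed-precision finding above rests on a budgeted selection problem: pick one quantization config per matrix so that total reconstruction error is minimal while total storage stays under the budget. The sketch below solves a toy instance by exhaustive search, which is a stand-in for an ILP solver and only practical for a handful of matrices. The matrix sizes, candidate bit widths, and error values are made up for illustration.

```python
from itertools import product

def allocate(errors, costs, budget):
    """Pick one config index per matrix minimizing total error under a
    storage budget. errors[i][c] and costs[i][c] are the precomputed
    reconstruction error and storage (in bits) of matrix i under config c.
    Exhaustive search; a real setup would hand this to an ILP solver."""
    n_cfg = len(errors[0])
    best_err, best_plan = float("inf"), None
    for plan in product(range(n_cfg), repeat=len(errors)):
        total_bits = sum(costs[i][c] for i, c in enumerate(plan))
        if total_bits > budget:
            continue                              # violates the budget
        total_err = sum(errors[i][c] for i, c in enumerate(plan))
        if total_err < best_err:
            best_err, best_plan = total_err, plan
    return best_plan, best_err

# Hypothetical toy instance: three matrices, configs = 2, 3, or 4 bits.
sizes = [100, 100, 400]                           # parameters per matrix
bit_options = [2, 3, 4]
costs = [[n * b for b in bit_options] for n in sizes]
errors = [[9.0, 4.0, 1.0],                        # made-up error per config
          [3.0, 2.0, 1.5],
          [20.0, 8.0, 3.0]]
budget = 3 * sum(sizes)                           # 3 bits/param on average

plan, err = allocate(errors, costs, budget)
print([bit_options[c] for c in plan], err)        # [4, 2, 3] 12.0
uniform_err = sum(row[1] for row in errors)       # all matrices at 3 bits
print(uniform_err)                                # 14.0
```

In this toy instance the non-uniform plan spends extra bits on the matrix that is most sensitive and saves them on the one that is least sensitive, reaching lower total error than uniform 3-bit at the same budget. The objective is reconstruction error, so a lower value here does not guarantee better downstream accuracy.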

Results

C4 perplexity (lower better)

Value: 6.35 (LQ-LoRA Fisher, 2.75 bits, 70B)

Baseline: 6.50 (uncompressed 16-bit, 70B)

WikiText-2 perplexity (lower better)

Value: 4.32 (LQ-LoRA Fisher, 2.75 bits, 70B)

Baseline: 3.68 (uncompressed 16-bit, 70B)

Accuracy

Value: 0.67 (LQ-LoRA Fisher, 2.75 bits, 70B)

Baseline: 0.70 (uncompressed 16-bit, 70B)

GLUE average (RoBERTa-Large)

Value: 87.1 (LQ-LoRA, 2.75 bits)

Baseline: 88.5 (Full finetune, 16-bit)

Effective bits per parameter

Value: 2.95 (7B), 2.85 (70B) for 2.75-bit config (includes LoRA storage)

Baseline: 16 bits (uncompressed)
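The gap between the nominal 2.75 bits and the effective figure comes from counting the low-rank factors against the same parameter total. A minimal arithmetic sketch follows; the matrix shapes, the rank of 64, and the 8-bit storage of the low-rank factors are hypothetical inputs chosen for illustration and are not taken from this summary.

```python
def effective_bits(shapes, base_bits, rank, lora_bits):
    """Average bits per original parameter once low-rank factors are counted.
    shapes: list of (d_out, d_in) for each decomposed matrix.
    Each matrix adds rank * (d_out + d_in) low-rank parameters."""
    total_params = sum(d_out * d_in for d_out, d_in in shapes)
    lora_params = sum(rank * (d_out + d_in) for d_out, d_in in shapes)
    return (base_bits * total_params + lora_bits * lora_params) / total_params

# Hypothetical square matrices of two widths, same rank and storage format.
print(effective_bits([(4096, 4096)], base_bits=2.75, rank=64, lora_bits=8))  # 3.0
print(effective_bits([(8192, 8192)], base_bits=2.75, rank=64, lora_bits=8))  # 2.875
```

The overhead per matrix is lora_bits * rank * (d_out + d_in) / (d_out * d_in), which shrinks as matrices get wider. That is consistent with the larger model showing the smaller effective figure above (2.85 vs. 2.95).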

Who Should Care

What To Try In 7 Days

Run the provided LQ-LoRA code on a 7B model with a 3-bit target to verify memory savings and baseline PPL.

Compute a diagonal Fisher on a small calibration set and compare Fisher vs. non-Fisher LQ initializations.

Use the ILP to produce a mixed-precision plan for your target GPU budget and test one downstream task.
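For the second step, a diagonal Fisher estimate is commonly taken as the average of squared per-sample gradients over the calibration set. The sketch below shows that computation on a toy linear model with squared loss, where the gradient has a closed form. With a real language model the gradients would come from backpropagation on calibration text; the model, loss, and function name here are assumptions for illustration.

```python
import numpy as np

def diagonal_fisher(W, X, Y):
    """Empirical diagonal Fisher for the toy loss 0.5 * ||W x - y||^2:
    the mean over samples of the elementwise-squared gradient w.r.t. W."""
    fisher = np.zeros_like(W, dtype=float)
    for x, y in zip(X, Y):
        grad = np.outer(W @ x - y, x)     # d loss / d W for this sample
        fisher += grad ** 2
    return fisher / len(X)

# One calibration sample, small enough to check by hand.
W = np.eye(2)
X = np.array([[1.0, 2.0]])
Y = np.array([[0.0, 0.0]])
print(diagonal_fisher(W, X, Y))
# [[ 1.  4.]
#  [ 4. 16.]]
```

The result has the same shape as W, so each weight gets its own importance score. Larger entries mark weights whose perturbation changes the loss more on the calibration data.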

Optimization Features

Infra Optimization

  • Reduces model storage (bits/param) to fit larger models on fewer GPUs

Model Optimization

  • Low-rank plus quantized decomposition (W ≈ Q + L1L2)
  • Fisher-weighted SVD for data-aware factorization
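A weighted low-rank fit with arbitrary per-element weights has no closed-form SVD solution, so one practical approximation is to summarize the weights by per-row and per-column scales, run an ordinary SVD on the rescaled matrix, and undo the scaling on the factors. The sketch below illustrates that idea under stated assumptions: the function name is hypothetical, the weights F are taken as given and strictly positive, and this is not presented as the authors' exact procedure.

```python
import numpy as np

def weighted_low_rank(A, F, rank):
    """Approximate argmin over rank-r L of sum(F * (A - L)**2) by
    row/column rescaling. F must be elementwise positive."""
    sqrt_f = np.sqrt(F)
    d_row = sqrt_f.mean(axis=1)                   # one scale per row
    d_col = sqrt_f.mean(axis=0)                   # one scale per column
    scaled = d_row[:, None] * A * d_col[None, :]
    U, S, Vt = np.linalg.svd(scaled, full_matrices=False)
    root = np.sqrt(S[:rank])
    # Undo the scaling so that L1 @ L2 approximates A itself.
    L1 = (U[:, :rank] * root) / d_row[:, None]
    L2 = (root[:, None] * Vt[:rank]) / d_col[None, :]
    return L1, L2

# Toy check with weights that factor exactly into row and column parts.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 5))
F = np.outer(rng.uniform(0.5, 2.0, 6), rng.uniform(0.5, 2.0, 5)) ** 2
L1, L2 = weighted_low_rank(A, F, rank=2)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
plain = (U[:, :2] * S[:2]) @ Vt[:2]               # unweighted rank-2 fit
weighted_err = np.sum(F * (A - L1 @ L2) ** 2)
plain_err = np.sum(F * (A - plain) ** 2)
print(weighted_err <= plain_err)                  # True
```

When the weights factor exactly into a row part times a column part, as in this toy check, the rescaled SVD is the exact weighted optimum. For general weights it is only an approximation.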

System Optimization

  • PyTorch-based mixed-quantization implementation (no CUDA extension needed)
  • LoRA

Training Optimization

  • LoRA
  • ILP finds mixed-precision per-matrix configs to meet bit budgets

Inference Optimization

  • Just-in-time dequantization for matmuls via PyTorch dispatch
  • Possible to run 70B forward/backward at sub-3 bits on a single large GPU
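Just-in-time dequantization means the weights live in memory as compact codes plus scales and are expanded to floating point only for the duration of a matmul. The NumPy class below is a simplified stand-in for that idea and does not use PyTorch dispatch. The class name, the per-block uniform quantizer, and the int8 code storage (codes are not bit-packed here) are assumptions for illustration.

```python
import numpy as np

class QuantizedLinear:
    """Holds integer codes and per-block scales; dequantizes on each call."""

    def __init__(self, W, bits=4, block=32):
        levels = 2 ** (bits - 1) - 1
        flat = W.reshape(-1, block)               # W.size must divide by block
        scale = np.abs(flat).max(axis=1, keepdims=True) / levels
        scale[scale == 0] = 1.0
        self.codes = np.round(flat / scale).astype(np.int8)
        self.scale = scale.astype(np.float32)
        self.shape = W.shape

    def dequantize(self):
        # Temporary float copy; discarded after the matmul that uses it.
        return (self.codes * self.scale).reshape(self.shape)

    def __call__(self, x, L1=None, L2=None):
        y = x @ self.dequantize().T               # frozen quantized path
        if L1 is not None and L2 is not None:
            y = y + (x @ L2.T) @ L1.T             # trainable low-rank path
        return y

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 32))
layer = QuantizedLinear(W, bits=4, block=32)
x = rng.normal(size=(3, 32))
L1, L2 = rng.normal(size=(8, 2)), rng.normal(size=(2, 32))
print(layer(x, L1, L2).shape)                     # (3, 8)
```

The low-rank path is applied as two small matmuls, so the full-size product L1 @ L2 is never materialized.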

Reproducibility

Data Urls

  • C4 (public corpus)
  • WikiText-2 (public)

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • The decomposition is heuristic; no convergence guarantees.
  • Precomputing the per-config errors that feed the ILP is costly (hours on multiple A100 GPUs).
  • Performance drops quickly below ~2.5 bits average.
  • Requires that task adaptation is well-captured by low-rank updates (not universal).
  • Fisher-weighted option needs backprop on calibration data, which adds cost.

When Not To Use

  • If your task needs non-low-rank parameter updates or specialized adapters.
  • If you can afford full finetuning and need maximum accuracy.
  • If you must push bits/param well below ~2.5 and cannot accept accuracy loss.

Failure Modes

  • Aggressive quantization yields large perplexity and downstream degradation (e.g., GSM8K, ARC).
  • The ILP optimizes reconstruction error, not the final task metric, so its allocations may not maximize downstream accuracy.
  • Instruction tuning with some baselines (GPTQ-LoRA) was unstable in the authors' experiments.

Core Entities

Models

  • LLaMA-2-7B
  • LLaMA-2-70B
  • RoBERTa-Large
  • LoRA
  • NormalFloat (NF) quantization

Metrics

  • perplexity
  • Accuracy
  • GLUE average score
  • Vicuna pairwise win rate
  • bits/parameter
  • storage (GB)

Datasets

  • C4
  • WikiText-2
  • OpenAssistant
  • GLUE
  • MMLU

Benchmarks

  • MMLU
  • GLUE
  • Vicuna-style pairwise evaluation
  • HuggingFace Open LLM benchmark (ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K)