Split each weight matrix into a fixed quantized part plus a trainable low-rank part to finetune LLMs with sub-3-bit storage.

November 20, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method combines established NormalFloat (NF) quantization and LoRA with a new decomposition and an ILP-based allocator; code and multi-model experiments exist, but the algorithm is heuristic and the ILP requires a precomputation step.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim


Why It Matters For Business

LQ-LoRA cuts model storage and finetuning memory, letting teams run or adapt multi-billion-parameter models on fewer or cheaper GPUs and reducing hosting costs.

Who Should Care

Summary TLDR

LQ-LoRA decomposes each pretrained weight matrix into a fixed, heavily quantized component and a small trainable low-rank component. The paper adds two practical pieces: (1) an ILP-based mixed-precision allocator to pick per-matrix quantization settings under a bit-budget, and (2) an optional Fisher-weighted decomposition using calibration data. Empirically on RoBERTa and LLaMA-2 (7B, 70B) LQ-LoRA outperforms QLoRA/GPTQ-LoRA at similar bit budgets and can compress LLaMA-2-70B to ~2.85 effective bits with modest task degradation and single-GPU finetuning feasibility.

Problem Statement

Large pretrained LMs are costly to finetune and to store. Existing combinations of LoRA with post-training quantization degrade when quantization error is large. The paper asks: can we factor each weight matrix into a quantized part plus a trainable low-rank part, and choose per-matrix quantization settings to hit a memory budget while preserving downstream performance?

Main Contribution

A simple iterative decomposition that writes W ≈ Q + L1 L2 where Q is quantized and fixed and L1,L2 are low-rank and finetuned.

An integer linear program (ILP) that picks mixed quantization configs per matrix to meet an overall bits/parameter target.
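The iterative decomposition can be sketched as an alternation between a truncated SVD of the residual W - Q and a re-quantization of W - L1 L2. The block below is a minimal illustration under stated assumptions, not the authors' implementation: `quantize_uniform` is a hypothetical uniform-grid stand-in for the NF quantizer, and `lq_decompose`, its arguments, and the step count are names chosen for this example. Because the procedure is heuristic, the sketch keeps the best iterate rather than assuming the error decreases at every step.

```python
import numpy as np

def quantize_uniform(A, bits):
    """Stand-in quantizer (assumption, not NF): snap to 2**bits evenly spaced levels."""
    lo, hi = A.min(), A.max()
    if hi == lo:
        return A.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((A - lo) / step) * step

def lq_decompose(W, rank, bits, steps=10):
    """Alternate between a rank-`rank` fit of W - Q and quantizing W - L1 @ L2."""
    Q = np.zeros_like(W)
    best = None
    for _ in range(steps):
        # Low-rank step: truncated SVD of the part Q does not explain.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(S[:rank])
        L1 = U[:, :rank] * root
        L2 = root[:, None] * Vt[:rank]
        # Quantization step: quantize what the low-rank part does not explain.
        Q = quantize_uniform(W - L1 @ L2, bits)
        err = np.linalg.norm(W - (Q + L1 @ L2))
        # No convergence guarantee, so remember the best iterate seen so far.
        if best is None or err < best[0]:
            best = (err, Q, L1, L2)
    return best

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
err, Q, L1, L2 = lq_decompose(W, rank=4, bits=3)
print(f"relative reconstruction error: {err / np.linalg.norm(W):.3f}")
```

In finetuning, Q would stay frozen and only L1 and L2 would receive gradients.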

Key Findings

LQ-LoRA matches or improves on QLoRA/GPTQ-LoRA at similar average bits/param.

Numbers: e.g., 2.75-bit LQ-LoRA (effective 2.85 bits for 70B) gives C4 PPL 6.35 vs uncompressed 6.50 on LLaMA-2-70B

Practical Use: Use LQ-LoRA to squeeze more compression with similar language-model quality versus standard QLoRA at the same memory budget.

Evidence Ref: Table 3, Table 6

LQ-LoRA enables aggressive compression to sub-3 bits with modest degradation.

Numbers: 2.75-bit config → 2.85 effective bits (70B); model fits in ~27GB GPU memory

Practical Use: You can run or finetune a 70B model on a single ~40GB GPU or store it in ~27GB in some settings, cutting infra costs.

Evidence Ref: Abstract, Table 3, Figure 3
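The relation between bits per parameter and storage is simple arithmetic, sketched below. The helper names, the nominal parameter count of 70e9, and the matrix shape, rank, and factor precision in the overhead example are all assumptions for illustration; they are not the paper's settings and are not meant to reproduce its 2.85-bit figure. The estimate counts quantized weights only, so it comes out lower than the ~27GB reported above.

```python
def weight_storage_gb(num_params, bits_per_param):
    """Weights-only storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def low_rank_overhead_bits(rows, cols, rank, factor_bits):
    """Extra bits per parameter of an (rows x cols) matrix from storing
    L1 (rows x rank) and L2 (rank x cols) at `factor_bits` each."""
    return factor_bits * rank * (rows + cols) / (rows * cols)

NOMINAL_70B = 70e9  # assumption: nominal parameter count, not an exact figure

print(f"16-bit weights:   {weight_storage_gb(NOMINAL_70B, 16):.1f} GB")
print(f"2.85-bit weights: {weight_storage_gb(NOMINAL_70B, 2.85):.1f} GB")

# Hypothetical shape, rank, and precision, chosen only to show the formula.
print(f"overhead: {low_rank_overhead_bits(4096, 4096, 64, 16):.2f} bits/param")
```

At a nominal 70e9 parameters this gives 140 GB at 16 bits and about 24.9 GB at 2.85 bits.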

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
C4 perplexity (lower better) | 6.35 (LQ-LoRA Fisher, 2.75 bits, 70B) | 6.50 (uncompressed 16-bit, 70B) | -0.15 | C4 validation | Table 6; Section 4.1 | Table 6
WikiText-2 perplexity (lower better) | 4.32 (LQ-LoRA Fisher, 2.75 bits, 70B) | 3.68 (uncompressed 16-bit, 70B) | +0.64 | WikiText-2 | Table 6; Table 3 | Table 6

What To Try In 7 Days

Run the provided LQ-LoRA code on a 7B model with a 3-bit target to verify memory savings and baseline PPL.

Compute a diagonal Fisher on a small calibration set and compare Fisher vs. non-Fisher LQ initializations.

Use the ILP to produce a mixed-precision plan for your target GPU budget and test one downstream task.
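The mixed-precision plan in the last step amounts to picking one quantization config per matrix so that total reconstruction error is minimized while the average bits/parameter stays within the budget. The sketch below illustrates that selection problem with exhaustive search as a plain stand-in for an ILP solver, which is only viable for a handful of matrices. The function name, the sizes, the config list, and the error table are all made up for this example; in practice the errors would be precomputed per matrix and per config.

```python
from itertools import product

def allocate_bits(sizes, config_bits, errors, budget_bits_per_param):
    """Pick one config index per matrix, minimizing summed error under a bit budget.

    sizes[i]       -- number of parameters in matrix i
    config_bits[c] -- bits/parameter of config c
    errors[i][c]   -- precomputed reconstruction error of matrix i under config c
    """
    total_budget = budget_bits_per_param * sum(sizes)
    best_choice, best_err = None, float("inf")
    # Exhaustive search over all assignments (stand-in for an ILP solver).
    for choice in product(range(len(config_bits)), repeat=len(sizes)):
        used = sum(n * config_bits[c] for n, c in zip(sizes, choice))
        if used > total_budget:
            continue  # violates the bits/parameter budget
        err = sum(errors[i][c] for i, c in enumerate(choice))
        if err < best_err:
            best_choice, best_err = choice, err
    return best_choice, best_err

# Hypothetical toy instance: three matrices, configs at 2, 3, and 4 bits.
sizes = [100, 100, 200]
config_bits = [2, 3, 4]
errors = [
    [9.0, 4.0, 1.0],
    [3.0, 2.0, 1.5],
    [20.0, 8.0, 3.0],
]
choice, total_err = allocate_bits(sizes, config_bits, errors, budget_bits_per_param=3.0)
print([config_bits[c] for c in choice], total_err)
```

In this toy instance the search spends 4 bits on the first matrix and 2 on the second rather than 3 bits everywhere, because the first matrix gains more from the extra bits than the second loses. Note that the objective is reconstruction error, not the downstream task metric.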

Optimization Features

Infra Optimization: Reduces model storage (bits/param) to fit larger models on fewer GPUs
Model Optimization: Low-rank plus quantized decomposition (W ≈ Q + L1L2); Fisher-weighted SVD for data-aware factorization
System Optimization: PyTorch-based mixed-quantization implementation (no CUDA extension needed); LoRA
Training Optimization: LoRA; ILP finds mixed-precision per-matrix configs to meet bit budgets
Inference Optimization: Just-in-time dequantization for matmuls via PyTorch dispatch; Possible to run 70B forward/backward at sub-3 bits on single large GPU
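One way to picture the Fisher-weighted SVD listed under Model Optimization is as a low-rank fit in which errors on more sensitive weights count for more. This summary does not spell out the exact weighting, so the sketch below makes its own assumption: the diagonal Fisher is summarized by row and column means of its square root, W is scaled by them, a truncated SVD is taken, and the scaling is undone on the factors. The function names and the toy data are hypothetical.

```python
import numpy as np

def fisher_scales(fisher):
    """Assumption: summarize a per-weight Fisher matrix by row/column means of its sqrt."""
    root = np.sqrt(fisher)
    return root.mean(axis=1), root.mean(axis=0)

def weighted_low_rank(W, row_scale, col_scale, rank):
    """Rank-`rank` factors minimizing ||diag(row_scale) (W - L1 @ L2) diag(col_scale)||_F."""
    scaled = row_scale[:, None] * W * col_scale[None, :]
    U, S, Vt = np.linalg.svd(scaled, full_matrices=False)
    root = np.sqrt(S[:rank])
    # Undo the scaling so that L1 @ L2 approximates W itself.
    L1 = (U[:, :rank] * root) / row_scale[:, None]
    L2 = (root[:, None] * Vt[:rank]) / col_scale[None, :]
    return L1, L2

def fisher_weighted_error(W, approx, fisher):
    """Squared error with each entry weighted by its Fisher value."""
    return float(np.sum(fisher * (W - approx) ** 2))

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))
# Toy Fisher values with separable row/column structure (made-up data).
fisher = np.outer(rng.uniform(0.5, 2.0, 8), rng.uniform(0.5, 2.0, 6)) ** 2

row_scale, col_scale = fisher_scales(fisher)
L1, L2 = weighted_low_rank(W, row_scale, col_scale, rank=2)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
plain = (U[:, :2] * S[:2]) @ Vt[:2]  # unweighted rank-2 baseline
print(fisher_weighted_error(W, L1 @ L2, fisher), fisher_weighted_error(W, plain, fisher))
```

With uniform Fisher values the scaling cancels and the result coincides with the plain truncated SVD.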

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

C4 (public corpus), WikiText-2 (public)

Risks & Boundaries

Limitations

The decomposition is heuristic; no convergence guarantees.

Precomputing the per-config quantization errors that feed the ILP is costly (hours on multiple A100 GPUs).

When Not To Use

If your task needs non-low-rank parameter updates or specialized adapters.

If you can afford full finetuning and need maximum accuracy.

Failure Modes

Aggressive quantization yields large perplexity increases and downstream degradation (e.g., GSM8K, ARC).

The ILP optimizes reconstruction error, not the final task metric, so its allocations may not maximize downstream accuracy.

Core Entities

Models

LLaMA-2-7B, LLaMA-2-70B, RoBERTa-Large, LoRA, NormalFloat (NF) quantization

Metrics

perplexity, accuracy, GLUE average score, Vicuna pairwise win rate, bits/parameter, storage (GB)

Datasets

C4, WikiText-2, OpenAssistant, GLUE, MMLU

Benchmarks

MMLU; GLUE; Vicuna-style pairwise evaluation; HuggingFace Open LLM benchmark (ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K)