Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
LQ-LoRA cuts model storage and finetuning memory, so teams can run or adapt multi-billion-parameter models on fewer or cheaper GPUs and reduce hosting costs.
Summary TLDR
LQ-LoRA decomposes each pretrained weight matrix into a fixed, heavily quantized component and a small trainable low-rank component. The paper adds two practical pieces: (1) an ILP-based mixed-precision allocator that picks per-matrix quantization settings under a bit budget, and (2) an optional Fisher-weighted decomposition that uses calibration data. On RoBERTa and LLaMA-2 (7B, 70B), LQ-LoRA outperforms QLoRA and GPTQ-LoRA at similar bit budgets. It can compress LLaMA-2-70B to ~2.85 effective bits per parameter with modest task degradation, which makes finetuning on a single large GPU feasible.
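To make the storage claim concrete, here is a back-of-the-envelope sketch in Python. It assumes a nominal 70e9 parameters for the 70B model and counts weight storage only (no activations, optimizer state, or low-rank adapter parameters); `storage_gb` is a hypothetical helper, not part of the released code.

```python
def storage_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

N_70B = 70e9  # nominal parameter count (assumption, not an exact figure)

fp16_gb = storage_gb(N_70B, 16)    # 16-bit baseline
lq_gb = storage_gb(N_70B, 2.85)    # ~2.85 effective bits, as in the summary
print(f"fp16: {fp16_gb:.1f} GB, 2.85-bit: {lq_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 140 GB to about 25 GB, which is the scale of reduction that puts a 70B model within reach of one large accelerator.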
Problem Statement
Large pretrained LMs are costly to finetune and to store. Existing combinations of LoRA with post-training quantization degrade when the quantization error is large. The paper asks: can we factor each weight matrix into a quantized part plus a trainable low-rank part, and choose per-layer quantization to hit a memory budget while preserving downstream performance?
Main Contribution
A simple iterative decomposition that writes W ≈ Q + L1 L2, where Q is quantized and kept fixed and L1, L2 are low-rank factors that are finetuned.
An integer linear program (ILP) that picks mixed quantization configs per matrix to meet an overall bits/parameter target.
A data-aware variant using a diagonal Fisher approximation to weight the decomposition for better initialization.
Empirical results on RoBERTa and LLaMA-2 (7B, 70B) showing gains over QLoRA and GPTQ-LoRA and enabling aggressive sub-3-bit compression.
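The decomposition in the first contribution can be sketched as an alternating loop: quantize what the low-rank part does not explain, then refit the low-rank part to the quantization residual with a truncated SVD. The NumPy sketch below is illustrative only. `fake_quantize` is a hypothetical blockwise uniform quantizer standing in for NormalFloat, and the function names, rank, bit width, and iteration count are assumptions rather than the authors' settings.

```python
import numpy as np

def fake_quantize(w, bits=3, block=16):
    """Blockwise uniform round-to-nearest quantizer (stand-in for NF).

    Assumes w.size is divisible by `block`.
    """
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    levels = 2 ** (bits - 1) - 1          # symmetric integer grid
    q = np.round(flat / scale * levels) / levels * scale
    return q.reshape(w.shape)

def lq_decompose(w, rank=8, bits=3, iters=10):
    """Alternate Q = quantize(W - L1 L2) and (L1, L2) = SVD_r(W - Q)."""
    low_rank = np.zeros_like(w)
    best = None
    for _ in range(iters):
        q = fake_quantize(w - low_rank, bits)
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        l1 = u[:, :rank] * np.sqrt(s[:rank])
        l2 = np.sqrt(s[:rank])[:, None] * vt[:rank]
        low_rank = l1 @ l2
        err = np.linalg.norm(w - q - low_rank)
        # The procedure is heuristic, so keep the best iterate seen so far.
        if best is None or err < best[0]:
            best = (err, q, l1, l2)
    return best

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
err, Q, L1, L2 = lq_decompose(W)
baseline = np.linalg.norm(W - fake_quantize(W))
print(f"quantization only: {baseline:.3f}, quantized + low-rank: {err:.3f}")
```

Because the first iterate already subtracts the best rank-r approximation of the quantization residual, the returned error is never worse than plain quantization in this sketch. During finetuning, Q would stay frozen and only L1 and L2 would receive gradients.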
Key Findings
LQ-LoRA matches or improves on QLoRA/GPTQ-LoRA at similar average bits/param.
LQ-LoRA enables aggressive compression to sub-3 bits with modest degradation.
Fisher-weighted decomposition improves initialization and downstream results, especially for the 7B model.
The ILP finds non-uniform per-matrix bit allocations that reduce reconstruction error versus uniform quantization.
Performance degrades rapidly below ~2.5 bits.
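The benefit of non-uniform allocation can be illustrated on a toy problem. The sketch below replaces the ILP solver with exhaustive search, which is only practical for a handful of matrices, and reduces each "configuration" to a bit width. All parameter counts and error values are made-up illustrative numbers, and `allocate` is a hypothetical helper.

```python
from itertools import product

# Hypothetical per-matrix data: parameter count and reconstruction error
# at each candidate bit width. Every number here is illustrative.
matrices = [
    {"params": 4_000_000, "error": {2: 9.0, 3: 4.0, 4: 1.5}},
    {"params": 4_000_000, "error": {2: 3.0, 3: 2.5, 4: 2.2}},
    {"params": 16_000_000, "error": {2: 6.0, 3: 2.0, 4: 1.0}},
]

def allocate(matrices, budget_bits_per_param):
    """Pick one bit width per matrix to minimise total error under a budget."""
    total_params = sum(m["params"] for m in matrices)
    budget = budget_bits_per_param * total_params
    best = None
    for choice in product(*(m["error"].keys() for m in matrices)):
        bits_used = sum(b * m["params"] for b, m in zip(choice, matrices))
        if bits_used > budget:
            continue  # violates the overall bits/parameter target
        err = sum(m["error"][b] for b, m in zip(choice, matrices))
        if best is None or err < best[0]:
            best = (err, choice)
    return best

mixed_err, mixed_plan = allocate(matrices, 3.0)
uniform_err = sum(m["error"][3] for m in matrices)
print(mixed_plan, mixed_err, uniform_err)
```

With these numbers the search spends extra bits on the first matrix, which is sensitive to quantization, and saves them on the second, which is not, giving a lower total error than uniform 3-bit at the same budget. An ILP expresses the same objective and constraint but scales to the full set of matrices and configurations.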
Results
C4 perplexity (lower is better)
WikiText-2 perplexity (lower is better)
Accuracy
GLUE average (RoBERTa-Large)
Effective bits per parameter
Who Should Care
What To Try In 7 Days
Run the provided LQ-LoRA code on a 7B model with a 3-bit target to verify memory savings and baseline PPL.
Compute a diagonal Fisher on a small calibration set and compare Fisher vs. non-Fisher LQ initializations.
Use the ILP to produce a mixed-precision plan for your target GPU budget and test one downstream task.
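For the Fisher comparison in the second step, the NumPy sketch below shows one way to set it up. It assumes per-sample gradients are already available (random stand-ins here), estimates the diagonal Fisher as their mean square, and then approximates the weights by a separable row-times-column form so that the weighted low-rank problem reduces to an ordinary SVD of a rescaled matrix. The separable approximation and all function names are assumptions of this sketch, not a statement of the released implementation.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Mean of squared per-sample gradients, one entry per weight."""
    return np.mean(np.square(per_sample_grads), axis=0)

def weighted_low_rank(a, row_w, col_w, rank):
    """Minimise sum_ij row_w[i] * col_w[j] * (a - l1 @ l2)[i, j] ** 2."""
    dr, dc = np.sqrt(row_w), np.sqrt(col_w)
    # Rescale, take a truncated SVD, then undo the scaling on each factor.
    u, s, vt = np.linalg.svd(dr[:, None] * a * dc[None, :], full_matrices=False)
    l1 = (u[:, :rank] * s[:rank]) / dr[:, None]
    l2 = vt[:rank] / dc[None, :]
    return l1, l2

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 24))  # stands in for the residual W - Q
# Hypothetical per-sample gradients with row-dependent magnitude.
grads = rng.standard_normal((100, 32, 24)) * rng.uniform(0.1, 3.0, size=(32, 1))
F = diagonal_fisher(grads)
row_w, col_w = F.mean(axis=1), F.mean(axis=0)  # separable approximation (assumption)

def weighted_err(l1, l2):
    return np.sum(np.outer(row_w, col_w) * (A - l1 @ l2) ** 2)

l1, l2 = weighted_low_rank(A, row_w, col_w, rank=4)
u, s, vt = np.linalg.svd(A, full_matrices=False)
plain = weighted_err(u[:, :4] * s[:4], vt[:4])
fisher = weighted_err(l1, l2)
print(f"plain SVD: {plain:.3f}, weighted SVD: {fisher:.3f}")
```

Under the separable weights, the weighted factorization is optimal for the weighted error, so it can only match or beat the unweighted SVD on that metric. Whether this carries over to downstream perplexity is what the comparison is meant to test.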
Optimization Features
Infra Optimization
- Reduces model storage (bits/param) to fit larger models on fewer GPUs
Model Optimization
- Low-rank plus quantized decomposition (W ≈ Q + L1 L2)
- Fisher-weighted SVD for data-aware factorization
System Optimization
- PyTorch-based mixed-quantization implementation (no CUDA extension needed)
- LoRA
Training Optimization
- LoRA
- ILP finds mixed-precision per-matrix configs to meet bit budgets
Inference Optimization
- Just-in-time dequantization for matmuls via PyTorch dispatch
- Possible to run 70B forward/backward passes at sub-3 bits on a single large GPU
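The idea behind just-in-time dequantization is that only compact integer codes and scales stay resident, and full-precision weights are rebuilt transiently for each matmul. The sketch below shows that pattern with a plain NumPy class rather than the PyTorch dispatch mechanism mentioned above. `QuantizedLinear` is a hypothetical name, the per-row symmetric quantizer is an assumption, and the codes are held in `int8` for simplicity instead of being bit-packed.

```python
import numpy as np

class QuantizedLinear:
    """Stores integer codes plus per-row scales; dequantizes on each call."""

    def __init__(self, w, bits=4):
        levels = 2 ** (bits - 1) - 1
        self.scale = np.abs(w).max(axis=1, keepdims=True) / levels
        self.codes = np.round(w / self.scale).astype(np.int8)

    def dequantize(self):
        return self.codes.astype(np.float32) * self.scale

    def __call__(self, x):
        # Full-precision weights exist only for the duration of this matmul.
        return x @ self.dequantize().T

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32)).astype(np.float32)
layer = QuantizedLinear(W)
x = rng.standard_normal((4, 32)).astype(np.float32)
y = layer(x)
print(y.shape, layer.codes.dtype)
```

A real implementation would pack several codes per byte and add the low-rank path (x multiplied by the L2 and L1 factors) to the output. The trade-off is extra compute per matmul in exchange for a much smaller resident footprint.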
Reproducibility
Code Urls
Data Urls
- C4 (public corpus)
- WikiText-2 (public)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The decomposition is heuristic; no convergence guarantees.
- Precomputing the reconstruction errors that the ILP needs for every candidate config is costly (hours on multiple A100 GPUs).
- Performance drops quickly below ~2.5 bits average.
- Assumes task adaptation is well captured by low-rank updates, which does not hold for every task.
- Fisher-weighted option needs backprop on calibration data, which adds cost.
When Not To Use
- If your task needs non-low-rank parameter updates or specialized adapters.
- If you can afford full finetuning and need maximum accuracy.
- If you must push bits/param well below ~2.5 and cannot accept accuracy loss.
Failure Modes
- Aggressive quantization causes large perplexity increases and downstream degradation (e.g., GSM8K, ARC).
- The ILP optimizes reconstruction error, not the final task metric, so its allocations may not maximize downstream accuracy.
- Instruction tuning with some baselines (GPTQ-LoRA) was unstable in the authors' experiments.
Core Entities
Models
- LLaMA-2-7B
- LLaMA-2-70B
- RoBERTa-Large
- LoRA
- NormalFloat (NF) quantization
Metrics
- perplexity
- Accuracy
- GLUE average score
- Vicuna pairwise win rate
- bits/parameter
- storage (GB)
Datasets
- C4
- WikiText-2
- OpenAssistant
- GLUE
- MMLU
Benchmarks
- MMLU
- GLUE
- Vicuna-style pairwise evaluation
- HuggingFace Open LLM benchmark (ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K)

