Split each weight matrix into a fixed quantized part plus a trainable low-rank part to finetune LLMs with sub-3-bit storage.

Overview

Decision Snapshot: Needs Validation

The method combines established quantization (NormalFloat, NF) and LoRA ideas with a new decomposition and an ILP-based allocator. Code and multi-model experiments exist, but the decomposition algorithm is heuristic and the ILP requires costly precomputation.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LQ-LoRA cuts model storage and finetuning memory, letting teams run or adapt multi-billion-parameter models on fewer/cheaper GPUs and lower hosting costs.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

LQ-LoRA decomposes each pretrained weight matrix into a fixed, heavily quantized component and a small trainable low-rank component. The paper adds two practical pieces: (1) an ILP-based mixed-precision allocator that picks per-matrix quantization settings under a bit budget, and (2) an optional Fisher-weighted decomposition that uses calibration data. Empirically, on RoBERTa and LLaMA-2 (7B, 70B), LQ-LoRA outperforms QLoRA and GPTQ-LoRA at similar bit budgets. It can compress LLaMA-2-70B to ~2.85 effective bits with modest task degradation while keeping finetuning feasible on a single GPU.

Problem Statement

Large pretrained LMs are costly to finetune and to store. Existing combos of LoRA plus post-training quantization degrade when quantization error is large. The paper asks: can we factor each weight into a quantized part plus a trainable low-rank part, and choose per-layer quantization to hit a memory budget while preserving downstream performance?

Main Contribution

A simple iterative decomposition that writes W ≈ Q + L1 L2 where Q is quantized and fixed and L1,L2 are low-rank and finetuned.

An integer linear program (ILP) that picks mixed quantization configs per matrix to meet an overall bits/parameter target.
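The iterative decomposition described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: it assumes rank 1 (computed by power iteration) instead of a general rank-r SVD, and a uniform round-to-nearest quantizer instead of NormalFloat. All function names are hypothetical. Because the procedure is heuristic, the sketch simply keeps the best iterate it has seen.

```python
import random

def quantize(matrix, bits):
    """Uniform round-to-nearest onto 2**bits levels spanning [min, max] (stand-in for NF)."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in matrix]
    step = (hi - lo) / (2 ** bits - 1)
    return [[lo + round((x - lo) / step) * step for x in row] for row in matrix]

def rank1(matrix, iters=100):
    """Rank-1 approximation u v^T of `matrix` via power iteration; v has unit norm."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0.0:
            return [0.0] * rows, [0.0] * cols
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return u, v

def frob_error(w, q, u, v):
    """Frobenius norm of W - (Q + u v^T)."""
    return sum((w[i][j] - q[i][j] - u[i] * v[j]) ** 2
               for i in range(len(w)) for j in range(len(w[0]))) ** 0.5

def lq_decompose(w, bits=2, steps=10):
    """Alternate between fitting the low-rank part and re-quantizing the remainder."""
    rows, cols = len(w), len(w[0])
    q = [[0.0] * cols for _ in range(rows)]
    best = None
    for _ in range(steps):
        # Low-rank step: fit u v^T to what Q does not explain.
        residual = [[w[i][j] - q[i][j] for j in range(cols)] for i in range(rows)]
        u, v = rank1(residual)
        # Quantization step: quantize what the low-rank part does not explain.
        q = quantize([[w[i][j] - u[i] * v[j] for j in range(cols)] for i in range(rows)], bits)
        err = frob_error(w, q, u, v)
        if best is None or err < best[0]:
            best = (err, q, u, v)
    return best

random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(6)]
err, Q, u, v = lq_decompose(W, bits=2, steps=10)
print(f"reconstruction error: {err:.3f}")
```

In this sketch `Q` would stay frozen during finetuning while `u` and `v` (the rank-1 analogue of L1 and L2) would be the trainable parameters.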

Key Findings

LQ-LoRA matches or improves on QLoRA/GPTQ-LoRA at similar average bits/param.

Numbers: On LLaMA-2-70B, 2.75-bit LQ-LoRA (2.85 effective bits) reaches a C4 perplexity of 6.35 versus 6.50 for the uncompressed model.

Practical Use: Use LQ-LoRA in place of standard QLoRA to get comparable or better language-model quality at the same memory budget.

Evidence Ref: Table 3, Table 6

LQ-LoRA enables aggressive compression to sub-3 bits with modest degradation.

Numbers: 2.75-bit config → 2.85 effective bits (70B); model fits in ~27GB GPU memory

Practical Use: You can run or finetune a 70B model on a single ~40GB GPU or store it in ~27GB in some settings, cutting infra costs.

Evidence Ref: Abstract, Table 3, Figure 3
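A quick back-of-envelope check helps relate bits per parameter to storage. The sketch below is only an estimate under stated assumptions: the parameter count of 70e9 is a round-number stand-in for LLaMA-2-70B, the function name is hypothetical, and the calculation covers the weights alone.

```python
def weight_storage_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: 70e9 approximates the parameter count of a 70B model.
for bits in (16.0, 4.0, 2.85):
    print(f"{bits:>5} bits/param -> {weight_storage_gb(70e9, bits):.1f} GB")
```

At 2.85 bits per parameter this gives roughly 24.9 GB, somewhat below the ~27GB quoted above. The gap is presumably due to items this estimate ignores, such as anything kept at higher precision and runtime buffers.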

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
C4 perplexity (lower better)	6.35 (LQ-LoRA Fisher, 2.75 bits, 70B)	6.50 (uncompressed 16-bit, 70B)	-0.15	C4 validation	Table 6; Section 4.1	Table 6
WikiText-2 perplexity (lower better)	4.32 (LQ-LoRA Fisher, 2.75 bits, 70B)	3.68 (uncompressed 16-bit, 70B)	+0.64	WikiText-2	Table 6; Table 3	Table 6

What To Try In 7 Days

Run the provided LQ-LoRA code on a 7B model with a 3-bit target to verify memory savings and baseline PPL.

Compute a diagonal Fisher on a small calibration set and compare Fisher vs. non-Fisher LQ initializations.

Use the ILP to produce a mixed-precision plan for your target GPU budget and test one downstream task.
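To get a feel for what the mixed-precision plan in the last step looks like, the toy sketch below picks one quantization config per matrix to minimize total reconstruction error under an average bits/param budget. It is not the paper's ILP: it swaps in exhaustive search, which only works for a handful of matrices, and every number and name in it is made up for illustration.

```python
from itertools import product

def allocate(sizes, bits, errors, budget_bits_per_param):
    """Exhaustive search (stand-in for an ILP solver).

    sizes[i]      -- parameter count of matrix i
    bits[c]       -- bits/param of config c
    errors[i][c]  -- precomputed reconstruction error of matrix i under config c
    Returns (total_error, chosen_config_per_matrix, achieved_bits_per_param).
    """
    total_params = sum(sizes)
    total_budget = budget_bits_per_param * total_params
    best = None
    for choice in product(range(len(bits)), repeat=len(sizes)):
        used = sum(n * bits[c] for n, c in zip(sizes, choice))
        if used > total_budget:
            continue  # violates the storage budget
        err = sum(errors[i][c] for i, c in enumerate(choice))
        if best is None or err < best[0]:
            best = (err, choice, used / total_params)
    return best

# Hypothetical inputs: three matrices, three configs (2, 3, 4 bits).
sizes = [100, 100, 400]
bits = [2.0, 3.0, 4.0]
errors = [
    [9.0, 4.0, 1.0],  # matrix 0 is very sensitive to low precision
    [5.0, 3.0, 2.0],  # matrix 1 is fairly robust
    [8.0, 2.0, 1.0],  # matrix 2 is large, so extra bits are expensive
]
plan = allocate(sizes, bits, errors, budget_bits_per_param=3.0)
print(plan)  # -> (8.0, (2, 0, 1), 3.0)
```

Here the search gives the sensitive small matrix 4 bits, the robust one 2 bits, and the large one 3 bits, landing exactly on the 3.0 bits/param budget. A real run would replace the error table with errors measured per matrix and config, and the search with an ILP solver.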

Optimization Features

Infra Optimization

Reduces model storage (bits/param) to fit larger models on fewer GPUs

Model Optimization

Low-rank plus quantized decomposition (W ≈ Q + L1L2)

Fisher-weighted SVD for data-aware factorization
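The intuition behind Fisher weighting can be shown with a small example. The sketch below assumes a common approximation of the diagonal Fisher, the average of squared per-sample gradients over a calibration set, and uses it to weight the reconstruction error. It does not reproduce the paper's weighted SVD; the function names and numbers are hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Elementwise mean of squared per-sample gradients (diagonal Fisher approximation)."""
    n = len(per_sample_grads)
    rows, cols = len(per_sample_grads[0]), len(per_sample_grads[0][0])
    return [[sum(g[i][j] ** 2 for g in per_sample_grads) / n for j in range(cols)]
            for i in range(rows)]

def weighted_sq_error(w, w_hat, weights):
    """Sum over entries of weights[i][j] * (w[i][j] - w_hat[i][j])**2."""
    return sum(weights[i][j] * (w[i][j] - w_hat[i][j]) ** 2
               for i in range(len(w)) for j in range(len(w[0])))

# Two hypothetical calibration samples' gradients for a 2x2 weight matrix.
grads = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 0.0]],
]
fisher = diagonal_fisher(grads)           # [[5.0, 0.0], [0.0, 2.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]

W = [[1.0, 1.0], [1.0, 1.0]]
approx_a = [[1.0, 0.0], [0.0, 1.0]]       # wrong only where Fisher is zero
approx_b = [[0.0, 1.0], [1.0, 1.0]]       # wrong where Fisher is largest

print(weighted_sq_error(W, approx_a, ones), weighted_sq_error(W, approx_b, ones))      # 2.0 1.0
print(weighted_sq_error(W, approx_a, fisher), weighted_sq_error(W, approx_b, fisher))  # 0.0 5.0
```

Under the plain error `approx_b` looks better, but under the Fisher-weighted error `approx_a` wins because its mistakes fall on entries the calibration data is insensitive to.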

System Optimization

PyTorch-based mixed-quantization implementation (no CUDA extension needed)

LoRA

Training Optimization

LoRA

ILP finds mixed-precision per-matrix configs to meet bit budgets

Inference Optimization

Just-in-time dequantization for matmuls via PyTorch dispatch

Possible to run 70B forward/backward at sub-3 bits on a single large GPU
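The idea of just-in-time dequantization can be illustrated without PyTorch. The sketch below is a plain-Python toy, not the repository's dispatch-based implementation: the quantized matrix is stored as small integer codes plus a codebook, and floats are rebuilt only when a matmul needs them. The class, function names, and codebook values are hypothetical, and the forward pass assumes the convention y = x(Q + L1 L2).

```python
class QuantizedMatrix:
    """Holds integer codes and a codebook; float values are materialized on demand."""

    def __init__(self, codes, codebook):
        self.codes = codes          # e.g. 2-bit codes in {0, 1, 2, 3}
        self.codebook = codebook    # maps each code to a float level

    def dequantize(self):
        return [[self.codebook[c] for c in row] for row in self.codes]

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lq_forward(x, q, l1, l2):
    """y = x @ dequant(Q) + (x @ L1) @ L2, dequantizing Q only for this call."""
    base = matmul(x, q.dequantize())
    update = matmul(matmul(x, l1), l2)   # low-rank path never forms L1 @ L2
    return [[b + u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]

# Hypothetical 2-bit codebook and a 2x2 quantized matrix.
q = QuantizedMatrix(codes=[[3, 1], [1, 2]], codebook=[-1.0, 0.0, 0.5, 1.0])
l1 = [[1.0], [0.0]]       # 2x1 trainable factor
l2 = [[0.0, 1.0]]         # 1x2 trainable factor
x = [[2.0, 4.0]]          # one input row

print(lq_forward(x, q, l1, l2))  # -> [[2.0, 4.0]]
```

Only the codes and codebook persist between calls, which is what keeps the stored footprint near the quantized size.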

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/HanGuo97/lq-lora

Data URLs

C4 (public corpus)

WikiText-2 (public)

Risks & Boundaries

Limitations

The decomposition is heuristic; no convergence guarantees.

Precomputing the per-configuration errors that the ILP needs is costly (hours on multiple A100 GPUs).

When Not To Use

If your task needs non-low-rank parameter updates or specialized adapters.

If you can afford full finetuning and need maximum accuracy.

Failure Modes

Aggressive quantization can cause large perplexity increases and downstream degradation (e.g., on GSM8K, ARC).

The ILP optimizes reconstruction error, not the final task metric, so its allocations may not maximize downstream accuracy.

Core Entities

Models

LLaMA-2-7B, LLaMA-2-70B, RoBERTa-Large, LoRA, NormalFloat (NF) quantization

Metrics

perplexity, accuracy, GLUE average score, Vicuna pairwise win rate, bits/parameter, storage (GB)

Datasets

C4, WikiText-2, OpenAssistant, GLUE, MMLU

Benchmarks

MMLU; GLUE; Vicuna-style pairwise evaluation; HuggingFace Open LLM benchmark (ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K)

