IR-QLoRA: raise accuracy of 2–4 bit LoRA-finetuned LLMs by maximizing information in quantized weights

February 8, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Experiments across LLaMA/LLaMA2 models, multiple datasets, and 2–4 bit quantizers show consistent gains and low overhead; results are limited to the benchmarks and hardware used in the paper.

Citations: 42

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 60%

Authors

Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno

Why It Matters For Business

IR-QLoRA cuts model size to 2–4 bits while restoring much of the lost accuracy, enabling cheaper inference and on-device deployment with only tiny extra finetune time and small storage overhead.

Summary TLDR

IR-QLoRA is a two-part recipe to make extremely low-bit quantized LLMs work better with LoRA finetuning. It (1) searches a per-block calibration constant to maximize the entropy of quantized weights (ICQ) and (2) adds a parameter-free connection so LoRA adapters can access original features (IEC). Across LLaMA/LLaMA2 models (7B–65B) and 2–4 bit settings, IR-QLoRA raises MMLU and commonsense QA accuracy over QLoRA/QA-LoRA while adding negligible storage and a fraction of a percent extra finetune time. Code: github.com/htqin/ir-qlora.

Problem Statement

Post-training quantization of LLMs to 2–4 bits saves memory but cuts accuracy. LoRA finetuning helps but often cannot recover information lost by aggressive quantization. The paper asks: can we calibrate quantizers and let LoRA access original features so low-bit models retain more information and accuracy?

Main Contribution

Information Calibration Quantization (ICQ): per-block search for a calibration constant that maximizes the entropy of quantized weights to retain more information.
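The per-block search described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the uniform 16-level codebook stands in for a real 4-bit quantizer, the helper names (`code_entropy`, `quantize_block`, `icq_search`) are hypothetical, and the search window (centered on the block median, `lam` standard deviations wide on each side, `2 * n + 1` candidates) is one plausible reading of the λ and n hyperparameters mentioned later in this summary.

```python
import numpy as np


def code_entropy(codes: np.ndarray, num_levels: int) -> float:
    """Shannon entropy, in bits, of the histogram of quantization codes."""
    counts = np.bincount(codes, minlength=num_levels)
    probs = counts[counts > 0] / codes.size
    return float(-(probs * np.log2(probs)).sum())


def quantize_block(w: np.ndarray, tau: float, levels: np.ndarray) -> np.ndarray:
    """Shift by tau, absmax-scale into [-1, 1], snap to the nearest level index."""
    shifted = w - tau
    scale = np.abs(shifted).max()
    if scale == 0.0:
        scale = 1.0  # constant block: avoid division by zero
    normed = shifted / scale
    return np.abs(normed[:, None] - levels[None, :]).argmin(axis=1)


def icq_search(w: np.ndarray, levels: np.ndarray, lam: float = 0.1, n: int = 100):
    """Return (best_tau, best_entropy) over a grid of candidate shifts."""
    # Start from the uncalibrated case (tau = 0) so the result is never worse.
    best_tau = 0.0
    best_h = code_entropy(quantize_block(w, 0.0, levels), levels.size)
    center, sigma = np.median(w), w.std()
    # Assumed search window: median +/- lam * sigma, 2n + 1 evenly spaced shifts.
    for tau in np.linspace(center - lam * sigma, center + lam * sigma, 2 * n + 1):
        h = code_entropy(quantize_block(w, tau, levels), levels.size)
        if h > best_h:
            best_tau, best_h = float(tau), h
    return best_tau, best_h


# Toy usage: one 64-weight block whose mean is offset from zero.
rng = np.random.default_rng(0)
block = rng.normal(loc=0.02, scale=0.05, size=64)
levels = np.linspace(-1.0, 1.0, 16)  # uniform stand-in codebook (assumption)
baseline_h = code_entropy(quantize_block(block, 0.0, levels), levels.size)
tau, calibrated_h = icq_search(block, levels)
print(f"tau={tau:.4f}  entropy {baseline_h:.3f} -> {calibrated_h:.3f} bits")
```

Because the search runs once per block before finetuning, its cost scales with the number of blocks times the number of candidates, which is why the grid size matters.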

Information Elastic Connection (IEC): a parameter-free input propagation for LoRA adapters so they can use original representations without large parameter cost.
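One way to picture a parameter-free connection of this kind is sketched below. It assumes the input is shrunk to the LoRA rank by averaging equal-sized chunks and expanded to the output width by repetition, so no new weight matrices are needed. The function name `lora_with_iec` and the scalars `beta1` and `beta2` are hypothetical, and the exact formulation in the paper may differ.

```python
import numpy as np


def lora_with_iec(x, A, B, beta1=1.0, beta2=1.0):
    """LoRA forward pass with assumed parameter-free shortcut paths.

    x: (batch, h) input, A: (h, r) down-projection, B: (r, o) up-projection.
    Requires h and o to be multiples of the rank r.
    """
    batch, h = x.shape
    r, o = A.shape[1], B.shape[1]
    if h % r or o % r:
        raise ValueError("h and o must be divisible by the LoRA rank r")
    # Shrink h -> r without parameters: average the h/r chunks of width r.
    x_avg = x.reshape(batch, h // r, r).mean(axis=1)
    mid = x @ A + beta1 * x_avg
    # Expand r -> o without parameters: repeat the r features o/r times.
    mid_rep = np.tile(mid, (1, o // r))
    return mid @ B + beta2 * mid_rep


# Toy usage: with zeroed LoRA matrices only the shortcut paths contribute.
x = np.arange(8, dtype=float).reshape(1, 8)
A = np.zeros((8, 2))
B = np.zeros((2, 4))
out = lora_with_iec(x, A, B)
print(out)
```

With `A` and `B` set to zero the output is just the averaged input tiled to the output width, which shows that the adapter still sees the original features even before any LoRA weights are learned.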

Key Findings

4-bit LLaMA-7B finetuned with IR-QLoRA on Alpaca reaches 40.8% MMLU vs QLoRA 38.4% and QA-LoRA 39.4%

Numbers: MMLU avg 40.8% (IR-QLoRA) vs 38.4% (QLoRA), +2.4pp

Practical Use: If you deploy 4-bit LLaMA-7B, adding ICQ and IEC to the LoRA pipeline gained about 2.4 percentage points on MMLU over QLoRA in the paper's Alpaca setting.

Evidence Ref: Table 1 (LLaMA-7B, Alpaca, 4-bit)

IR-QLoRA narrows the ultra-low-bit gap: 2-bit LLaMA-7B finetuned on Flan v2 scores 33.7% vs 34.6% for 16-bit, a 0.9pp gap

Numbers: 2-bit Flan v2 avg 33.7% (IR-QLoRA) vs 34.6% (16-bit), −0.9pp

Practical Use: For extreme 2-bit deployments, IR-QLoRA can approach full-precision accuracy on the evaluated benchmarks; these experiments show a loss of about 1 percentage point on MMLU.

Evidence Ref: Table 9 (2-bit, LLaMA-7B, Flan v2)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 40.8% (IR-QLoRA, LLaMA-7B, 4-bit, Alpaca) | 38.4% (QLoRA, same setting) | +2.4pp | MMLU (Alpaca finetune) | Table 1 shows IR-QLoRA 40.8 vs QLoRA 38.4 (LLaMA-7B, 4-bit) | Table 1 |
| Accuracy | 33.7% (IR-QLoRA, LLaMA-7B, 2-bit, Flan v2) | 34.6% (16-bit LLaMA-7B) | −0.9pp | MMLU (Flan v2 finetune) | Table 9 indicates 2-bit IR-QLoRA 33.7% vs 16-bit 34.6% | Table 9 |
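The deltas in the results above are percentage-point differences, which are easy to confuse with relative changes. The short sketch below recomputes both from the reported accuracies; the helper names are illustrative only.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_change(value: float, baseline: float) -> float:
    """Change relative to the baseline, expressed in percent."""
    return (value - baseline) / baseline * 100.0


# Accuracies reported in the summary (MMLU, LLaMA-7B).
gain_4bit = delta_pp(40.8, 38.4)         # IR-QLoRA vs QLoRA, 4-bit, Alpaca
gap_2bit = delta_pp(33.7, 34.6)          # 2-bit IR-QLoRA vs 16-bit, Flan v2
rel_4bit = relative_change(40.8, 38.4)   # same gain as a relative change

print(f"4-bit gain: {gain_4bit:+.1f}pp ({rel_4bit:.2f}% relative)")
print(f"2-bit gap:  {gap_2bit:+.1f}pp")
```

When comparing against an internal baseline, it helps to state which of the two measures is being quoted.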

What To Try In 7 Days

Apply ICQ to your quantized weights (per-block calibration search) before LoRA finetuning.

Add IEC-style parameter-free connections to LoRA adapters so they can access original features.

Compare 4-bit IR-QLoRA vs your current QLoRA baseline on a small validation suite (MMLU-like tasks).
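For the comparison step, a small paired evaluation harness is often enough. The sketch below assumes you already have per-question predictions from both models on the same validation items; the function names, the toy data, and the choice of a paired bootstrap are illustrative assumptions rather than anything prescribed by the paper.

```python
import random


def accuracy(preds, golds):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def paired_bootstrap_delta(preds_a, preds_b, golds, n_resamples=1000, seed=0):
    """Return (delta, low, high) for accuracy(b) - accuracy(a).

    low/high bound an approximate 95% interval obtained by resampling the
    same question indices for both models (paired bootstrap).
    """
    rng = random.Random(seed)
    n = len(golds)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(preds_a[i] == golds[i] for i in idx) / n
        acc_b = sum(preds_b[i] == golds[i] for i in idx) / n
        diffs.append(acc_b - acc_a)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    delta = accuracy(preds_b, golds) - accuracy(preds_a, golds)
    return delta, low, high


# Toy usage: 20 multiple-choice items; "X" marks a wrong answer.
golds = ["A", "B", "C", "D"] * 5
qlora_preds = ["X"] * 8 + golds[8:]      # baseline: 12/20 correct
ir_qlora_preds = ["X"] * 6 + golds[6:]   # candidate: 14/20 correct
delta, low, high = paired_bootstrap_delta(qlora_preds, ir_qlora_preds, golds)
print(f"delta={delta:+.3f}  95% interval=[{low:+.3f}, {high:+.3f}]")
```

On a small suite the interval will be wide, so a gain of a couple of percentage points may need a few hundred items before it is distinguishable from noise.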

Optimization Features

Infra Optimization
Small extra one-time calibration compute; uses existing A100 GPUs in experiments
Model Optimization
Quantization with entropy-aware calibration (ICQ); NormalFloat and integer quantizer support
System Optimization
Blockwise calibration search cached once; negligible recurring cost
Training Optimization
LoRA
Inference Optimization
Double quantization to reduce stored scaling factors; IEC integrated to avoid added inference cost
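To see why double quantization of scaling factors matters for storage, the sketch below estimates effective bits per weight. The block sizes and bit widths (64 weights per block with a 32-bit scale, and 256 scales per second-level block quantized to 8 bits) are the figures usually quoted for QLoRA-style setups and are assumptions here, not values stated in this summary. Any extra per-block calibration constant stored by ICQ is not modeled.

```python
def bits_per_weight(weight_bits, block_size, scale_bits=32,
                    dq_block_size=None, dq_scale_bits=8):
    """Effective storage per weight, including blockwise scaling factors.

    Without double quantization, each block stores one scale_bits-wide scale.
    With it, scales are stored at dq_scale_bits, plus one scale_bits-wide
    second-level scale per dq_block_size first-level scales.
    """
    if dq_block_size is None:
        return weight_bits + scale_bits / block_size
    first_level = dq_scale_bits / block_size
    second_level = scale_bits / (block_size * dq_block_size)
    return weight_bits + first_level + second_level


plain = bits_per_weight(4, 64)
double_q = bits_per_weight(4, 64, dq_block_size=256)
print(f"4-bit, plain scales:      {plain:.3f} bits/weight")
print(f"4-bit, double-quantized:  {double_q:.3f} bits/weight")
print(f"saved per weight:         {plain - double_q:.3f} bits")
```

Under these assumptions the scale overhead drops from 0.5 to roughly 0.127 bits per weight, which adds up across billions of parameters.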

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluations focus on MMLU and commonsense QA; other tasks (generation, instruction tuning out of distribution) are less explored.

Calibration search cost grows with range/granularity; authors recommend λ=0.1, n=100 to keep overhead tiny.

When Not To Use

If you cannot afford any extra one-time calibration compute or need zero-change finetune pipelines.

When your deployment quantizer already uses a zero-point scheme that subsumes ICQ and offers similar calibration.

Failure Modes

Insufficient entropy gain when quantization is extremely aggressive and distributional assumptions fail.

Poorly chosen search hyperparameters (λ, n) can waste calibration time or yield suboptimal calibration.

Core Entities

Models

LLaMA (7B, 13B, 30B, 65B), LLaMA2 (7B, 13B)

Metrics

Accuracy, entropy (of quantized weights), training time, model size (GB)

Datasets

Alpaca, Flan v2, MMLU, CommonsenseQA (HellaSwag, PIQA, WinoGrande, ARC, BoolQ, OpenBookQA)

Benchmarks

MMLU, CommonsenseQA