IR-QLoRA: raise accuracy of 2–4 bit LoRA-finetuned LLMs by maximizing information in quantized weights

Overview

Decision Snapshot: Ready For Pilot

Experiments across LLaMA/LLaMA2 models, multiple datasets, and 2–4 bit quantizers show consistent gains and low overhead; results are limited to the benchmarks and hardware used in the paper.

Citations: 42

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 60%

Authors

Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno

Links

Abstract / PDF / Code

Why It Matters For Business

IR-QLoRA cuts model size to 2–4 bits while restoring much of the lost accuracy, enabling cheaper inference and on-device deployment with only tiny extra finetune time and small storage overhead.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

IR-QLoRA is a two-part recipe to make extremely low-bit quantized LLMs work better with LoRA finetuning. It (1) searches a per-block calibration constant to maximize the entropy of quantized weights (ICQ) and (2) adds a parameter-free connection so LoRA adapters can access original features (IEC). Across LLaMA/LLaMA2 models (7B–65B) and 2–4 bit settings, IR-QLoRA raises MMLU and commonsense QA accuracy over QLoRA/QA-LoRA while adding negligible storage and a fraction of a percent extra finetune time. Code: github.com/htqin/ir-qlora.

Problem Statement

Post-training quantization of LLMs to 2–4 bits saves memory but cuts accuracy. LoRA finetuning helps but often cannot recover information lost by aggressive quantization. The paper asks: can we calibrate quantizers and let LoRA access original features so low-bit models retain more information and accuracy?

Main Contribution

Information Calibration Quantization (ICQ): per-block search for a calibration constant that maximizes the entropy of quantized weights to retain more information.

Information Elastic Connection (IEC): a parameter-free input propagation for LoRA adapters so they can use original representations without large parameter cost.
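The ICQ search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a uniform grid stands in for the NormalFloat levels, the block is scaled by its absolute maximum, the search is centered on the block median, and `lam` and `n` are read as the half-width (in units of the block standard deviation) and the number of steps per side. All function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def quantize_block(weights, tau, bits=4):
    """Shift a block by tau, scale by its absmax, and map to 2**bits uniform codes."""
    shifted = [w - tau for w in weights]
    scale = max(abs(x) for x in shifted) or 1.0  # guard against an all-zero block
    levels = 2 ** bits
    # Map the range [-1, 1] onto integer codes 0 .. levels - 1.
    return [min(levels - 1, int((x / scale + 1.0) / 2.0 * levels)) for x in shifted]


def code_entropy(codes):
    """Shannon entropy (in bits) of the empirical distribution of codes."""
    total = len(codes)
    return -sum((c / total) * math.log2(c / total) for c in Counter(codes).values())


def search_tau(weights, bits=4, lam=0.1, n=100):
    """Grid-search tau around the block median; return (best_tau, best_entropy)."""
    center = statistics.median(weights)  # assumed starting point
    sigma = statistics.pstdev(weights)
    candidates = [center + lam * sigma * i / n for i in range(-n, n + 1)]
    scored = [(code_entropy(quantize_block(weights, t, bits)), t) for t in candidates]
    best_entropy, best_tau = max(scored)
    return best_tau, best_entropy


# Usage: one hypothetical 64-weight block with a non-zero offset.
block = [0.02 * i - 0.3 for i in range(64)]
tau, entropy_bits = search_tau(block)
```

The entropy of a `bits`-bit code can never exceed `bits`, so the returned value also indicates how much of the available code space the block actually uses.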

Key Findings

4-bit LLaMA-7B finetuned with IR-QLoRA on Alpaca reaches 40.8% MMLU vs QLoRA 38.4% and QA-LoRA 39.4%

Numbers: MMLU avg 40.8% (IR-QLoRA) vs 38.4% (QLoRA), +2.4pp

Practical Use: If you deploy 4-bit LLaMA-7B, add ICQ+IEC in the LoRA pipeline to gain ~2.4 percentage points on MMLU versus QLoRA.

Evidence Ref: Table 1 (LLaMA-7B, Alpaca, 4-bit)

IR-QLoRA narrows the ultra-low-bit gap: 2-bit LLaMA-7B finetuned on Flan v2 scores 33.7% vs 34.6% for 16-bit, a 0.9pp gap

Numbers: 2-bit Flan v2 avg 33.7% (IR-QLoRA) vs 34.6% (16-bit), −0.9pp

Practical Use: For extreme 2-bit deployments, IR-QLoRA can approach full-precision accuracy on the evaluated benchmarks; in these experiments the MMLU loss is only about 1 percentage point.

Evidence Ref: Table 9 (2-bit, LLaMA-7B, Flan v2)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	40.8% (IR-QLoRA, LLaMA-7B, 4-bit, Alpaca)	38.4% (QLoRA, same setting)	+2.4pp	MMLU (Alpaca finetune)	Table 1 shows IR-QLoRA 40.8 vs QLoRA 38.4 (LLaMA-7B, 4-bit)	Table 1
Accuracy	33.7% (IR-QLoRA, LLaMA-7B, 2-bit, Flan v2)	34.6% (16-bit LLaMA-7B)	−0.9pp	MMLU (Flan v2 finetune)	Table 9 indicates 2-bit IR-QLoRA 33.7% vs 16-bit 34.6%	Table 9
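The Delta column is a difference in percentage points, not a relative change. A small sketch that recomputes both deltas from the Value and Baseline columns above (the row labels and the helper name `delta_pp` are illustrative):

```python
# (label, value %, baseline %) taken from the two result rows above.
rows = [
    ("4-bit IR-QLoRA vs QLoRA (Alpaca)", 40.8, 38.4),
    ("2-bit IR-QLoRA vs 16-bit (Flan v2)", 33.7, 34.6),
]


def delta_pp(value, baseline):
    """Difference in percentage points, rounded to one decimal place."""
    return round(value - baseline, 1)


deltas = {label: delta_pp(value, baseline) for label, value, baseline in rows}

for label, delta in deltas.items():
    print(f"{label}: {delta:+.1f}pp")
```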

What To Try In 7 Days

Apply ICQ to your quantized weights (per-block calibration search) before LoRA finetuning.

Add IEC-style parameter-free connections to LoRA adapters so they can access original features.

Compare 4-bit IR-QLoRA vs your current QLoRA baseline on a small validation suite (MMLU-like tasks).
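For the second step above, here is a minimal sketch of an IEC-style connection. It assumes one plausible reading of "parameter-free": the adapter input is averaged in groups down to the LoRA rank and added to the low-rank projection, and that intermediate is then tiled up to the output width. The helper names, the strided grouping, and the scalars `beta1` and `beta2` are illustrative assumptions, not the authors' exact formulation.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def group_average(x, r):
    """Fold a length-h vector into r values by averaging h // r strided entries."""
    h = len(x)
    assert h % r == 0, "input width must be a multiple of the LoRA rank"
    k = h // r
    return [sum(x[j * r + i] for j in range(k)) / k for i in range(r)]


def tile(u, out_dim):
    """Repeat a length-r vector until it has out_dim entries."""
    assert out_dim % len(u) == 0, "output width must be a multiple of the LoRA rank"
    return u * (out_dim // len(u))


def lora_with_elastic_connection(x, A, B, beta1=1.0, beta2=1.0):
    """Adapter branch only: A is r x h, B is out_dim x r (lists of rows)."""
    r, out_dim = len(A), len(B)
    # Down-projection plus a weight-free shortcut from the original input.
    u = [a + beta1 * g for a, g in zip(matvec(A, x), group_average(x, r))]
    # Up-projection plus a weight-free shortcut from the intermediate.
    return [b + beta2 * t for b, t in zip(matvec(B, u), tile(u, out_dim))]


# Usage: with zero-initialized A and B, only the shortcuts contribute.
A = [[0.0] * 4 for _ in range(2)]
B = [[0.0] * 2 for _ in range(4)]
out = lora_with_elastic_connection([1.0, 2.0, 3.0, 4.0], A, B)
```

In this sketch the shortcuts add no weight matrices, which is why the storage cost stays small; whether the scalars are fixed or learned is left open here.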

Optimization Features

Infra Optimization

Small extra one-time calibration compute; the experiments ran on A100 GPUs

Model Optimization

Quantization with entropy-aware calibration (ICQ); NormalFloat and integer quantizer support

System Optimization

Blockwise calibration search cached once; negligible recurring cost

Training Optimization

LoRA

Inference Optimization

Double quantization to reduce stored scaling factors; IEC integrated to avoid added inference cost
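To see why double quantization shrinks the stored scaling factors, the per-weight overhead can be worked out directly. The block size of 64, the 8-bit second-level codes, and the group size of 256 below are assumptions borrowed from common QLoRA-style settings; this summary does not state the values used.

```python
def scale_overhead_bits(block_size, scale_bits=32):
    """Extra bits per weight when each block stores one full-precision scale."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size, code_bits=8, group_size=256, group_scale_bits=32):
    """Extra bits per weight when scales are themselves quantized in groups."""
    per_block_code = code_bits / block_size
    per_group_scale = group_scale_bits / (block_size * group_size)
    return per_block_code + per_group_scale


plain = scale_overhead_bits(64)           # one 32-bit scale per 64 weights
double = double_quant_overhead_bits(64)   # 8-bit scale codes + one 32-bit scale per 256 blocks
saved = plain - double                    # bits saved per weight
```

Under these assumed settings the overhead falls from 0.5 to roughly 0.127 bits per weight.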

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/htqin/ir-qlora

Risks & Boundaries

Limitations

Evaluations focus on MMLU and commonsense QA; other tasks (generation, out-of-distribution instruction tuning) are less explored.

Calibration search cost grows with range/granularity; authors recommend λ=0.1, n=100 to keep overhead tiny.
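A rough way to see how the search cost scales is sketched below. It assumes λ is the half-width of the search range in units of the block standard deviation and n is the number of grid steps on each side of the center; the summary names both symbols without defining them, so treat this reading and the function names as hypothetical.

```python
import math


def candidate_grid(center, sigma, lam=0.1, n=100):
    """Evenly spaced candidates in [center - lam*sigma, center + lam*sigma]."""
    step = lam * sigma / n
    return [center + i * step for i in range(-n, n + 1)]


def search_evaluations(num_weights, block_size, n):
    """Total quantize-and-score evaluations: one per candidate per block."""
    blocks = math.ceil(num_weights / block_size)
    return blocks * (2 * n + 1)


grid = candidate_grid(center=0.0, sigma=1.0)
evals_small = search_evaluations(num_weights=6400, block_size=64, n=100)
evals_fine = search_evaluations(num_weights=6400, block_size=64, n=1000)
```

Under this reading the cost is linear in n, so a tenfold finer grid costs about ten times as many evaluations, and the work is done once per model rather than once per finetune step.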

When Not To Use

If you cannot afford any extra one-time calibration compute or need zero-change finetune pipelines.

When your deployment quantizer already uses a zero-point scheme that subsumes ICQ and offers similar calibration.

Failure Modes

Insufficient entropy gain when quantization is extremely aggressive and distributional assumptions fail.

Search hyperparameters (λ, n) chosen poorly can waste time or yield suboptimal calibration.

Core Entities

Models

LLaMA (7B, 13B, 30B, 65B), LLaMA2 (7B, 13B)

Metrics

Accuracy, entropy (of quantized weights), training time, model size (GB)

Datasets

Alpaca, Flan v2, MMLU, CommonsenseQA (HellaSwag, PIQA, WinoGrande, ARC, BoolQ, OpenBookQA)

Benchmarks

MMLU, CommonsenseQA

