Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference.

Overview

Decision Snapshot: Ready For Pilot

PEQA is practical: it combines known quantization and PEFT ideas, and it shows clear memory and model-size gains across multiple public models and tasks. The reported experiments support the deployment benefits, though further hyperparameter tuning may improve accuracy.

Citations: 28

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 50%

Authors

Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee

Links

Abstract / PDF / Data

Why It Matters For Business

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Who Should Care

ML Engineer, Engineering Lead, CTO

Summary TLDR

PEQA fine-tunes only the quantization scales of a weight-quantized LLM while keeping the integer weight indices frozen. This lets you (a) fine-tune large models under much lower DRAM, (b) keep the model in low-bit form for fast inference, and (c) restore much of the original model quality on language and reasoning tasks. Experiments (up to 65B params) show large model-size drops (e.g., ~130GB -> 33GB for LLaMA-65B at 4-bit) and comparable perplexity and instruction-following performance to full-precision PEFT baselines on evaluated benchmarks.

Problem Statement

Fine-tuning LLMs needs huge memory because pretrained weights remain in high precision. Quantization reduces model size for inference but is not usually compatible with parameter-efficient fine-tuning (PEFT) workflows. The paper asks: can we adapt quantized LLMs efficiently for tasks without restoring full-precision weights and while keeping inference acceleration?

Main Contribution

PEQA method: update only per-channel quantization scales while keeping integer weight indices frozen.

Empirical comparison across QAT, PEFT+PTQ, and PEQA showing that PEQA matches or closely trails QAT on perplexity for 3-bit and 4-bit weights.
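
The split between frozen integer indices and per-channel scales can be illustrated with a minimal NumPy sketch. It assumes asymmetric round-to-nearest (RTN) quantization with one scale and one zero-point per output channel; the zero-point, the function names `rtn_quantize` and `dequantize`, and the toy shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rtn_quantize(weight, bits=4):
    """Round-to-nearest quantization, one scale/zero-point per output channel (row).

    Assumption: asymmetric uniform quantization; names are hypothetical.
    """
    qmax = 2 ** bits - 1
    w_min = weight.min(axis=1, keepdims=True)
    w_max = weight.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)   # shape (out, 1)
    zero = np.round(-w_min / scale)                    # shape (out, 1)
    w_int = np.clip(np.round(weight / scale) + zero, 0, qmax).astype(np.uint8)
    return w_int, scale, zero

def dequantize(w_int, scale, zero):
    """Rebuild floating-point weights from integer indices and per-channel scales."""
    return scale * (w_int.astype(np.float32) - zero)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)        # toy "pretrained" weights

w_int, scale, zero = rtn_quantize(W, bits=4)
W_hat = dequantize(w_int, scale, zero)

# In a PEQA-style setup only `scale` would be trainable; `w_int` stays frozen.
print("integer index range:", w_int.min(), w_int.max())
print("max abs reconstruction error:", float(np.abs(W - W_hat).max()))
```

In this sketch, fine-tuning would change `scale` only, so the deployed artifact is still `w_int` plus a small vector per layer.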

Key Findings

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).

Practical Use: You can deploy much larger LLaMA variants on the same hardware by switching to PEQA + 4-bit quantization.

Evidence Ref: Table 4
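
As a sanity check on these sizes, the following back-of-the-envelope estimate rebuilds both figures from the public LLaMA-65B layer shapes. The storage assumptions are mine, not stated in the source: only the linear layers inside the transformer blocks are quantized, embeddings, output head and norms stay in FP16, each output channel carries one FP16 scale and one FP16 zero-point, and 1GB means 10^9 bytes.

```python
# LLaMA-65B shape constants (public model configuration).
HIDDEN, FFN, LAYERS, VOCAB = 8192, 22016, 80, 32000

# Linear layers inside each transformer block (assumed to be the quantized part).
attn = 4 * HIDDEN * HIDDEN              # q, k, v, o projections
mlp = 3 * HIDDEN * FFN                  # gate, up, down projections
linear_params = LAYERS * (attn + mlp)

# Assumed to stay in FP16: token embeddings, output head, RMSNorm weights.
other_params = 2 * VOCAB * HIDDEN + LAYERS * 2 * HIDDEN + HIDDEN

# One scale and one zero-point per output channel (assumed FP16 each).
out_channels = LAYERS * (4 * HIDDEN + 2 * FFN + HIDDEN)

GB = 1e9
fp16_gb = (linear_params + other_params) * 2 / GB
int4_gb = (linear_params * 0.5 + other_params * 2 + out_channels * 2 * 2) / GB

print(f"FP16 model: {fp16_gb:.2f} GB")
print(f"4-bit estimate: {int4_gb:.2f} GB")
```

Under these assumptions the estimate lands close to the Table 4 values, which suggests that most of the remaining 4-bit footprint beyond the packed weights comes from the unquantized embeddings and output head.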

PEQA cuts the number of trainable parameters versus LoRA for large models.

Numbers: LLaMA-65B learnable params: LoRA 10.49M vs PEQA 6.80M (Table 4).

Practical Use: Fine-tuning memory for optimizer states drops; you can train with smaller GPU memory or larger batch per card.

Evidence Ref: Table 4
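
The two parameter counts can be reproduced from layer shapes with a short calculation. This is a sketch under assumptions not stated in the source: LoRA is taken to use rank 4 on the query and value projections only, PEQA is taken to learn one scale per output channel of every linear layer in a block, and the optimizer is assumed to be Adam with two FP32 states per trainable parameter.

```python
# LLaMA-65B shape constants (public model configuration).
HIDDEN, FFN, LAYERS = 8192, 22016, 80

def lora_trainable(rank, targets_per_layer=2):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted HIDDEN x HIDDEN projection."""
    return LAYERS * targets_per_layer * rank * (HIDDEN + HIDDEN)

def peqa_trainable():
    """One learnable scale per output channel: q, k, v, o, gate, up, down."""
    return LAYERS * (4 * HIDDEN + 2 * FFN + HIDDEN)

def adam_state_mb(n_params, bytes_per_state=4):
    """Memory for Adam's two moment estimates (assumed FP32)."""
    return 2 * n_params * bytes_per_state / 1e6

lora = lora_trainable(rank=4)   # assumed configuration: rank 4 on q and v
peqa = peqa_trainable()

print(f"LoRA: {lora / 1e6:.2f}M trainable, Adam states ~{adam_state_mb(lora):.1f} MB")
print(f"PEQA: {peqa / 1e6:.2f}M trainable, Adam states ~{adam_state_mb(peqa):.1f} MB")
```

Both counts are small next to the base model, so the larger fine-tuning saving in practice is likely the frozen weights being held in 4-bit rather than FP16.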

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Model size (deployment)	33.45GB (PEQA, LLaMA-65B, 4-bit)	130.57GB (LoRA, LLaMA-65B, FP16)	-97.12GB	Table 4	Table 4 reports model sizes for LLaMA-65B: LoRA 130.57GB, PEQA 33.45GB (4-bit).	Table 4
Number of learnable parameters	6.80M (PEQA, LLaMA-65B)	10.49M (LoRA, LLaMA-65B)	-3.69M	Table 4	Table 4 shows LoRA 10.49M vs PEQA 6.80M for LLaMA-65B.	Table 4

What To Try In 7 Days

Apply PEQA 4-bit to an existing large model pipeline to measure DRAM and inference latency reductions.

Run a small instruction-tuning job (Alpaca-style) with PEQA to test whether task accuracy recovers from simple PTQ loss.

Replace LoRA-based fine-tuning on one model with PEQA and compare optimizer memory and deployment size.

Optimization Features

Infra Optimization

enables larger models to run under constrained DRAM footprints

Model Optimization

sub-4-bit weight-only quantization

per-channel scales and integer weight indices

group-wise per-channel quantization option

System Optimization

lower peak GPU memory during fine-tuning

task-switching by swapping small scale vectors
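
Task switching by swapping scale vectors can be sketched as follows. The task names, the `forward` helper, and the toy shapes are hypothetical; the point is only that the integer matrix is shared while each task contributes one small per-channel vector.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features = 4, 8

# Shared, frozen 4-bit indices and zero-points (toy values for illustration).
w_int = rng.integers(0, 16, size=(out_features, in_features), dtype=np.uint8)
zero = np.full((out_features, 1), 8.0, dtype=np.float32)

# One small scale vector per task; in practice these would come from fine-tuning.
task_scales = {
    "summarize": rng.uniform(0.01, 0.02, size=(out_features, 1)).astype(np.float32),
    "qa": rng.uniform(0.01, 0.02, size=(out_features, 1)).astype(np.float32),
}

def forward(x, task):
    """Apply the shared integer weights with the scale vector of the chosen task."""
    w = task_scales[task] * (w_int.astype(np.float32) - zero)
    return x @ w.T

x = rng.normal(size=(2, in_features)).astype(np.float32)
y_sum = forward(x, "summarize")
y_qa = forward(x, "qa")

print("shared weight bytes:", w_int.nbytes)
print("per-task scale bytes:", task_scales["qa"].nbytes)
```

The per-task payload grows with the number of output channels, not with the full weight matrix, which is what makes swapping cheap for wide layers.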

Training Optimization

update only quantization scales (no integer weights)

RTN (round-to-nearest) initialization for scales

reduced optimizer state due to fewer trainable params
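
A toy version of scale-only training, assuming a single linear layer, a squared-error objective, and plain gradient descent in NumPy. The synthetic "task" weights, the loss, and the step-size rule are illustrative assumptions and not the paper's training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
bits, out_f, in_f, n = 4, 6, 16, 256
qmax = 2 ** bits - 1

# Pretrained weights and RTN initialization, one scale/zero-point per output channel.
W = rng.normal(size=(out_f, in_f))
w_min = W.min(axis=1, keepdims=True)
w_max = W.max(axis=1, keepdims=True)
scale0 = (w_max - w_min) / qmax
zero = np.round(-w_min / scale0)
w_int = np.clip(np.round(W / scale0) + zero, 0, qmax)   # frozen from here on
w_int_before = w_int.copy()

# Hypothetical downstream task: targets produced by slightly shifted weights.
gain = rng.uniform(0.8, 1.2, size=(out_f, 1))
W_task = gain * W + 0.02 * rng.normal(size=W.shape)
X = rng.normal(size=(n, in_f))
Y = X @ W_task.T

P = X @ (w_int - zero).T            # layer outputs before scaling, shape (n, out_f)

def loss(s):
    """Mean squared error of the dequantized layer, as a function of the scales only."""
    return 0.5 * np.mean(np.sum((P * s - Y) ** 2, axis=1))

s = scale0[:, 0].copy()             # the only trainable parameters
curvature = np.mean(P ** 2, axis=0) # per-channel second derivative of the loss
lr = 0.5 / curvature.max()          # conservative step size for this quadratic

initial_loss = loss(s)
for _ in range(200):
    grad = np.mean(P * (P * s - Y), axis=0)   # d(loss)/d(scale), shape (out_f,)
    s -= lr * grad
final_loss = loss(s)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because the loss is quadratic and separable per channel, the chosen step size guarantees a monotone decrease here; real fine-tuning would backpropagate a language-modeling loss through many such layers instead.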

Inference Optimization

keeps low-bit integer weights for fast quantized kernels

reduced DRAM reads in matrix-vector ops

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

Wikitext2, PennTreeBank, Alpaca, MMLU

Risks & Boundaries

Limitations

PEQA does not update integer weight indices, so it may be less flexible than full QAT on some tasks.

RTN initialization and a small hyperparameter search were used; results might improve with more tuning.

When Not To Use

When you need the absolute best possible accuracy and can afford full QAT on full-precision weights.

If you cannot run any quantized inference kernels on your deployment hardware.

Failure Modes

PEQA may underperform if integer weight quantization introduced unrecoverable rounding artifacts for a specific task.

Insufficient tuning (learning rate/epochs) can leave PEQA short of full-precision performance on very large models.

Core Entities

Models

LLaMA, LLaMA 2, GPT-Neo, GPT-J, OPT

Metrics

Perplexity, Accuracy, ROUGE-L

Datasets

Wikitext2, PennTreeBank, Alpaca, MMLU

Benchmarks

MMLU, PIQA, HellaSwag, ARC-Challenge, ARC-Easy, OpenBookQA, Natural Instruction

