Fine-tune quantized LLMs by updating only quantization scales to save memory and keep fast inference.

May 23, 2023 | 8 min

Overview

Decision Snapshot: Ready For Pilot

PEQA is practical. It combines known quantization and PEFT ideas, and it shows clear memory and model-size gains across multiple public models and tasks. The reported experiments and numbers support the deployment benefits, though further hyperparameter tuning may improve accuracy.

Citations: 28

Evidence Strength: 0.85

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 75%

Novelty: 50%

Authors

Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee

Links

Abstract / PDF / Data

Why It Matters For Business

PEQA lets teams fine-tune and serve much larger LLMs on the same hardware by keeping models in low-bit form and only shipping small task-specific scale vectors, cutting memory and inference cost while preserving most performance.

Who Should Care

Summary TLDR

PEQA fine-tunes only the quantization scales of a weight-quantized LLM while keeping the integer weight indices frozen. This lets you (a) fine-tune large models with much less DRAM, (b) keep the model in low-bit form for fast inference, and (c) restore much of the original model quality on language and reasoning tasks. Experiments on models of up to 65B parameters show large reductions in model size (e.g., from ~130GB to ~33GB for LLaMA-65B at 4-bit), with perplexity and instruction-following performance comparable to full-precision PEFT baselines on the evaluated benchmarks.

Problem Statement

Fine-tuning LLMs needs huge memory because pretrained weights remain in high precision. Quantization reduces model size for inference but is not usually compatible with parameter-efficient fine-tuning (PEFT) workflows. The paper asks: can we adapt quantized LLMs efficiently for tasks without restoring full-precision weights and while keeping inference acceleration?

Main Contribution

PEQA method: update only per-channel quantization scales while keeping integer weight indices frozen.

Empirical comparison across QAT, PEFT+PTQ, and PEQA, showing that PEQA matches or closely trails QAT on perplexity for 3-bit and 4-bit weights.
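The idea of training scales while freezing integer indices can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`rtn_quantize`, `dequantize`, `scale_gradient`), the asymmetric per-output-channel quantizer, the toy regression target, and the learning rate are all hypothetical choices made for the example.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Round-to-nearest, asymmetric, one scale and zero-point per output channel (row)."""
    qmax = 2 ** bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(W / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Reconstruct approximate weights from frozen integers and (trainable) scales."""
    return scale * (q.astype(np.float32) - zero)

def scale_gradient(q, zero, x, grad_out):
    """dL/dscale for y = x @ W_hat.T with W_hat = scale * (q - zero)."""
    grad_w = grad_out.T @ x                      # dL/dW_hat, shape (out, in)
    return (grad_w * (q.astype(np.float32) - zero)).sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)).astype(np.float32)  # stand-in for a pretrained weight matrix
q, scale, zero = rtn_quantize(W, bits=4)         # RTN gives the initial scales
q_before = q.copy()

# Toy "downstream task" (assumption): match the outputs of a rescaled weight matrix.
x = rng.normal(size=(32, 16)).astype(np.float32)
y_target = x @ (1.5 * W).T

def loss_fn(current_scale):
    y = x @ dequantize(q, current_scale, zero).T
    return float(np.mean((y - y_target) ** 2))

loss_start = loss_fn(scale)
lr = 1e-3  # hypothetical learning rate for this toy problem
for _ in range(200):
    y = x @ dequantize(q, scale, zero).T
    grad_out = 2.0 * (y - y_target) / y.size     # gradient of the mean squared error
    scale = scale - lr * scale_gradient(q, zero, x, grad_out)  # only scales move
loss_end = loss_fn(scale)

print(f"loss before: {loss_start:.4f}, after: {loss_end:.4f}")
print("integer weights unchanged:", np.array_equal(q, q_before))
```

Only the `(8, 1)` scale array is updated in the loop; the `uint8` matrix `q` is never written to, which is what keeps the model in low-bit form after fine-tuning.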

Key Findings

PEQA reduces deployed model size for LLaMA-65B from ~130.6GB to ~33.5GB at 4-bit.

Numbers: LoRA model size 130.57GB vs PEQA 33.45GB (Table 4).

Practical Use: You can deploy much larger LLaMA variants on the same hardware by switching to PEQA + 4-bit quantization.

Evidence Ref: Table 4
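A back-of-the-envelope check of the size figures above, assuming a nominal 65e9 parameters and 1GB = 1e9 bytes (both assumptions for illustration). The function name `weight_storage_gb` is hypothetical. The estimate ignores scales, zero-points, and any tensors kept in higher precision, which is presumably why it lands slightly below the reported numbers.

```python
def weight_storage_gb(n_params, bits_per_weight):
    """Storage for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 65e9  # nominal parameter count (assumption); the exact count differs slightly

fp16_gb = weight_storage_gb(N_PARAMS, 16)
int4_gb = weight_storage_gb(N_PARAMS, 4)
reported_ratio = 130.57 / 33.45  # sizes quoted from Table 4

print(f"FP16 estimate:  {fp16_gb:.1f} GB")   # 130.0 GB
print(f"4-bit estimate: {int4_gb:.1f} GB")   # 32.5 GB
print(f"reported size reduction: {reported_ratio:.1f}x")  # 3.9x
```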

PEQA cuts the number of trainable parameters versus LoRA for large models.

Numbers: LLaMA-65B learnable params: LoRA 10.49M vs PEQA 6.80M (Table 4).

Practical Use: Optimizer-state memory during fine-tuning drops, so you can train with less GPU memory or a larger batch per card.

Evidence Ref: Table 4
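The parameter counts above can be reproduced with simple arithmetic under some assumptions that this summary does not state: LLaMA-65B shapes of 80 layers, hidden size 8192, and feed-forward size 22016; rank-4 LoRA applied to the query and value projections only; and one PEQA scale per output channel of every linear layer in each transformer block. The function names are hypothetical.

```python
# Assumed LLaMA-65B shapes and LoRA setup (not stated in this summary).
N_LAYERS = 80
HIDDEN = 8192
FFN = 22016
LORA_RANK = 4  # assumed: rank-4 adapters on query and value projections

def lora_params(n_layers, hidden, rank, n_target_modules=2):
    # Each adapted module adds A (rank x hidden) and B (hidden x rank).
    return n_layers * n_target_modules * rank * (hidden + hidden)

def peqa_params(n_layers, hidden, ffn):
    # One trainable scale per output channel of each linear layer.
    attention = 4 * hidden   # q, k, v, o projections
    mlp = 2 * ffn + hidden   # gate and up (ffn outputs each), down (hidden outputs)
    return n_layers * (attention + mlp)

lora = lora_params(N_LAYERS, HIDDEN, LORA_RANK)
peqa = peqa_params(N_LAYERS, HIDDEN, FFN)
print(f"LoRA: {lora / 1e6:.2f}M, PEQA: {peqa / 1e6:.2f}M")  # LoRA: 10.49M, PEQA: 6.80M
```

Under these assumptions the counts round to the figures quoted from Table 4. Group-wise quantization would add more scales per channel and raise the PEQA count accordingly.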

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Model size (deployment) | 33.45GB (PEQA, LLaMA-65B, 4-bit) | 130.57GB (LoRA, LLaMA-65B, FP16) | -97.12GB | Table 4 | Table 4 reports model sizes for LLaMA-65B: LoRA 130.57GB, PEQA 33.45GB (4-bit). | Table 4 |
| Number of learnable parameters | 6.80M (PEQA, LLaMA-65B) | 10.49M (LoRA, LLaMA-65B) | -3.69M | Table 4 | Table 4 shows LoRA 10.49M vs PEQA 6.80M for LLaMA-65B. | Table 4 |

What To Try In 7 Days

Apply PEQA 4-bit to an existing large model pipeline to measure DRAM and inference latency reductions.

Run a small instruction-tuning job (Alpaca-style) with PEQA to test whether task accuracy recovers from simple PTQ loss.

Replace LoRA-based fine-tuning on one model with PEQA and compare optimizer memory and deployment size.

Optimization Features

Infra Optimization
enables larger models to run under constrained DRAM footprints
Model Optimization
sub-4-bit weight-only quantization; per-channel scales and integer weight indices; group-wise per-channel quantization option
System Optimization
lower peak GPU memory during fine-tuning; task-switching by swapping small scale vectors
Training Optimization
update only quantization scales (no integer weights); RTN (round-to-nearest) initialization for scales; reduced optimizer state due to fewer trainable params
Inference Optimization
keeps low-bit integer weights for fast quantized kernels; reduced DRAM reads in matrix-vector ops
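Task switching by swapping scale vectors can be pictured as one shared integer weight matrix plus a small per-task dictionary of scales. The sketch below is illustrative only: the class `SharedQuantLinear`, its methods, and the task names are hypothetical, and a real deployment would feed the integers and scales to a quantized kernel rather than dequantizing in NumPy.

```python
import numpy as np

class SharedQuantLinear:
    """One frozen integer weight matrix shared by all tasks; only scales differ."""

    def __init__(self, q, zero):
        self.q = q                # frozen integer indices, shape (out, in)
        self.zero = zero          # per-channel zero-points, shape (out, 1)
        self.task_scales = {}     # task name -> per-channel scales, shape (out, 1)

    def register_task(self, name, scale):
        self.task_scales[name] = scale

    def forward(self, x, task):
        # Dequantize with the selected task's scales, then apply the linear map.
        w_hat = self.task_scales[task] * (self.q.astype(np.float32) - self.zero)
        return x @ w_hat.T

rng = np.random.default_rng(0)
q_shared = rng.integers(0, 16, size=(4, 8)).astype(np.uint8)   # 4-bit indices
zero = np.full((4, 1), 8.0, dtype=np.float32)
layer = SharedQuantLinear(q_shared, zero)

scale_a = np.full((4, 1), 0.1, dtype=np.float32)
layer.register_task("task_a", scale_a)
layer.register_task("task_b", 2.0 * scale_a)   # a second, hypothetical task

x = rng.normal(size=(2, 8)).astype(np.float32)
y_a = layer.forward(x, "task_a")
y_b = layer.forward(x, "task_b")

print("values stored per task:", scale_a.size, "vs shared integer weights:", q_shared.size)
```

Switching tasks here touches only the `(out, 1)` scale array, so the large integer matrix can stay resident in memory across tasks.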

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

Wikitext2, PennTreeBank, Alpaca, MMLU

Risks & Boundaries

Limitations

PEQA does not update integer weight indices, so it may be less flexible than full QAT on some tasks.

RTN initialization and a small hyperparameter search were used; results might improve with more tuning.

When Not To Use

When you need the absolute best possible accuracy and can afford full QAT on full-precision weights.

If you cannot run any quantized inference kernels on your deployment hardware.

Failure Modes

PEQA may underperform if integer weight quantization introduced unrecoverable rounding artifacts for a specific task.

Insufficient tuning (learning rate/epochs) can leave PEQA short of full-precision performance on very large models.

Core Entities

Models

LLaMA, LLaMA 2, GPT-Neo, GPT-J, OPT

Metrics

Perplexity, Accuracy, ROUGE-L

Datasets

Wikitext2, PennTreeBank, Alpaca, MMLU

Benchmarks

MMLU, PIQA, HellaSwag, ARC-Challenge, ARC-Easy, OpenBookQA, Natural Instruction