Compress PEFT adapters 8–50x with sparse ternary encoding, often preserving or improving accuracy

November 22, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple, needs no retraining, and shows consistent storage and latency wins; small models and IA3 adapters may need per-task α tuning and extra engineering to exploit sparse ternary operations.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

Links

Abstract / PDF / Code

Why It Matters For Business

ComPEFT slashes adapter size and transfer time so you can host many more task experts per GPU, reduce bandwidth costs, and cut serving latency without retraining.

Summary TLDR

ComPEFT compresses task-specific parameter updates (task vectors) by keeping only the signs of the top-k largest-magnitude entries and representing every kept magnitude with a single scalar (α*σ, where σ is the standard deviation of the task vector). No retraining is required. Across the T5, T0, and LLaMA families (200M–70B parameters), it achieves roughly 8x–50x compression while usually matching or improving accuracy (e.g., +4.16% MMLU on LLaMA‑65B), and it cuts download latency by up to ~32x and load latency by up to ~25x. Larger base models are more compressible and need less α tuning. Some PEFT variants (IA3) and smaller models need care.

Problem Statement

Expert PEFT adapters are growing in number and size. GPU memory limits force frequent swapping of adapters between disk/CPU and GPU, causing high communication overhead and latency. Example: QLoRA adapters for LLaMA‑65B can be ~3.2 GB, making swapping a performance bottleneck for multi-expert serving.

Main Contribution

ComPEFT: a two-step compressor that first sparsifies the task vector, keeping only the signs of its top‑k entries, and then replaces all kept magnitudes with a single scalar scale (α*σ), giving a sparse ternary vector without retraining.

Shows 8x–50x storage compression across PEFT and full fine-tune residuals while preserving or often improving task accuracy, with stronger benefits at larger model scale.
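The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `compeft_compress` and `compeft_decompress` are hypothetical, and it assumes top‑k means the k largest-magnitude entries and that σ is the standard deviation of the whole task vector.

```python
import numpy as np

def compeft_compress(task_vector, density=0.1, alpha=1.0):
    """Return (signs, scale): int8 signs in {-1, 0, +1} plus one float scale."""
    tau = np.asarray(task_vector, dtype=np.float32).ravel()
    k = max(1, int(round(density * tau.size)))
    # Indices of the k largest-magnitude entries (their internal order is irrelevant).
    top_idx = np.argpartition(np.abs(tau), tau.size - k)[-k:]
    signs = np.zeros(tau.size, dtype=np.int8)
    signs[top_idx] = np.sign(tau[top_idx]).astype(np.int8)
    # Assumption: sigma is computed over the full, uncompressed task vector.
    scale = float(alpha * tau.std())
    return signs, scale

def compeft_decompress(signs, scale, shape):
    """Rebuild a dense update: every kept entry becomes +scale or -scale."""
    return (scale * signs.astype(np.float32)).reshape(shape)

# Toy usage: the task vector is (fine-tuned weights - base weights).
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 32)).astype(np.float32)
finetuned = base + 0.01 * rng.normal(size=base.shape).astype(np.float32)

signs, scale = compeft_compress(finetuned - base, density=0.1, alpha=1.0)
restored = base + compeft_decompress(signs, scale, base.shape)
print(np.count_nonzero(signs), "of", signs.size, "entries kept; scale =", scale)
```

Only `signs` and `scale` need to be stored or shipped; the base weights are already resident wherever the adapter is applied.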

Key Findings

ComPEFT compresses PEFT updates by 8x–50x without retraining.

Numbers: 8x–50x compression (reported vs 16-bit checkpoints)

Practical Use: Store and transmit many more experts in the same memory budget; swapping and deployments become much cheaper and faster.

Evidence Ref: Abstract; §3 (Tables 1, 3, 4)
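A back-of-the-envelope estimate shows why ratios in this range are plausible. The sketch below is an idealized calculation, not the paper's accounting: it assumes nonzero positions are independent with probability equal to the density, that positive and negative signs are equally likely, and that an ideal entropy coder is used. The helper name `ternary_bits_per_param` is hypothetical.

```python
import math

def ternary_bits_per_param(density):
    """Entropy in bits per parameter of a sparse ternary vector.

    Assumes 0 < density < 1, independent positions, and balanced signs.
    """
    if not 0.0 < density < 1.0:
        raise ValueError("density must be strictly between 0 and 1")
    p = density
    position_bits = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    sign_bits = p  # one extra bit for each kept entry
    return position_bits + sign_bits

for density in (0.05, 0.1, 0.2, 0.5):
    bits = ternary_bits_per_param(density)
    print(f"density={density:.2f}  bits/param={bits:.3f}  vs 16-bit: {16 / bits:.1f}x")
```

Under these assumptions a 5% density works out to roughly 48x smaller than a 16-bit checkpoint and a 50% density to about 11x, which is in line with the reported 8x–50x range.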

ComPEFT often improves accuracy as model size grows.

Numbers: MMLU gains of +0.54% (7B), +1.06% (13B), +3.44% (33B), +4.16% (65B)

Practical Use: Compressing large-model adapters can be a net win: try compression first for big models before expensive retraining.

Evidence Ref: §3.1 (Table 1, Figure 2)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Compression ratio | 8x–50x | original 16-bit checkpoints | | across T5/T0/LLaMA experiments | ComPEFT reduces adapter sizes between 8x and 50x in reported experiments | Abstract; §3 |
| MMLU improvement (LLaMA family) | +4.16% (65B) | original QLoRA checkpoint | +4.16% vs original on MMLU | MMLU test | LLaMA‑65B compressed adapter yields +4.16% MMLU while being 26x smaller | §3.1 (Table 1) |

What To Try In 7 Days

Compress a few LoRA/QLoRA adapters with ComPEFT (top-k signs + α*σ) and measure model size and load time.

Sweep density k (5–50%) and α (0.5–10) on a small validation set; set α=1 for large models ≥13B as a fast default.

Replace adapter downloads in your serving path with Golomb-encoded ComPEFT blobs and measure real-world latency savings.
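The density and α sweep from the second step could look like the sketch below. Everything here is illustrative: `compress` and `sweep` are hypothetical helpers, and the demo uses negative reconstruction error as a stand-in score only so the example runs on its own. In a real pilot, `evaluate` would apply the compressed update to the model and return validation accuracy.

```python
import itertools
import numpy as np

def compress(tau, density, alpha):
    """Dense view of the compressed update: top-k signs times alpha * std(tau)."""
    flat = tau.ravel()
    k = max(1, int(round(density * flat.size)))
    keep = np.argpartition(np.abs(flat), flat.size - k)[-k:]
    out = np.zeros_like(flat)
    out[keep] = np.sign(flat[keep]) * alpha * flat.std()
    return out.reshape(tau.shape)

def sweep(tau, evaluate, densities=(0.05, 0.1, 0.2, 0.5), alphas=(0.5, 1, 2, 5, 10)):
    """Grid-search (density, alpha); evaluate() returns a score where higher is better."""
    results = []
    for density, alpha in itertools.product(densities, alphas):
        score = evaluate(compress(tau, density, alpha))
        results.append((score, density, alpha))
    return max(results), results

# Demo with a placeholder score (NOT validation accuracy).
rng = np.random.default_rng(0)
tau = (0.01 * rng.normal(size=(128, 16))).astype(np.float32)
proxy_score = lambda approx: -float(np.mean((approx - tau) ** 2))

(best_score, best_density, best_alpha), results = sweep(tau, proxy_score)
print("best density:", best_density, "best alpha:", best_alpha)
```

For base models at or above 13B, the sweep can be cut down to the density axis alone by fixing α=1, as suggested above.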

Optimization Features

Infra Optimization
reduced network bandwidth and CPU→GPU transfer times
Model Optimization
ternary quantization (signs + scalar)
sparsification of task-vector signs
System Optimization
Golomb coding for compact storage
binary masks for compute-friendly representations
Training Optimization
no retraining required
only scalar α tuned via small validation set
Inference Optimization
smaller adapter transfer reduces cold-start latency
binary-mask / bitwise ops enable faster sparse operations (with engineering)
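To make the storage side concrete, here is a small sketch of gap coding for a sparse ternary vector. It uses Golomb-Rice coding (the special case of Golomb coding with a power-of-two divisor) because it is simpler to write down; the bit layout (gap, then one sign bit per nonzero) and the names `rice_encode` and `rice_decode` are assumptions for illustration, not the format used by the released code.

```python
import numpy as np

def rice_encode(signs, rice_k=3):
    """Encode a ternary vector as a bit string. Requires rice_k >= 1.

    For each nonzero entry: the run of zeros before it (Rice-coded), then a sign bit.
    """
    bits = []
    prev = -1
    for pos in np.flatnonzero(signs):
        gap = int(pos) - prev - 1                # zeros skipped since the last nonzero
        q, r = divmod(gap, 1 << rice_k)
        bits.append("1" * q + "0")               # unary quotient, terminated by a 0
        bits.append(format(r, f"0{rice_k}b"))    # fixed-width binary remainder
        bits.append("1" if signs[pos] > 0 else "0")
        prev = int(pos)
    return "".join(bits)

def rice_decode(bitstring, length, rice_k=3):
    """Invert rice_encode; `length` restores any trailing zeros."""
    signs = np.zeros(length, dtype=np.int8)
    i, pos = 0, -1
    while i < len(bitstring):
        q = 0
        while bitstring[i] == "1":
            q += 1
            i += 1
        i += 1                                   # skip the terminating 0
        r = int(bitstring[i:i + rice_k], 2)
        i += rice_k
        pos += q * (1 << rice_k) + r + 1
        signs[pos] = 1 if bitstring[i] == "1" else -1
        i += 1
    return signs

# Round-trip check on a random vector with about 10% nonzeros.
rng = np.random.default_rng(0)
signs = rng.choice(np.array([-1, 0, 1], dtype=np.int8), size=1000, p=[0.05, 0.9, 0.05])
encoded = rice_encode(signs, rice_k=3)
assert np.array_equal(rice_decode(encoded, signs.size, rice_k=3), signs)
print(len(encoded) / signs.size, "bits per parameter")
```

A string of "0"/"1" characters keeps the sketch readable; a production encoder would pack the bits into bytes and pick `rice_k` based on the expected gap length for the chosen density.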

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

α (scalar) tuning remains necessary for smaller base models; defaults work better at ≥13B.

ComPEFT can perform poorly when compressing IA3 adapters on base models with weak zero-shot ability; check sensitivity for each PEFT method.

When Not To Use

When the base model is small and you cannot afford validation-based α tuning.

When the PEFT method is IA3 on weak zero-shot bases (observed sensitivity).

Failure Modes

Choosing α too small or too large can degrade accuracy, especially at extreme sparsity.

Excessive sparsity (too small k) can break few-shot or low-data tasks on small models.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B, LLaMA2-70B, T5-Base, T5-Large, T0-3B, BERT-base, BERT-large, RoBERTa-base, RoBERTa-large

Metrics

Accuracy, Exact Match (EM), Storage size (GB/MB), Download time (s), CPU→GPU load time (ms)

Datasets

MMLU, GLUE (7-task subset), BBH (BigBench Hard), Alpaca, Self-Instruct, FLAN-v2, OASST1, HH-RLHF, Chip2, Longform

Benchmarks

MMLU, GLUE, BBH