Compress PEFT adapters 8–50x with sparse ternary encoding, often preserving or improving accuracy

Overview

Decision Snapshot: Ready For Pilot

The method is simple, needs no retraining, and shows consistent storage and latency wins; small models and IA3 adapters may need per-task α tuning and extra engineering to exploit sparse ternary operations.

Citations: 2

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

Links

Abstract / PDF / Code

Why It Matters For Business

ComPEFT slashes adapter size and transfer time so you can host many more task experts per GPU, reduce bandwidth costs, and cut serving latency without retraining.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

ComPEFT compresses task-specific parameter updates (task vectors) by keeping only the signs of the top-k entries and replacing all magnitudes with a single scalar, α*σ, where σ is the standard deviation of the task vector. No retraining is required. Across T5/T0/LLaMA families (200M–70B), it achieves roughly 8x–50x compression while usually matching or improving accuracy (e.g., +4.16% MMLU on LLaMA‑65B), and it cuts download latency by up to ~32x and CPU→GPU load latency by up to ~25x. Larger base models are more compressible and need less α tuning. Some PEFT variants (IA3) and smaller models need care.

Problem Statement

Expert PEFT adapters are growing in number and size. GPU memory limits force frequent swapping of adapters between disk/CPU and GPU, causing high communication overhead and latency. Example: QLoRA adapters for LLaMA‑65B can be ~3.2 GB, making swapping a performance bottleneck for multi-expert serving.

Main Contribution

ComPEFT: a two-step compressor that sparsifies task-vector signs (top‑k) and ternary-quantizes magnitudes to a single scalar scale (α*σ) without retraining.

Shows 8x–50x storage compression across PEFT and full fine-tune residuals while preserving or often improving task accuracy, with stronger benefits at larger model scale.
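The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that "top-k" means the k largest-magnitude entries and that σ is the standard deviation of the dense task vector before sparsification.

```python
import numpy as np

def compeft_compress(task_vector, density=0.1, alpha=1.0):
    """Return (signs, scale): ternary signs in {-1, 0, +1} plus one float scale.

    Assumptions (not confirmed by this summary): top-k is chosen by
    magnitude, and sigma is the std of the dense task vector.
    """
    tv = np.asarray(task_vector, dtype=np.float32).ravel()
    k = max(1, int(round(density * tv.size)))
    # Indices of the k largest-magnitude entries (order within them is arbitrary).
    top_idx = np.argpartition(np.abs(tv), tv.size - k)[-k:]
    signs = np.zeros(tv.size, dtype=np.int8)
    signs[top_idx] = np.sign(tv[top_idx]).astype(np.int8)
    scale = float(alpha * tv.std())
    return signs, scale

def compeft_decompress(signs, scale):
    """Rebuild a dense approximation: every kept entry becomes +/- scale."""
    return signs.astype(np.float32) * scale

# Toy task vector standing in for (fine-tuned weights - base weights).
rng = np.random.default_rng(0)
tau = rng.normal(scale=0.02, size=1000).astype(np.float32)
signs, scale = compeft_compress(tau, density=0.1, alpha=1.0)
tau_hat = compeft_decompress(signs, scale)
```

Only `signs` and the single float `scale` need to be stored; the dense approximation `tau_hat` is rebuilt at load time and added back to the base weights.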

Key Findings

ComPEFT compresses PEFT updates by 8x–50x without retraining.

Numbers: 8x–50x compression (reported vs 16-bit checkpoints)

Practical Use: Store and transmit many more experts in the same memory budget; swapping and deployments become much cheaper and faster.

Evidence Ref: Abstract; §3 (Tables 1, 3, 4)

ComPEFT often improves accuracy as model size grows.

Numbers: MMLU gains: +0.54% (7B), +1.06% (13B), +3.44% (33B), +4.16% (65B)

Practical Use: Compressing large-model adapters can be a net win: try compression first for big models before expensive retraining.

Evidence Ref: §3.1 (Table 1, Figure 2)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Compression ratio	8x–50x	original 16-bit checkpoints	—	across T5/T0/LLaMA experiments	ComPEFT reduces adapter sizes between 8x and 50x in reported experiments	Abstract; §3
MMLU improvement (LLaMA family)	+4.16% (65B)	original QLoRA checkpoint	+4.16% vs original on MMLU	MMLU test	LLaMA‑65B compressed adapter yields +4.16% MMLU while being 26x smaller	§3.1 (Table 1)
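To put the compression ratio in serving terms, the arithmetic below uses the ~3.2 GB LLaMA‑65B QLoRA adapter size and the 26x reduction quoted in this summary. The 10 GB memory budget and the helper name are hypothetical, chosen only for illustration, and the calculation counts stored size only (it ignores any decoding or decompression overhead).

```python
import math

def adapters_that_fit(budget_gb, adapter_gb, compression_ratio=1.0):
    """How many adapters of a given stored size fit in a fixed memory budget."""
    return math.floor(budget_gb * compression_ratio / adapter_gb)

ADAPTER_GB = 3.2   # QLoRA adapter for LLaMA-65B, from the Problem Statement
RATIO = 26         # reported size reduction for the LLaMA-65B adapter
BUDGET_GB = 10.0   # hypothetical budget reserved for adapters

compressed_gb = ADAPTER_GB / RATIO  # roughly 0.12 GB per adapter
uncompressed_fit = adapters_that_fit(BUDGET_GB, ADAPTER_GB)
compressed_fit = adapters_that_fit(BUDGET_GB, ADAPTER_GB, RATIO)
print(uncompressed_fit, compressed_fit)  # prints "3 81"
```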

What To Try In 7 Days

Compress a few LoRA/QLoRA adapters with ComPEFT (top-k signs + α*σ) and measure model size and load time.

Sweep density k (5–50%) and α (0.5–10) on a small validation set; set α=1 for large models ≥13B as a fast default.
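The sweep over density and α can be organized as a plain grid search. The sketch below is illustrative only: `compress_decompress`, `sweep`, and `toy_evaluate` are hypothetical names, the compression step assumes top-k by magnitude with scale α*σ, and the toy scorer (negative reconstruction error) is merely a stand-in for real validation accuracy, which is what should be plugged in for `evaluate`.

```python
import itertools
import numpy as np

def compress_decompress(tv, density, alpha):
    """Sparse ternary approximation of tv (assumed: top-k by magnitude, scale alpha*std)."""
    k = max(1, int(round(density * tv.size)))
    keep = np.argpartition(np.abs(tv), tv.size - k)[-k:]
    approx = np.zeros_like(tv)
    approx[keep] = np.sign(tv[keep]) * alpha * tv.std()
    return approx

def sweep(tv, evaluate, densities, alphas):
    """Return (best_score, best_density, best_alpha); higher score is better."""
    best = None
    for density, alpha in itertools.product(densities, alphas):
        score = evaluate(compress_decompress(tv, density, alpha))
        if best is None or score > best[0]:
            best = (score, density, alpha)
    return best

rng = np.random.default_rng(0)
tv = rng.normal(scale=0.02, size=10_000).astype(np.float32)

def toy_evaluate(approx):
    # Stand-in metric only: replace with accuracy on a small validation set.
    return -float(np.mean((approx - tv) ** 2))

densities = [0.05, 0.1, 0.2, 0.5]   # the 5-50% range suggested above
alphas = [0.5, 1.0, 2.0, 5.0, 10.0]  # the 0.5-10 range suggested above
best_score, best_density, best_alpha = sweep(tv, toy_evaluate, densities, alphas)
```

Because no retraining is involved, each grid point costs one compression pass plus one validation run, so even the full grid stays cheap relative to fine-tuning.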

Replace adapter downloads in your serving path with Golomb-encoded ComPEFT blobs and measure real-world latency savings.
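For intuition about why Golomb coding suits sparse ternary vectors, the sketch below encodes the run of zeros before each nonzero entry with Golomb-Rice coding (the power-of-two special case of Golomb coding) and stores one extra bit per nonzero for its sign. The blob layout, function names, and the parameter `b` are assumptions for illustration; the on-disk format used by the ComPEFT code may differ.

```python
import numpy as np

def rice_encode(values, b):
    """Golomb-Rice code: quotient in unary (ones then a zero), remainder in b bits."""
    bits = []
    for n in values:
        q, r = n >> b, n & ((1 << b) - 1)
        bits.extend([1] * q + [0])
        bits.extend((r >> i) & 1 for i in reversed(range(b)))
    return bits

def rice_decode(bits, b, count):
    """Inverse of rice_encode for a known number of values."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == 1:
            q, pos = q + 1, pos + 1
        pos += 1  # skip the terminating zero of the unary part
        r = 0
        for _ in range(b):
            r, pos = (r << 1) | bits[pos], pos + 1
        values.append((q << b) | r)
    return values

def encode_ternary(signs, b=3):
    """Return (gap_bits, sign_bits) for a vector with entries in {-1, 0, +1}."""
    idx = np.flatnonzero(signs)
    gaps = np.diff(idx, prepend=-1) - 1  # number of zeros before each nonzero
    sign_bits = [1 if s > 0 else 0 for s in signs[idx]]
    return rice_encode(gaps.tolist(), b), sign_bits

def decode_ternary(gap_bits, sign_bits, size, b=3):
    out = np.zeros(size, dtype=np.int8)
    pos = -1
    for gap, s in zip(rice_decode(gap_bits, b, len(sign_bits)), sign_bits):
        pos += gap + 1
        out[pos] = 1 if s else -1
    return out

rng = np.random.default_rng(0)
signs = np.zeros(1000, dtype=np.int8)
nonzero = rng.choice(1000, size=100, replace=False)
signs[nonzero] = rng.choice(np.array([-1, 1], dtype=np.int8), size=100)
gap_bits, sign_bits = encode_ternary(signs)
total_bits = len(gap_bits) + len(sign_bits)
restored = decode_ternary(gap_bits, sign_bits, signs.size)
```

Comparing `total_bits` with `16 * signs.size` (the cost of a dense 16-bit checkpoint) gives a rough sense of the achievable ratio at a given density; a production encoder would pack the bits into bytes rather than keep them in Python lists.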

Optimization Features

Infra Optimization

reduced network bandwidth and CPU→GPU transfer times

Model Optimization

ternary quantization (signs + scalar)

sparsification of task-vector signs

System Optimization

Golomb coding for compact storage

binary masks for compute-friendly representations

Training Optimization

no retraining required

only the scalar α is tuned, via a small validation set

Inference Optimization

smaller adapter transfer reduces cold-start latency

binary-mask / bitwise ops enable faster sparse operations (with engineering)
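As one example of what a binary-mask representation makes possible, a ternary vector can be stored as two bit-packed masks (positive and negative entries), and the dot product of two such vectors then reduces to bitwise ANDs and bit counts. This is a sketch of the general idea under that assumed layout, with hypothetical function names; it is not taken from the ComPEFT code, and a real kernel would use hardware popcount instead of `np.unpackbits`.

```python
import numpy as np

def to_masks(signs):
    """Pack a {-1, 0, +1} vector into two bit masks (1 bit per entry each)."""
    return np.packbits(signs > 0), np.packbits(signs < 0)

def popcount(packed):
    """Count set bits in a packed uint8 array (simple, not optimized)."""
    return int(np.unpackbits(packed).sum())

def ternary_dot(a, b):
    """Dot product of two ternary vectors given as (pos_mask, neg_mask) pairs."""
    (a_pos, a_neg), (b_pos, b_neg) = a, b
    agree = popcount(a_pos & b_pos) + popcount(a_neg & b_neg)
    disagree = popcount(a_pos & b_neg) + popcount(a_neg & b_pos)
    return agree - disagree

rng = np.random.default_rng(0)
x = rng.integers(-1, 2, size=1003).astype(np.int8)
y = rng.integers(-1, 2, size=1003).astype(np.int8)
dot_bits = ternary_dot(to_masks(x), to_masks(y))
dot_ref = int(np.dot(x.astype(np.int64), y.astype(np.int64)))
```

Multiplying `dot_bits` by the two vectors' scalar scales gives the dot product of the decompressed vectors, without ever materializing them in floating point.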

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/prateeky2806/ComPEFT

Risks & Boundaries

Limitations

α (scalar) tuning remains necessary for smaller base models; defaults work better at ≥13B.

ComPEFT can perform poorly when compressing IA3 adapters on base models with weak zero-shot ability; check sensitivity for each PEFT method.

When Not To Use

When the base model is small and you cannot afford validation-based α tuning.

When the PEFT method is IA3 on weak zero-shot bases (observed sensitivity).

Failure Modes

Choosing α too small or too large can degrade accuracy, especially at extreme sparsity.

Excessive sparsity (too small k) can break few-shot or low-data tasks on small models.

Core Entities

Models

LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-65B, LLaMA2-70B, T5-Base, T5-Large, T0-3B, BERT-base, BERT-large, RoBERTa-base, RoBERTa-large

Metrics

Accuracy, Exact Match (EM), Storage size (GB/MB), Download time (s), CPU→GPU load time (ms)

Datasets

MMLU, GLUE (7-task subset), BBH (BigBench Hard), Alpaca, Self-Instruct, FLAN-v2, OASST1, HH-RLHF, Chip2, Longform

Benchmarks

MMLU, GLUE, BBH

