Compress PEFT adapters 8–50x with sparse ternary encoding, often preserving or improving accuracy

November 22, 20238 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

Links

Abstract / PDF

Why It Matters For Business

ComPEFT slashes adapter size and transfer time so you can host many more task experts per GPU, reduce bandwidth costs, and cut serving latency without retraining.

Summary TLDR

ComPEFT compresses task-specific parameter updates (task vectors) by keeping only the top-k signs and representing magnitudes with a single scalar (std*α). No retraining is required. Across T5/T0/LLaMA families (200M–70B), it achieves roughly 8x–50x compression while usually matching or improving accuracy (e.g., +4.16% MMLU on LLaMA‑65B) and cutting download/load latency up to ~32x/25x. Large base models are more compressible and need less α tuning. Some PEFT variants (IA3) and smaller models need care.

Problem Statement

Expert PEFT adapters are growing in number and size. GPU memory limits force frequent swapping of adapters between disk/CPU and GPU, causing high communication overhead and latency. Example: QLoRA adapters for LLaMA‑65B can be ~3.2 GB, making swapping a performance bottleneck for multi-expert serving.

Main Contribution

ComPEFT: a two-step compressor that sparsifies task-vector signs (top‑k) and ternary-quantizes magnitudes to a single scalar scale (α*σ) without retraining.

Show 8x–50x storage compression across PEFT and full fine-tune residuals while preserving or often improving task accuracy, with stronger benefits at larger model scale.

Demonstrate practical gains: download and CPU→GPU load times cut by up to ~32x and ~25x, preserved compositional generalization, and improved merging in many cases.

Provide analyses and ablations that isolate sparsification, ternarization, and tuned scalar scaling as key components.

Key Findings

ComPEFT compresses PEFT updates by 8x–50x without retraining.

Numbers8x–50x compression (reported vs 16-bit checkpoints)

ComPEFT often improves accuracy as model size grows.

NumbersMMLU gains: +0.54% (7B), +1.06% (13B), +3.44% (33B), +4.16% (65B)

Storage and transmission times fall by orders of magnitude.

NumbersDownload time reductions up to 32x; CPU→GPU load reductions up to 25x

ComPEFT is Pareto-optimal in storage vs performance among tested PEFT methods.

NumbersComPEFT reduces adapter size >10× with minimal performance loss vs many PEFT baselines

Compressed adapters keep compositional/merging behavior.

NumbersSimilar BBH compositional results; merged models improved in 9/12 scenarios

Results

Compression ratio

Value8x–50x

Baselineoriginal 16-bit checkpoints

MMLU improvement (LLaMA family)

Value+4.16% (65B)

Baselineoriginal QLoRA checkpoint

Latency reduction (download)

Valueup to 32x

Baselineoriginal checkpoint download

Latency reduction (CPU→GPU load)

Valueup to 25x

Baselineoriginal CPU->GPU load time

Full fine-tune compression

Value12x–19x

Baselineuncompressed full-fine-tuned task vectors

Who Should Care

What To Try In 7 Days

Compress a few LoRA/QLoRA adapters with ComPEFT (top-k signs + α*σ) and measure model size and load time.

Sweep density k (5–50%) and α (0.5–10) on a small validation set; set α=1 for large models ≥13B as a fast default.

Replace adapter downloads in your serving path with Golomb-encoded ComPEFT blobs and measure real-world latency savings.

Optimization Features

Infra Optimization

  • reduced network bandwidth and CPU→GPU transfer times

Model Optimization

  • ternary quantization (signs + scalar)
  • sparsification of task-vector signs

System Optimization

  • Golomb coding for compact storage
  • binary masks for compute-friendly representations

Training Optimization

  • no retraining required
  • only scalar α tuned via small validation set

Inference Optimization

  • smaller adapter transfer reduces cold-start latency
  • binary-mask / bitwise ops enable faster sparse operations (with engineering)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • α (scalar) tuning remains necessary for smaller base models; defaults work better at ≥13B.
  • ComPEFT can perform poorly compressing IA3 on models with weak zero-shot ability; check per-method sensitivity.
  • To fully exploit sparse ternary speedups you need custom kernels or implementation effort.
  • No formal theory explaining why compression sometimes improves accuracy; improvements are empirical.

When Not To Use

  • When base model is small and you cannot afford validation-based α tuning.
  • When the PEFT method is IA3 on weak zero-shot bases (observed sensitivity).
  • When you will retrain or further fine-tune adapters after compression.

Failure Modes

  • Choosing α too small or too large can degrade accuracy, especially at extreme sparsity.
  • Excessive sparsity (too small k) can break few-shot or low-data tasks on small models.
  • Incorrect encoding/decoding or implementation bugs can negate compression benefits.

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-33B
  • LLaMA-65B
  • LLaMA2-70B
  • T5-Base
  • T5-Large
  • T0-3B
  • BERT-base
  • BERT-large
  • RoBERTa-base
  • RoBERTa-large

Metrics

  • Accuracy
  • Exact Match (EM)
  • Storage size (GB/MB)
  • Download time (s)
  • CPU→GPU load time (ms)

Datasets

  • MMLU
  • GLUE (7 tasks subset)
  • BBH (BigBench Hard)
  • Alpaca
  • Self-Instruct
  • FLAN-v2
  • OASST1
  • HH-RLHF
  • Chip2
  • Longform

Benchmarks

  • MMLU
  • GLUE
  • BBH