Overview
The method is simple, needs no retraining, and shows consistent storage and latency wins; small models and IA3 adapters may need per-task α tuning and extra engineering to exploit sparse ternary operations.
Citations2
Evidence Strength0.85
Confidence0.90
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
ComPEFT slashes adapter size and transfer time so you can host many more task experts per GPU, reduce bandwidth costs, and cut serving latency without retraining.
Who Should Care
Summary TLDR
ComPEFT compresses task-specific parameter updates (task vectors) by keeping only the top-k signs and representing magnitudes with a single scalar (std*α). No retraining is required. Across T5/T0/LLaMA families (200M–70B), it achieves roughly 8x–50x compression while usually matching or improving accuracy (e.g., +4.16% MMLU on LLaMA‑65B) and cutting download/load latency up to ~32x/25x. Large base models are more compressible and need less α tuning. Some PEFT variants (IA3) and smaller models need care.
Problem Statement
Expert PEFT adapters are growing in number and size. GPU memory limits force frequent swapping of adapters between disk/CPU and GPU, causing high communication overhead and latency. Example: QLoRA adapters for LLaMA‑65B can be ~3.2 GB, making swapping a performance bottleneck for multi-expert serving.
Main Contribution
ComPEFT: a two-step compressor that sparsifies task-vector signs (top‑k) and ternary-quantizes magnitudes to a single scalar scale (α*σ) without retraining.
Show 8x–50x storage compression across PEFT and full fine-tune residuals while preserving or often improving task accuracy, with stronger benefits at larger model scale.
Key Findings
ComPEFT compresses PEFT updates by 8x–50x without retraining.
ComPEFT often improves accuracy as model size grows.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Compression ratio | 8x–50x | original 16-bit checkpoints | — | across T5/T0/LLaMA experiments | ComPEFT reduces adapter sizes between 8x and 50x in reported experiments | Abstract; §3 |
| MMLU improvement (LLaMA family) | +4.16% (65B) | original QLoRA checkpoint | +4.16% vs original on MMLU | MMLU test | LLaMA‑65B compressed adapter yields +4.16% MMLU while being 26x smaller | §3.1 (Table 1) |
What To Try In 7 Days
Compress a few LoRA/QLoRA adapters with ComPEFT (top-k signs + α*σ) and measure model size and load time.
Sweep density k (5–50%) and α (0.5–10) on a small validation set; set α=1 for large models ≥13B as a fast default.
Replace adapter downloads in your serving path with Golomb-encoded ComPEFT blobs and measure real-world latency savings.
Optimization Features
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
α (scalar) tuning remains necessary for smaller base models; defaults work better at ≥13B.
ComPEFT can perform poorly compressing IA3 on models with weak zero-shot ability; check per-method sensitivity.
When Not To Use
When base model is small and you cannot afford validation-based α tuning.
When the PEFT method is IA3 on weak zero-shot bases (observed sensitivity).
Failure Modes
Choosing α too small or too large can degrade accuracy, especially at extreme sparsity.
Excessive sparsity (too small k) can break few-shot or low-data tasks on small models.

