Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
ComPEFT slashes adapter size and transfer time so you can host many more task experts per GPU, reduce bandwidth costs, and cut serving latency without retraining.
Summary TLDR
ComPEFT compresses task-specific parameter updates (task vectors) by keeping only the signs of the top-k largest-magnitude entries and replacing every magnitude with a single scalar (α*σ, a tuned multiple α of the update's standard deviation σ). No retraining is required. Across the T5/T0/LLaMA families (200M–70B), it achieves roughly 8x–50x compression while usually matching or improving accuracy (e.g., +4.16% MMLU on LLaMA‑65B) and cuts download and CPU→GPU load latency by up to ~32x and ~25x, respectively. Larger base models are more compressible and need less α tuning. Some PEFT variants (IA3) and smaller models need care.
Problem Statement
Expert PEFT adapters are growing in number and size. GPU memory limits force frequent swapping of adapters between disk/CPU and GPU, causing high communication overhead and latency. Example: QLoRA adapters for LLaMA‑65B can be ~3.2 GB, making swapping a performance bottleneck for multi-expert serving.
Main Contribution
ComPEFT: a two-step compressor that first sparsifies the task vector to its top‑k entries by magnitude, then ternary-quantizes the surviving entries to their signs times a single scalar scale (α*σ), without retraining.
Show 8x–50x storage compression across PEFT and full fine-tune residuals while preserving or often improving task accuracy, with stronger benefits at larger model scale.
Demonstrate practical gains: download and CPU→GPU load times cut by up to ~32x and ~25x, preserved compositional generalization, and improved merging in many cases.
Provide analyses and ablations that isolate sparsification, ternarization, and tuned scalar scaling as key components.
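The two steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the flat-list representation of the task vector, and the choice to compute σ as the population standard deviation of the uncompressed task vector are all assumptions made for the example.

```python
import statistics

def compeft_compress(task_vector, density=0.2, alpha=1.0):
    """Sketch: keep the top-k entries by magnitude, then ternarize.

    Returns a list of signs in {-1, 0, +1} and one scalar scale (alpha * sigma).
    `density` is the fraction of entries kept (hypothetical parameter name).
    """
    n = len(task_vector)
    k = max(1, int(round(density * n)))
    # Indices of the k largest-magnitude entries.
    keep = sorted(range(n), key=lambda i: abs(task_vector[i]), reverse=True)[:k]
    signs = [0] * n
    for i in keep:
        v = task_vector[i]
        signs[i] = (v > 0) - (v < 0)  # +1, -1, or 0 if the entry is exactly zero
    # Assumption: sigma is the std of the original (dense) task vector.
    scale = alpha * statistics.pstdev(task_vector)
    return signs, scale

def compeft_decompress(signs, scale):
    """Rebuild the approximate task vector: scale * sign for each entry."""
    return [scale * s for s in signs]

# Usage: the task vector is (fine-tuned weights - base weights), flattened.
tv = [0.5, -0.1, 0.02, -0.8, 0.0, 0.3, -0.05, 0.01, 0.9, -0.4]
signs, scale = compeft_compress(tv, density=0.3, alpha=1.0)
approx = compeft_decompress(signs, scale)
```

Only `signs` and the single float `scale` need to be stored or transferred, which is where the compression comes from.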
Key Findings
ComPEFT compresses PEFT updates by 8x–50x without retraining.
ComPEFT often improves accuracy as model size grows.
Download and CPU→GPU load times fall by up to ~32x and ~25x, respectively.
ComPEFT is Pareto-optimal in storage vs performance among tested PEFT methods.
Compressed adapters keep compositional/merging behavior.
Results
Compression ratio
MMLU improvement (LLaMA family)
Latency reduction (download)
Latency reduction (CPU→GPU load)
Full fine-tune compression
Who Should Care
What To Try In 7 Days
Compress a few LoRA/QLoRA adapters with ComPEFT (top-k signs + α*σ) and measure model size and load time.
Sweep density k (5–50%) and α (0.5–10) on a small validation set; set α=1 for large models ≥13B as a fast default.
Replace adapter downloads in your serving path with Golomb-encoded ComPEFT blobs and measure real-world latency savings.
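The density and α sweep from the second step can be sketched as a plain grid search. Everything named here is hypothetical: `compress_fn` stands for whatever compressor you use (called as `compress_fn(task_vector, density, alpha)`), `evaluate_fn` is assumed to return a validation score where higher is better, and the grid values are just one way to cover the 5–50% and 0.5–10 ranges.

```python
from itertools import product

def sweep_density_alpha(task_vector, compress_fn, evaluate_fn,
                        densities=(0.05, 0.1, 0.2, 0.3, 0.5),
                        alphas=(0.5, 1, 2, 3, 4, 5, 6, 8, 10)):
    """Grid-search density and alpha; return (best_score, density, alpha).

    compress_fn and evaluate_fn are caller-supplied (hypothetical) callables.
    """
    best = None
    for density, alpha in product(densities, alphas):
        compressed = compress_fn(task_vector, density, alpha)
        score = evaluate_fn(compressed)  # e.g. accuracy on a small validation set
        if best is None or score > best[0]:
            best = (score, density, alpha)
    return best
```

For base models of 13B parameters and up, passing `alphas=(1,)` reduces this to a density-only sweep, matching the fast default suggested above.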
Optimization Features
Infra Optimization
- reduced network bandwidth and CPU→GPU transfer times
Model Optimization
- ternary quantization (signs + scalar)
- sparsification of task-vector signs
System Optimization
- Golomb coding for compact storage
- binary masks for compute-friendly representations
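To make the Golomb coding idea concrete, here is a small sketch of how a sparse ternary vector could be stored. It uses Rice coding (the special case of Golomb coding where the divisor is a power of two, 2**b) on the gaps between nonzero positions, plus one bit per nonzero sign. The bit-string representation, the function names, and the exact layout are assumptions for illustration and may differ from the paper's encoder.

```python
def rice_encode(values, b):
    """Encode non-negative ints with Rice coding (Golomb with M = 2**b)."""
    bits = []
    for v in values:
        q, r = v >> b, v & ((1 << b) - 1)
        bits.append("1" * q + "0")            # quotient in unary, 0-terminated
        if b:
            bits.append(format(r, f"0{b}b"))  # remainder in b bits
    return "".join(bits)

def rice_decode(bits, b, count):
    """Decode `count` Rice-coded ints from a bit string."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the terminating 0
        r = int(bits[pos:pos + b], 2) if b else 0
        pos += b
        values.append((q << b) | r)
    return values

def encode_ternary(signs, b=2):
    """Store gaps between nonzero positions (Rice-coded) and one sign bit each."""
    positions = [i for i, s in enumerate(signs) if s != 0]
    gaps = [p - prev - 1 for prev, p in zip([-1] + positions[:-1], positions)]
    sign_bits = "".join("1" if signs[p] > 0 else "0" for p in positions)
    return rice_encode(gaps, b), sign_bits

def decode_ternary(gap_bits, sign_bits, n, b=2):
    """Invert encode_ternary for a vector of length n."""
    gaps = rice_decode(gap_bits, b, len(sign_bits))
    signs, pos = [0] * n, -1
    for gap, bit in zip(gaps, sign_bits):
        pos += gap + 1
        signs[pos] = 1 if bit == "1" else -1
    return signs
```

The parameter `b` would normally be chosen from the density: sparser vectors have longer gaps and benefit from a larger `b`.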
Training Optimization
- no retraining required
- only scalar α tuned via small validation set
Inference Optimization
- smaller adapter transfer reduces cold-start latency
- binary-mask / bitwise ops enable faster sparse operations (requires custom kernels or implementation effort)
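As a rough illustration of the binary-mask idea, a ternary vector can be held as two bitmasks (one for +1 entries, one for -1 entries), after which operations such as a dot product between two ternary vectors reduce to AND plus popcount. This sketch uses Python integers as bitsets and hypothetical helper names; a production version would need packed arrays and custom kernels.

```python
def to_masks(signs):
    """Pack a ternary vector into (positive_mask, negative_mask) bitsets."""
    pos = neg = 0
    for i, s in enumerate(signs):
        if s > 0:
            pos |= 1 << i
        elif s < 0:
            neg |= 1 << i
    return pos, neg

def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")

def ternary_dot(a_masks, b_masks):
    """Dot product of two ternary vectors given as mask pairs."""
    (ap, an), (bp, bn) = a_masks, b_masks
    agree = popcount(ap & bp) + popcount(an & bn)      # same nonzero sign
    disagree = popcount(ap & bn) + popcount(an & bp)   # opposite signs
    return agree - disagree

# Usage: multiply the result by both scalar scales to get the real-valued dot product.
a = to_masks([1, 0, -1, 1])
b = to_masks([1, -1, -1, -1])
dot = ternary_dot(a, b)
```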
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- α (scalar) tuning remains necessary for smaller base models; defaults work better at ≥13B.
- ComPEFT can perform poorly compressing IA3 on models with weak zero-shot ability; check per-method sensitivity.
- To fully exploit sparse ternary speedups you need custom kernels or implementation effort.
- No formal theory explaining why compression sometimes improves accuracy; improvements are empirical.
When Not To Use
- When base model is small and you cannot afford validation-based α tuning.
- When the PEFT method is IA3 on weak zero-shot bases (observed sensitivity).
- When you will retrain or further fine-tune adapters after compression.
Failure Modes
- Choosing α too small or too large can degrade accuracy, especially at extreme sparsity.
- Excessive sparsity (too small k) can break few-shot or low-data tasks on small models.
- Incorrect encoding/decoding or implementation bugs can negate compression benefits.
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- LLaMA-33B
- LLaMA-65B
- LLaMA2-70B
- T5-Base
- T5-Large
- T0-3B
- BERT-base
- BERT-large
- RoBERTa-base
- RoBERTa-large
Metrics
- Accuracy
- Exact Match (EM)
- Storage size (GB/MB)
- Download time (s)
- CPU→GPU load time (ms)
Datasets
- MMLU
- GLUE (7 tasks subset)
- BBH (BigBench Hard)
- Alpaca
- Self-Instruct
- FLAN-v2
- OASST1
- HH-RLHF
- Chip2
- Longform
Benchmarks
- MMLU
- GLUE
- BBH

