Cut LoRA adapters down to r×r trainable matrices via SVD — 10–1000x less storage while matching accuracy

May 27, 2024 · 7 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.9

Citation Count

3

Authors

Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, Jacek Tabor

Links

Abstract / PDF

Why It Matters For Business

LoRA-XS lets teams store and deploy many task- or user-specific adapters at tiny cost; this lowers cloud storage and checkpointing expense and enables personalization at scale without extra inference latency.

Summary TLDR

LoRA-XS is a parameter-efficient fine-tuning method that freezes the low-rank projection matrices obtained from the SVD of the pretrained weights and trains only a small r×r matrix R between them. Adapter size is therefore independent of the model's hidden dimension and can range from a single parameter (r = 1) up to r^2 per adapted module. Experiments on GLUE, commonsense reasoning (LLaMA2/3), and math (GSM8K, MATH) show LoRA-XS matching or beating LoRA and VeRA while cutting trainable parameters by one or more orders of magnitude (examples: RoBERTa-large, LoRA 800K → LoRA-XS 60K; LLaMA3-8B, LoRA 57M → LoRA-XS 3.67M). The one-time SVD initialization costs less than 1% of fine-tuning time. Code is available.

Problem Statement

Adapters like LoRA reduce tuning cost but still scale with model hidden size, making per-user or per-task checkpoints large and expensive to store. The paper asks: can we make adapters arbitrarily small (down to one parameter) while keeping accuracy and runtime unchanged?
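The scaling difference can be made concrete with a little arithmetic. The sketch below is illustrative only: the hidden size, layer count, ranks, and number of adapted modules per layer are assumptions chosen for the example, not values stated in this summary, although they land close to the 800K and 60K figures quoted for RoBERTa-large.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one standard LoRA module: A (d_in x r) plus B (r x d_out)."""
    return r * (d_in + d_out)


def lora_xs_params(r: int) -> int:
    """Trainable parameters of one LoRA-XS module: only the r x r matrix R."""
    return r * r


# Assumed setup for illustration (hypothetical, not taken from this summary):
hidden = 1024   # hidden dimension of the model
layers = 24     # number of transformer layers

# LoRA with r=8 on 2 modules per layer (assumed).
lora_total = layers * 2 * lora_params(hidden, hidden, r=8)

# LoRA-XS with r=25 on 4 modules per layer (assumed).
xs_total = layers * 4 * lora_xs_params(r=25)

print(f"LoRA:    {lora_total:,} trainable parameters")
print(f"LoRA-XS: {xs_total:,} trainable parameters")
```

Note that `lora_params` grows with the hidden size while `lora_xs_params` does not take the hidden size as an argument at all, which is the property the paper is after.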

Main Contribution

LoRA-XS: initialize the LoRA projection matrices from the truncated SVD of the pretrained weights, freeze them, and train only a small r×r matrix R.

Show that the trainable parameter count becomes independent of model hidden size, enabling extreme storage reductions (examples across 7B models).

Extensive empirical evaluation on GLUE, GSM8K, MATH and eight commonsense datasets showing LoRA-XS matches or outperforms LoRA and VeRA.

Ablations and theory linking optimal adaptation subspace to top singular vectors of pretrained weights and practical SVD initialization advice.
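The core idea can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the singular values are folded into the left basis (A = U_r Σ_r, B = V_r^T), and R starts at zero so that the adapted layer initially reproduces the pretrained one. The released code may make different choices on each point.

```python
import numpy as np


def lora_xs_init(W: np.ndarray, r: int):
    """Build frozen bases A, B from the truncated SVD of W, plus a trainable R."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d_in, r): top-r left singular vectors scaled by singular values
    B = Vt[:r, :]          # (r, d_out): top-r right singular vectors
    R = np.zeros((r, r))   # assumption: zero init, so the update A @ R @ B starts at zero
    return A, B, R


def lora_xs_forward(x, W, A, R, B):
    """Pretrained path plus the low-rank update; only R would receive gradients."""
    return x @ W + (x @ A) @ R @ B


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # stand-in for a pretrained weight (d_in x d_out)
A, B, R = lora_xs_init(W, r=4)
x = rng.standard_normal((2, 64))
y = lora_xs_forward(x, W, A, R, B)

print("trainable parameters:", R.size)        # r * r
print("frozen basis shapes:", A.shape, B.shape)
```

Only `R` is stored per task or per user; `A` and `B` are derived from the shared pretrained weights and can be recomputed or cached once.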

Key Findings

Large parameter savings vs LoRA while keeping accuracy

Numbers: RoBERTa-large: LoRA 800K → LoRA-XS 60K; GLUE avg 87.82 → 88.69

Order-of-magnitude storage reduction on billion-scale models

Numbers: LLaMA3-8B: LoRA 57M → LoRA-XS 3.67M; avg acc 80.8 → 85.3

Better math reasoning with far fewer params

Numbers: Mistral-7B on GSM8K: LoRA-XS 70.35% (3.67M params) vs LoRA 67.7% (168M params)

SVD initialization is important and cheap

Numbers: SVD init <1% of fine-tuning time (e.g., SST-2: 10.6s SVD vs 7310s fine-tune)

Top singular vectors carry most task-relevant directions

Numbers: Keeping 10–25% top singular vectors preserves performance for attention modules in GLUE ablations

Results

GLUE average (RoBERTa-large)

Value: Full FT 88.17; LoRA 87.82 (800K); LoRA-XS 88.69 (60K)

Baseline: Full fine-tuning

Commonsense average (LLaMA3-8B)

Value: LoRA 80.8 (57M) → LoRA-XS 85.3 (3.67M)

Baseline: LoRA

GSM8K accuracy (Mistral-7B)

Value: LoRA 67.7 (168M) → LoRA-XS 70.35 (3.67M)

Baseline: LoRA

SVD overhead

Value: SVD init 10.6–19.1s vs full fine-tune 1,215–7,310s

Baseline: Total fine-tune time

What To Try In 7 Days

Run LoRA-XS on one existing LoRA adapter: compute storage per adapter and compare accuracy.

Add SVD initialization (use top singular vectors) and sweep small ranks (r=4–32) to find a size/accuracy sweet spot.

Measure SVD time once and confirm SVD overhead is <1% of fine-tune time on your hardware.
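A small helper can frame the rank sweep and the overhead check before any training is run. This is a rough sketch: the module count and the 2-bytes-per-parameter (fp16/bf16) storage assumption are hypothetical, and the only measured numbers used are the SST-2 timings quoted above (10.6s SVD, 7310s fine-tune).

```python
def adapter_bytes(r: int, n_modules: int, bytes_per_param: int = 2) -> int:
    """Storage for one LoRA-XS adapter: one r x r matrix per adapted module."""
    return n_modules * r * r * bytes_per_param


def svd_overhead(svd_seconds: float, finetune_seconds: float) -> float:
    """One-time SVD cost as a fraction of fine-tuning time."""
    return svd_seconds / finetune_seconds


N_MODULES = 96  # hypothetical: number of adapted weight matrices in the model

# Sweep the suggested small ranks and report per-adapter storage.
for r in (4, 8, 16, 32):
    print(f"r={r:>2}: {adapter_bytes(r, N_MODULES):,} bytes per adapter")

# Overhead check using the SST-2 timings reported in the summary.
ratio = svd_overhead(10.6, 7310.0)
print(f"SVD overhead: {ratio:.2%} of fine-tune time")
```

Replace the timings with measurements from your own hardware; the storage figures only change with `r` and the number of adapted modules, not with the model's hidden size.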

Optimization Features

Infra Optimization

  • enables many small checkpoints (saves storage and I/O costs)

Model Optimization

  • low-rank adaptation with frozen SVD bases
  • train only an r×r adapter R

System Optimization

  • adapter storage independent of hidden dimension

Training Optimization

  • SVD-based initialization speeds early convergence
  • one-time SVD cost is small relative to training

Inference Optimization

  • no extra inference latency; adapters merge into weights post-training
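
The merge step behind the "no extra latency" claim can be checked numerically. In this sketch, random matrices stand in for the frozen bases and the trained R (an assumption for illustration; in practice A and B would come from the SVD of the pretrained weight), and `merge_adapter` is a hypothetical helper name.

```python
import numpy as np


def merge_adapter(W, A, R, B):
    """Fold the low-rank update into the base weight: W' = W + A R B."""
    return W + A @ R @ B


rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4
W = rng.standard_normal((d_in, d_out))   # stand-in for a pretrained weight
A = rng.standard_normal((d_in, r))       # stand-in for the frozen left basis
B = rng.standard_normal((r, d_out))      # stand-in for the frozen right basis
R = rng.standard_normal((r, r))          # stand-in for a trained r x r matrix
x = rng.standard_normal((5, d_in))

W_merged = merge_adapter(W, A, R, B)
unmerged = x @ W + x @ A @ R @ B         # adapter applied as a separate branch
merged = x @ W_merged                    # single matmul after merging

print("same output:", np.allclose(merged, unmerged))
print("merged weight shape:", W_merged.shape)
```

After merging, the layer has the same shape and cost as the original weight, so serving a merged adapter is no more expensive than serving the base model.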

Reproducibility

Data Urls

  • GLUE
  • GSM8K
  • MATH
  • MetaMathQA
  • BoolQ
  • PIQA
  • SIQA
  • HellaSwag
  • WinoGrande
  • OBQA

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance depends on how similar the fine-tuning task is to pretraining; exceptions (e.g., SST-2) exist where SVD init helps less or random init can be better.
  • Very low ranks (extreme compression) cause measurable accuracy drops; output dense layers need a higher rank than attention layers.
  • Method requires computing SVD for each adapted weight matrix (one-time cost and storage of frozen bases).

When Not To Use

  • When you can afford full fine-tuning and need to update all weights for a wildly different domain.
  • When extreme hyperparameter stability is critical and you cannot validate SVD vs random initialization per task.
  • When adapter size is not the bottleneck and you prefer simpler adapter schemes.

Failure Modes

  • Accuracy drops if rank r is set too small for output dense layers.
  • Poor initialization or wrong inclusion of singular values can harm convergence for some tasks.
  • Inconsistent gains when task is poorly aligned with pretraining (e.g., sentiment SST-2).

Core Entities

Models

  • RoBERTa-large
  • LLaMA2-7B
  • LLaMA3-8B
  • Mistral-7B
  • Gemma-7B
  • GPT-3 (example)

Metrics

  • Accuracy
  • Matthews correlation
  • Pearson correlation
  • runtime seconds
  • trainable parameter count

Datasets

  • GLUE
  • GSM8K
  • MATH
  • MetaMathQA
  • BoolQ
  • PIQA
  • SIQA
  • HellaSwag
  • WinoGrande
  • OBQA
  • ARC-e
  • ARC-c

Benchmarks

  • GLUE
  • GSM8K
  • MATH
  • Commonsense reasoning (8 datasets)