VeRA shares frozen random matrices and learns tiny scaling vectors to cut finetuning params 10–100× with similar performance

October 17, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

VeRA is ready for prototyping and limited production: it was validated across language and vision tasks, sharply reduces stored adapter size, and has negligible inference impact, but it needs hyperparameter tuning and has so far been evaluated only on Transformer-family models.

Citations: 8

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 80%

Novelty: 70%

Authors

Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

Links

Abstract / PDF / Code / Data

Why It Matters For Business

VeRA cuts the bytes required per adapted model by 10–100×, so firms can keep many personalized or task-specific adapters on the same GPU. That reduces serving costs, speeds model swap-in, and lowers storage and network bandwidth for model variants.

Who Should Care

Summary TLDR

VeRA is a parameter-efficient finetuning method that freezes a single pair of random matrices shared across layers and learns small per-layer scaling vectors. This lets you store and swap many adapted models cheaply: on GLUE and vision tasks VeRA matches or beats LoRA while using roughly 10× fewer trainable parameters; on instruction tuning it reduces trainable parameters by ~100× with similar GPT-4-judged scores. There is no extra inference cost because the trained vectors can be merged back into the model weights.

Problem Statement

Finetuning large Transformer models per task or per user creates a storage bottleneck: LoRA and similar methods cut compute but still require storing many megabytes or gigabytes per adapter. The paper asks: can we reduce the bytes needed per adapted model by orders of magnitude while keeping performance close to LoRA?

Main Contribution

Introduce VeRA: freeze a single pair of random matrices shared across adapted layers and learn tiny scaling vectors per layer.

Show VeRA matches LoRA performance on GLUE and image tasks while using an order of magnitude fewer trainable parameters.
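The idea in the two points above can be made concrete with a minimal sketch, assuming the adapted layer computes h = W0 x + diag(b) B diag(d) A x, where A and B are the frozen shared random matrices and d and b are the per-layer scaling vectors (the names d and b follow the Failure Modes section). The dimensions, the uniform random init, and the starting values d = 0.1 and b = 0 are illustrative assumptions, not the paper's settings, and plain Python lists stand in for tensors.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def random_matrix(rows, cols, rng):
    """Frozen random matrix (illustrative uniform init, an assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vera_forward(W0, A, B, d, b, x):
    """h = W0 x + diag(b) B diag(d) A x; only d and b would be trained."""
    low = matvec(A, x)                           # project down to rank r
    low = [d_k * v for d_k, v in zip(d, low)]    # scale by d (length r)
    up = matvec(B, low)                          # project back up to d_out
    up = [b_i * v for b_i, v in zip(b, up)]      # scale by b (length d_out)
    base = matvec(W0, x)                         # frozen pretrained path
    return [h0 + dh for h0, dh in zip(base, up)]

rng = random.Random(0)
d_in, d_out, r = 4, 3, 2
W0 = random_matrix(d_out, d_in, rng)   # stand-in for a frozen pretrained weight
A = random_matrix(r, d_in, rng)        # frozen, shared across adapted layers
B = random_matrix(d_out, r, rng)       # frozen, shared across adapted layers
x = [1.0, -2.0, 0.5, 3.0]

# Two layers reuse the same A and B but own separate (d, b) vectors.
layer1 = {"d": [0.1] * r, "b": [0.0] * d_out}    # b = 0: adapter starts as a no-op
layer2 = {"d": [0.1] * r, "b": [0.5, -0.5, 1.0]}

h1 = vera_forward(W0, A, B, layer1["d"], layer1["b"], x)
h2 = vera_forward(W0, A, B, layer2["d"], layer2["b"], x)
trainable_per_layer = r + d_out        # versus r * (d_in + d_out) for LoRA
```

With b set to zero the adapter branch contributes nothing, so h1 equals the base output W0 x; layer2 shows the same frozen matrices producing a different update once its own vectors are non-zero.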

Key Findings

On GLUE (RoBERTa-large), VeRA matches LoRA average dev performance while using ≈13× fewer trainable parameters.

Numbers: LoRA 0.8M params, avg 87.8 vs VeRA 0.061M params, avg 87.8

Practical Use: If you need similar GLUE accuracy, switch to VeRA to cut stored adapter size ~10–20× and keep model quality.

Evidence Ref: Table 2
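As a sanity check on these parameter counts, the arithmetic can be sketched as follows. The setup is an assumption, since this summary does not state it: RoBERTa-large with 24 layers and hidden size 1024, query and value projections adapted (48 matrices), LoRA rank 8, VeRA rank 256, and classifier-head parameters excluded. The function names are hypothetical helpers.

```python
def lora_trainable(n_matrices, d_in, d_out, rank):
    """LoRA trains two low-rank factors per adapted matrix: rank * (d_in + d_out)."""
    return n_matrices * rank * (d_in + d_out)

def vera_trainable(n_matrices, d_out, rank):
    """VeRA trains one length-rank vector and one length-d_out vector per adapted matrix."""
    return n_matrices * (rank + d_out)

# Assumed configuration (not stated in this summary).
n_matrices = 24 * 2      # query and value projections in each of 24 layers
hidden = 1024            # RoBERTa-large hidden size

lora = lora_trainable(n_matrices, hidden, hidden, rank=8)
vera = vera_trainable(n_matrices, hidden, rank=256)
ratio = lora / vera

print(f"LoRA: {lora:,}  VeRA: {vera:,}  ratio: {ratio:.1f}x")
# prints: LoRA: 786,432  VeRA: 61,440  ratio: 12.8x
```

Under these assumptions the counts round to the 0.8M and 0.061M quoted above, and the ratio lands near the stated ≈13×. Note that VeRA's count grows additively with rank, while LoRA's grows multiplicatively with the hidden size.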

On the E2E generation benchmark, VeRA slightly outperforms LoRA while using 3–4× fewer trainable parameters.

Numbers: GPT2-Medium BLEU: LoRA 68.9 (0.35M) vs VeRA 70.1 (0.098M)

Practical Use: For sequence-generation tasks, VeRA can improve metric scores and reduce adapter size by several×; use it when storage or per-task memory matters.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
GLUE average (RoBERTa-large) | 87.8 | LoRA 87.8 (0.8M params) | | GLUE (selected tasks) | Table 2: VeRA 0.061M params avg 87.8 | Table 2
GLUE average (RoBERTa-base) | 85.2 | LoRA 86.6 (0.3M params) | −1.4 avg points vs LoRA | GLUE (selected tasks) | Table 2: VeRA 0.043M params avg 85.2 vs LoRA 0.3M avg 86.6 | Table 2

What To Try In 7 Days

Re-run a small finetuning job (GLUE or a custom classification task) replacing LoRA with VeRA to compare adapter size and accuracy.

Instruction-tune a 7B model on a small Alpaca subset with VeRA to verify MT-Bench or internal judge scores and measure GPU memory.

Measure the merge-and-serve workflow: merge the VeRA vectors into the weights and confirm that inference latency is unchanged in a production testbed.
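For the merge-and-serve check in the last item, a minimal sketch of the merge step may help, assuming the merged weight is W0 + diag(b) B diag(d) A. The helper names and the random stand-in values are hypothetical; a real test would use the trained vectors and the framework's own tensors.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge_vera(W0, A, B, d, b):
    """Return W0 + diag(b) @ B @ diag(d) @ A as a list-of-rows matrix."""
    r = len(d)
    merged = []
    for i, row in enumerate(W0):
        new_row = []
        for j, w in enumerate(row):
            delta = b[i] * sum(B[i][k] * d[k] * A[k][j] for k in range(r))
            new_row.append(w + delta)
        merged.append(new_row)
    return merged

rng = random.Random(42)

def gauss_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

d_in, d_out, r = 5, 4, 3
W0, A, B = gauss_matrix(d_out, d_in), gauss_matrix(r, d_in), gauss_matrix(d_out, r)
d = [rng.gauss(0.0, 1.0) for _ in range(r)]       # stand-ins for trained vectors
b = [rng.gauss(0.0, 1.0) for _ in range(d_out)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

# Unmerged path: base output plus the adapter branch.
low = [d_k * v for d_k, v in zip(d, matvec(A, x))]
branch = [b_i * v for b_i, v in zip(b, matvec(B, low))]
unmerged = [h + dh for h, dh in zip(matvec(W0, x), branch)]

# Merged path: one matrix-vector product, the same shape of work as the base model.
merged = matvec(merge_vera(W0, A, B, d, b), x)
max_diff = max(abs(u - m) for u, m in zip(unmerged, merged))
```

The two paths should agree up to floating-point rounding, which is the property a production latency test relies on: after merging, the served layer has the same shape and cost as the original.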

Optimization Features

Infra Optimization: lower GPU memory for storing many adapters (measured −7.4% in one test)
Model Optimization: merging trained vectors into frozen weights (no extra inference cost)
System Optimization: reduce adapter storage by regenerating frozen matrices from RNG seed
Training Optimization: train only small scaling vectors, freeze shared random matrices
Inference Optimization: no additional inference latency due to weight merge
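The seed-based storage idea listed under System Optimization can be sketched as follows. This is an illustration under stated assumptions, not the authors' checkpoint format: the checkpoint layout, the function names, and the Gaussian init are hypothetical, and Python's `random` module stands in for a framework generator.

```python
import random

def frozen_matrices(seed, d_in, d_out, rank):
    """Rebuild shared frozen A (rank x d_in) and B (d_out x rank) from a seed."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(rank)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(d_out)]
    return A, B

def make_checkpoint(seed, layer_vectors):
    """What gets stored: the seed plus per-layer (d, b) vectors, never A or B."""
    return {"seed": seed, "layers": layer_vectors}

d_in = d_out = 8
rank = 4
ckpt = make_checkpoint(
    seed=1234,
    layer_vectors=[{"d": [0.1] * rank, "b": [0.0] * d_out} for _ in range(3)],
)

# Loader side: regenerate A and B deterministically, then attach stored vectors.
A1, B1 = frozen_matrices(ckpt["seed"], d_in, d_out, rank)
A2, B2 = frozen_matrices(ckpt["seed"], d_in, d_out, rank)

stored_numbers = 1 + sum(len(l["d"]) + len(l["b"]) for l in ckpt["layers"])
regenerated_numbers = rank * d_in + d_out * rank
```

Here 37 numbers are stored while the 64 entries of A and B are rebuilt on load. In a real deployment the generator implementation, library version, and device would likely need to be pinned, since bit-identical random streams are generally not guaranteed across them.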

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluated mainly on Transformer models; cross-architecture behavior unknown.

Performance is sensitive to initialization, including the initialization of the scaling vectors, and to the choice of rank.

When Not To Use

When you need exact parity with full fine-tuning and cannot tolerate even small metric drops.

If your deployment requires per-layer uniqueness of random matrices and you cannot store or reproduce RNG seeds.

Failure Modes

Poor initialization of scaling vectors can collapse adaptation (ablation shows some initializations give large drops).

Omitting either scaling vector (d or b) can substantially reduce performance.

Core Entities

Models

VeRA, LoRA, RoBERTa-base, RoBERTa-large, GPT-2 Medium, GPT-2 Large, GPT-3 (cited), Llama 7B, Llama 13B, Llama2 7B, Llama2 13B, ViT-Base, ViT-Large

Metrics

Accuracy, BLEU, NIST, METEOR, ROUGE-L, CIDEr, Pearson correlation, Matthews correlation, GPU memory (GB), Training time (min)

Datasets

GLUE, E2E, Alpaca (cleaned), MT-Bench, CIFAR100, Food101, Flowers102, RESISC45

Benchmarks

GLUE, E2E, MT-Bench, Vicuna Eval