VeRA shares frozen random matrices and learns tiny scaling vectors to cut finetuning params 10–100× with similar performance

Overview

Decision Snapshot: Ready For Pilot

VeRA is ready for prototyping and limited production: it was validated across language and vision tasks, sharply reduces stored adapter size, and has negligible inference impact. It still needs hyperparameter tuning and has so far been evaluated only on Transformer-family models.

Citations: 8

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 90%

Production readiness: 80%

Novelty: 70%

Authors

Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

Why It Matters For Business

VeRA slashes the bytes required per adapted model (10–100× less) so firms can store many personalized or task-specific adapters on the same GPU. That reduces serving costs, speeds model swap-in, and lowers storage and network bandwidth for model variants.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

VeRA is a parameter-efficient finetuning method that freezes a single pair of random matrices shared across layers and learns small per-layer scaling vectors. This lets you store and swap many adapted models cheaply: on GLUE and vision tasks VeRA is reported to match or come close to LoRA while using roughly 10× fewer trainable parameters; on instruction tuning it reduces trainable params by ~100× with similar GPT-4-judged scores. There is no extra inference cost, because the trained vectors can be merged back into the model weights.
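
Written out, the update described above takes roughly the following form. The symbols d and b are the two scaling vectors named under Failure Modes; W_0, A, B, the dimensions m, n, r, and the diagonal notation are assumptions borrowed from the usual low-rank adapter notation and are not defined in this summary.

```latex
h = W_0 x + \Delta W x,
\qquad
\Delta W = \Lambda_b \, B \, \Lambda_d \, A,
\qquad
\Lambda_b = \operatorname{diag}(b),\;
\Lambda_d = \operatorname{diag}(d)
```

Here W_0 (m × n) is the frozen pretrained weight, A (r × n) and B (m × r) are the frozen random matrices shared across layers, and only d (length r) and b (length m) are trained per layer.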

Problem Statement

Finetuning large Transformer models per task or per user creates a storage bottleneck: LoRA and similar methods cut compute but still require storing many megabytes or gigabytes per adapter. The paper asks: can we reduce the bytes needed per adapted model by orders of magnitude while keeping performance close to LoRA?

Main Contribution

Introduce VeRA: freeze a single pair of random matrices shared across adapted layers and learn tiny scaling vectors per layer.

Show VeRA matches LoRA performance on GLUE and image tasks while using an order of magnitude fewer trainable parameters.
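
A minimal sketch of the idea in the two points above, written in plain Python so it runs without an array library. The helper names (`rand_matrix`, `matvec`, `vera_delta`), the toy sizes, the uniform initialization of the frozen matrices, and the starting values d = 0.1 and b = 0 are illustrative assumptions, not details confirmed by this summary.

```python
import random

def rand_matrix(rows, cols, rng):
    """Frozen random matrix; the uniform init is an illustrative assumption."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def vera_delta(x, A, B, d, b):
    """Adapter branch for one layer: b * (B @ (d * (A @ x)))."""
    low = [d_k * v for d_k, v in zip(d, matvec(A, x))]      # scale in rank space
    return [b_i * v for b_i, v in zip(b, matvec(B, low))]   # scale in output space

rng = random.Random(0)
n_in, n_out, rank, n_layers = 6, 6, 4, 3

# One pair of random matrices, shared by every adapted layer and never trained.
A = rand_matrix(rank, n_in, rng)
B = rand_matrix(n_out, rank, rng)

# The only per-layer trainable state is two small vectors (assumed init values).
layers = [{"d": [0.1] * rank, "b": [0.0] * n_out} for _ in range(n_layers)]

x = [1.0] * n_in
delta_at_init = vera_delta(x, A, B, layers[0]["d"], layers[0]["b"])
trainable_per_layer = rank + n_out
print(trainable_per_layer, delta_at_init)
```

With b starting at zero, the adapter branch contributes nothing until training moves it, so the adapted model begins from the pretrained behavior. In practice the same structure would be expressed with tensors in a deep-learning framework.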

Key Findings

On GLUE (RoBERTa-large), VeRA matches LoRA average dev performance while using ≈13× fewer trainable parameters.

Numbers: LoRA 0.8M params avg 87.8 vs VeRA 0.061M params avg 87.8

Practical Use: If you need similar GLUE accuracy, switch to VeRA to cut stored adapter size ~10–20× and keep model quality.

Evidence Ref: Table 2
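
The parameter counts in this finding can be reproduced with simple arithmetic. The configuration below is an assumption, not something stated in this summary: 24 Transformer layers with the query and value projections adapted, hidden size 1024, LoRA rank 8, and VeRA rank 256. It was chosen because it lands on the reported 0.8M and 0.061M figures; the classification head is presumably excluded from both counts.

```python
def lora_params(n_adapted, d_in, d_out, rank):
    """LoRA trains two matrices per adapted weight: (rank x d_in) and (d_out x rank)."""
    return n_adapted * rank * (d_in + d_out)

def vera_params(n_adapted, d_out, rank):
    """VeRA trains two vectors per adapted weight: d (length rank) and b (length d_out)."""
    return n_adapted * (d_out + rank)

# Assumed RoBERTa-large setup: 24 layers x (query, value), hidden size 1024.
n_adapted = 24 * 2
hidden = 1024

lora = lora_params(n_adapted, hidden, hidden, rank=8)   # assumed LoRA rank
vera = vera_params(n_adapted, hidden, rank=256)         # assumed VeRA rank
ratio = lora / vera

print(f"LoRA: {lora:,}  VeRA: {vera:,}  ratio: {ratio:.1f}x")
```

Under these assumptions the counts come to 786,432 and 61,440, a ratio of 12.8, which is in line with the ≈13× quoted above. Note that VeRA's count grows with rank plus width rather than rank times width, which is why it can afford a much larger rank.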

On the E2E generation benchmark, VeRA slightly outperforms LoRA while using 3–4× fewer trainable parameters.

Numbers: GPT-2 Medium BLEU: LoRA 68.9 (0.35M params) vs VeRA 70.1 (0.098M params)

Practical Use: For sequence-generation tasks, VeRA can improve metric scores and reduce adapter size several-fold. Use it when storage or per-task memory matters.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GLUE average (RoBERTa-large)	87.8	LoRA 87.8 (0.8M params)	—	GLUE (selected tasks)	Table 2: VeRA 0.061M params avg 87.8	Table 2
GLUE average (RoBERTa-base)	85.2	LoRA 86.6 (0.3M params)	−1.4 avg points vs LoRA	GLUE (selected tasks)	Table 2: VeRA 0.043M params avg 85.2 vs LoRA 0.3M avg 86.6	Table 2

What To Try In 7 Days

Re-run a small finetuning job (GLUE or a custom classification task) replacing LoRA with VeRA to compare adapter size and accuracy.

Instruction-tune a 7B model on a small Alpaca subset with VeRA to verify MT-Bench or internal judge scores and measure GPU memory.

Measure merge-and-serve workflow: merge VeRA vectors into weights and confirm no inference latency change in production testbed.
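
For the merge-and-serve check in the last item, a small numerical test like the one below can be a useful starting point before moving to a production testbed. It is a sketch under stated assumptions: the matrices are random stand-ins for real weights, and `merge`, `rand_matrix`, and `matvec` are hypothetical helpers, not functions from the authors' code.

```python
import random

rng = random.Random(1)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge(W0, A, B, d, b):
    """Return W0 + diag(b) @ B @ diag(d) @ A as a plain nested list."""
    rank, n_in = len(d), len(W0[0])
    return [
        [W0[i][j] + b[i] * sum(B[i][k] * d[k] * A[k][j] for k in range(rank))
         for j in range(n_in)]
        for i in range(len(W0))
    ]

n_in, n_out, rank = 5, 4, 3
W0 = rand_matrix(n_out, n_in)                        # stand-in for a pretrained weight
A = rand_matrix(rank, n_in)                          # frozen shared matrix
B = rand_matrix(n_out, rank)                         # frozen shared matrix
d = [rng.uniform(-1.0, 1.0) for _ in range(rank)]    # stand-in for a trained d
b = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]   # stand-in for a trained b
x = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]

# Unmerged path: base output plus the adapter branch.
low = [d_k * v for d_k, v in zip(d, matvec(A, x))]
unmerged = [base + b_i * v
            for base, b_i, v in zip(matvec(W0, x), b, matvec(B, low))]

# Merged path: one matrix-vector product, the same shape of work as the base model.
merged = matvec(merge(W0, A, B, d, b), x)

max_diff = max(abs(p - q) for p, q in zip(unmerged, merged))
print(f"max abs difference: {max_diff:.2e}")
```

The two paths should agree up to floating-point rounding. With real models the tolerance would need to reflect the serving precision (for example half precision), and latency should still be measured separately.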

Optimization Features

Infra Optimization

lower GPU memory for storing many adapters (measured −7.4% in one test)

Model Optimization

merging trained vectors into frozen weights (no extra inference cost)

System Optimization

reduce adapter storage by regenerating frozen matrices from RNG seed

Training Optimization

train only small scaling vectors, freeze shared random matrices

Inference Optimization

no additional inference latency due to weight merge
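
The seed-based storage idea listed under System Optimization can be sketched as follows. Everything here is illustrative: the JSON layout, the function names `frozen_matrices`, `save_adapter`, and `load_adapter`, and the use of Python's `random` module are assumptions for the sake of a runnable example, not the authors' checkpoint format.

```python
import json
import random

def frozen_matrices(seed, n_in, n_out, rank):
    """Rebuild the shared frozen matrices deterministically from a seed."""
    rng = random.Random(seed)
    A = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(rank)]
    B = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(n_out)]
    return A, B

def save_adapter(seed, n_in, n_out, rank, layers):
    """Serialize only the seed, the shapes, and the per-layer vectors."""
    return json.dumps({"seed": seed, "n_in": n_in, "n_out": n_out,
                       "rank": rank, "layers": layers})

def load_adapter(blob):
    """Restore the vectors and regenerate A and B instead of reading them."""
    ckpt = json.loads(blob)
    A, B = frozen_matrices(ckpt["seed"], ckpt["n_in"], ckpt["n_out"], ckpt["rank"])
    return A, B, ckpt["layers"]

n_in, n_out, rank, seed = 8, 8, 4, 1234
A, B = frozen_matrices(seed, n_in, n_out, rank)
layers = [{"d": [0.1] * rank, "b": [0.0] * n_out} for _ in range(2)]

blob = save_adapter(seed, n_in, n_out, rank, layers)
A2, B2, layers2 = load_adapter(blob)

stored_values = sum(len(l["d"]) + len(l["b"]) for l in layers)   # vectors only
regenerated_values = rank * n_in + n_out * rank                  # never written to disk
print(stored_values, regenerated_values, A2 == A and B2 == B)
```

The design relies on the random generator producing identical values wherever the adapter is loaded. In a real framework that is worth verifying across library versions and devices before depending on it.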

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://dkopi.github.io/vera

Data URLs

https://huggingface.co/datasets/yahma/alpaca-cleaned

GLUE dataset

E2E dataset

Risks & Boundaries

Limitations

Evaluated mainly on Transformer models; cross-architecture behavior unknown.

Performance is sensitive to how the scaling vectors are initialized and to the choice of rank.

When Not To Use

When you need exact parity with full fine-tuning and cannot tolerate even small metric drops.

If your deployment requires per-layer uniqueness of random matrices and you cannot store or reproduce RNG seeds.

Failure Modes

Poor initialization of scaling vectors can collapse adaptation (ablation shows some initializations give large drops).

Omitting either scaling vector (d or b) can substantially reduce performance.

Core Entities

Models

VeRA, LoRA, RoBERTa-base, RoBERTa-large, GPT-2 Medium, GPT-2 Large, GPT-3 (cited), Llama 7B, Llama 13B, Llama2 7B, Llama2 13B, ViT-Base, ViT-Large

Metrics

Accuracy, BLEU, NIST, METEOR, ROUGE-L, CIDEr, Pearson correlation, Matthews correlation, GPU memory (GB), Training time (min)

Datasets

GLUE, E2E, Alpaca (cleaned), MT-Bench, CIFAR100, Food101, Flowers102, RESISC45

Benchmarks

GLUE, E2E, MT-Bench, Vicuna Eval
