Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Low-rank training plus two simple tricks (weight refactorization and momentum reset) reaches perplexity comparable to memory-efficient optimizers while using roughly 20–25% less optimizer memory on the 1B test case. That lowers hardware cost and makes larger pretraining runs feasible on cheaper GPUs.
Summary TLDR
This paper surveys recent methods for memory- and parameter-efficient LLM pretraining, runs a controlled benchmark on LLaMA models up to 1B parameters on the C4 corpus, and proposes two simple improvements—weight refactorization and momentum reset—that improve low-rank training. Key takeaways: properly tuned full-rank training still gives the best perplexity; adding high-rank updates to low-rank schemes boosts performance; applying the two proposed techniques to low-rank training yields comparable perplexity to memory-efficient optimizers while using roughly 20–25% less optimizer memory on the 1B test case. Code is released.
Problem Statement
Pretraining LLMs uses huge memory for weights, activations and optimizer states. Parameter- and memory-efficient techniques (low-rank adapters, optimizer-state compression) work well for fine-tuning, but it's unclear if they can match full-model pretraining on billions of tokens without large quality loss.
Main Contribution
A compact survey of recent parameter- and memory-efficient pretraining methods (optimizer projections, low-rank and sparse+low-rank factorizations, compression/quantization).
A benchmark comparing Full-Rank, GaLore, Fira, LoRA, Low-Rank, and SLTrain on LLaMA variants (60M–1B) trained on C4, using extensive hyperparameter sweeps.
Two practical improvements—weight refactorization and momentum reset—that speed convergence and improve low-rank pretraining quality.
Open-source code and reproducible benchmarks at https://github.com/OptimAI-Lab/Memory_Efficient_Pretraining.
Key Findings
Full-rank pretraining with a well-chosen optimizer and careful tuning achieves the best perplexity in the benchmark.
Adding high-rank components to low-rank schemes improves performance across sizes.
Weight refactorization + momentum reset brings low-rank methods close to optimizer-projection methods while saving memory.
Plain low-rank training can work with careful initialization and tuning, contrary to some prior reports.
Results
- Perplexity
- Memory
Who Should Care
What To Try In 7 Days
Run the repo's 60M or 130M pretraining on C4 to reproduce Table 3 baselines.
Add momentum reset (every 200 updates) to your AdamW training and compare validation perplexity.
Apply weight refactorization to low-rank factor matrices every ~200 updates and measure convergence speed and memory use.
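The momentum-reset step above can be sketched as follows. This is a minimal illustration in pure Python, not the repository's implementation: it uses plain Adam (weight decay omitted for brevity), the class name `AdamWithReset` and the parameter `reset_every` are hypothetical, and restarting the bias-correction counter after each reset is an assumption rather than something the summary states.

```python
import math


class AdamWithReset:
    """Plain Adam on a flat list of floats, with periodic moment reset.

    Hypothetical sketch: `reset_every=200` mirrors the interval quoted in
    the text; everything else is an illustrative assumption.
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 reset_every=200):
        self.params = list(params)
        self.lr, self.betas, self.eps = lr, betas, eps
        self.reset_every = reset_every
        self.m = [0.0] * len(self.params)  # first moments
        self.v = [0.0] * len(self.params)  # second moments
        self.t = 0                         # steps since the last reset
        self.total_steps = 0

    def step(self, grads):
        b1, b2 = self.betas
        self.t += 1
        self.total_steps += 1
        for i, g in enumerate(grads):
            self.m[i] = b1 * self.m[i] + (1 - b1) * g
            self.v[i] = b2 * self.v[i] + (1 - b2) * g * g
            # Bias correction uses steps since the last reset (assumption).
            m_hat = self.m[i] / (1 - b1 ** self.t)
            v_hat = self.v[i] / (1 - b2 ** self.t)
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
        # Zero both moments every `reset_every` updates.
        if self.total_steps % self.reset_every == 0:
            self.m = [0.0] * len(self.params)
            self.v = [0.0] * len(self.params)
            self.t = 0


# Usage on a toy quadratic f(x) = sum(x_i^2), whose gradient is 2 * x_i.
opt = AdamWithReset([1.0, -2.0], lr=0.01, reset_every=200)
for _ in range(400):
    opt.step([2 * p for p in opt.params])
```

To compare against a baseline, run the same loop with `reset_every` set larger than the total number of steps, so no reset ever fires, and compare validation perplexity between the two runs.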
Optimization Features
Infra Optimization
- BF16 training
- Activation compression/quantization (surveyed)
Model Optimization
- Low-rank factorization (W = BA)
- Sparse + low-rank (SLTrain)
- LoRA
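To make the W = BA factorization concrete, the arithmetic below counts trainable parameters and Adam moment entries for one weight matrix. It is an illustrative sketch only: the function names and the layer dimensions are hypothetical, and it does not reproduce the memory figures reported in the paper.

```python
def lowrank_counts(d_out, d_in, rank):
    """Return (full_params, lowrank_params) for a d_out x d_in weight.

    Full rank stores W directly; the low-rank form stores
    B (d_out x rank) and A (rank x d_in) with W = B @ A.
    """
    full = d_out * d_in
    low = rank * (d_out + d_in)
    return full, low


def adam_state_count(n_params):
    """Adam keeps a first and a second moment per trainable parameter."""
    return 2 * n_params


# Hypothetical layer size, chosen only to make the arithmetic easy to follow.
full, low = lowrank_counts(d_out=2048, d_in=2048, rank=256)
ratio = low / full
```

Because Adam state scales with the number of trainable parameters, shrinking the factors shrinks the optimizer state in the same proportion for that layer. Whole-model savings depend on which layers are factorized and at what rank.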
System Optimization
- Gradient projection (GaLore family)
- Block-wise optimizer updates (BAdam, Adam-mini)
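The gradient-projection idea behind the GaLore family can be sketched roughly as below, assuming NumPy is available. The function names are hypothetical and the sketch leaves out details a real optimizer needs, such as how often the projector is refreshed and how Adam is applied in the projected space.

```python
import numpy as np


def projector_from_grad(grad, rank):
    """Top-`rank` left singular vectors of an m x n gradient (shape m x rank)."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def project(grad, p):
    """Compress the gradient to rank x n; optimizer state lives at this size."""
    return p.T @ grad


def project_back(low_rank_update, p):
    """Map a rank x n update back to the full m x n weight shape."""
    return p @ low_rank_update


rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 6))
p = projector_from_grad(grad, rank=2)
small = project(grad, p)            # shape (2, 6)
restored = project_back(small, p)   # shape (8, 6), an approximation of grad
```

The round trip keeps only the part of the gradient that lies in the chosen subspace, which is the information loss noted for projection-based optimizers under Failure Modes.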
Training Optimization
- Weight refactorization (periodic SVD-based rebalance)
- Momentum reset (zero moments every 200 steps)
- Stable-SPAM techniques (gradient normalization, adaptive clipping)
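One plausible reading of "periodic SVD-based rebalance" is sketched below, assuming NumPy is available. This is an assumption about the general shape of the technique, not the paper's exact procedure: the function name `refactorize` is hypothetical, and forming the full product B @ A is done here only for clarity.

```python
import numpy as np


def refactorize(b, a):
    """Rebalance low-rank factors so that W = b @ a is unchanged.

    Takes the SVD of the product and splits each singular value evenly
    between the two new factors (assumed scheme, for illustration).
    """
    r = b.shape[1]
    u, s, vt = np.linalg.svd(b @ a, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    new_b = u[:, :r] * sqrt_s          # scale columns of U
    new_a = sqrt_s[:, None] * vt[:r]   # scale rows of V^T
    return new_b, new_a


# Badly scaled factors: B is large, A is tiny, yet the product is moderate.
rng = np.random.default_rng(0)
b = rng.standard_normal((8, 3)) * 100.0
a = rng.standard_normal((3, 6)) * 0.01
new_b, new_a = refactorize(b, a)
```

After the call, the two factors carry equal scale, while the represented weight matrix is the same up to floating-point error. In a training loop this would run on a schedule, for example every ~200 updates as suggested under "What To Try In 7 Days".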
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmarks run only up to 1B parameters; results may not transfer to multi-billion models.
- Single dataset (C4) evaluated; domain shifts not tested.
- Quantization and compression methods were surveyed but not included in the benchmark.
- Hyperparameter search for 1B was limited due to compute constraints.
When Not To Use
- When you can afford full-rank training and need the absolute best perplexity (full-rank Stable-SPAM was best).
- When your workflow depends on data beyond C4 or on downstream fine-tuning and you cannot re-evaluate in that setting.
Failure Modes
- Low-rank methods can show unstable training on larger models unless carefully initialized and tuned.
- Projection-based optimizers (GaLore) lose gradient information and may scale worse.
- Momentum reset frequency and refactorization schedule are hyperparameters that can hurt instead of help if mis-set.
Core Entities
Models
- LLaMA
- LoRA
- Low-Rank
- SLTrain
- GaLore
- Fira
- Stable-SPAM
Metrics
- Perplexity
- Memory (GB)
- Parameter count (M)
Datasets
- C4
Benchmarks
- Perplexity

