Survey + benchmark of memory- and parameter-efficient LLM pretraining; two small tricks cut memory ~25% while closing the gap to full-rank

May 28, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

Why It Matters For Business

Low-rank training plus two simple tricks (weight refactorization and momentum reset) uses roughly 20–25% less memory than memory-efficient optimizers such as GaLore and Fira while reaching comparable perplexity (1B model: 3.66 GB vs 4.76 GB). That lowers hardware cost and makes larger pretraining runs feasible on cheaper GPUs.

Summary TLDR

This paper surveys recent methods for memory- and parameter-efficient LLM pretraining, runs a controlled benchmark on LLaMA models up to 1B parameters on the C4 corpus, and proposes two simple improvements—weight refactorization and momentum reset—that improve low-rank training. Key takeaways: properly tuned full-rank training still gives the best perplexity; adding high-rank updates to low-rank schemes boosts performance; applying the two proposed techniques to low-rank training yields comparable perplexity to memory-efficient optimizers while using roughly 20–25% less optimizer memory on the 1B test case. Code is released.

Problem Statement

Pretraining LLMs uses huge memory for weights, activations and optimizer states. Parameter- and memory-efficient techniques (low-rank adapters, optimizer-state compression) work well for fine-tuning, but it's unclear if they can match full-model pretraining on billions of tokens without large quality loss.
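To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It assumes BF16 storage (2 bytes per value) and an Adam-style optimizer that keeps two moment tensors per parameter; activations are ignored, and the function name is hypothetical. It is not meant to reproduce the memory figures reported in the paper.

```python
def adam_training_memory_gb(n_params: int, bytes_per_value: int = 2) -> dict:
    """Estimate weight and Adam-state memory in GB (1 GB = 10**9 bytes)."""
    weights = n_params * bytes_per_value
    # Adam keeps a first and a second moment for every parameter.
    optimizer = 2 * n_params * bytes_per_value
    return {
        "weights_gb": weights / 1e9,
        "optimizer_gb": optimizer / 1e9,
        "total_gb": (weights + optimizer) / 1e9,
    }

# Illustrative round number of parameters, not an exact LLaMA size.
est = adam_training_memory_gb(1_000_000_000)
print(est)
```

Under these assumptions the optimizer states take twice the memory of the weights themselves, which is why both optimizer-state compression and parameter reduction are attractive targets.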

Main Contribution

A compact survey of recent parameter- and memory-efficient pretraining methods (optimizer projections, low-rank and sparse+low-rank factorizations, compression/quantization).

A benchmark comparing Full-Rank, GaLore, Fira, LoRA, Low-Rank, and SLTrain on LLaMA variants (60M–1B) trained on C4, using extensive hyperparameter sweeps.

Two practical improvements—weight refactorization and momentum reset—that speed convergence and improve low-rank pretraining quality.

Open-source code and reproducible benchmarks at https://github.com/OptimAI-Lab/Memory_Efficient_Pretraining.

Key Findings

Full-rank pretraining with a suitable optimizer and careful tuning achieves the best perplexity in the benchmark.

Numbers: 1B Stable-SPAM full-rank PPL 13.97 (Table 3)

Adding high-rank components to low-rank schemes improves performance across sizes.

Numbers: 1B Fira PPL 15.10 vs GaLore PPL 15.57 (Table 3)

Weight refactorization + momentum reset brings low-rank methods close to optimizer-projection methods while saving memory.

Numbers: 1B Low-Rank-Restarts PPL 15.01, Mem 3.66 GB vs GaLore/Fira Mem 4.76 GB (Table 3)

Plain low-rank training can work with careful initialization and tuning, contrary to some prior reports.

Numbers: 1B Low-Rank PPL 18.22 (this study) vs prior report 142.53 (Zhao et al.)

Results

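Perplexity (1B, full-rank Stable-SPAM)

Value: 13.97

Perplexity (1B, Fira)

Value: 15.10

Baseline: 13.97 (full-rank)

Perplexity (1B, Low-Rank-Restarts)

Value: 15.01

Baseline: 15.10 (Fira)

Memory (1B, Low-Rank-Restarts)

Value: 3.66 GB

Baseline: 4.76 GB (GaLore/Fira)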
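The "roughly 20–25%" memory claim can be checked directly from the 1B figures above. The short sketch below only does the arithmetic on the reported numbers; the helper name is hypothetical.

```python
def relative_change(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline

# Memory: Low-Rank-Restarts (3.66 GB) vs GaLore/Fira (4.76 GB)
mem_saving = relative_change(4.76, 3.66)
# Perplexity gap of Low-Rank-Restarts (15.01) above full-rank (13.97)
ppl_gap = (15.01 - 13.97) / 13.97

print(f"memory saving vs GaLore/Fira: {mem_saving:.1%}")
print(f"perplexity gap to full-rank:  {ppl_gap:.1%}")
```

The saving works out to about 23%, while the low-rank run still sits about 7% above full-rank perplexity.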

What To Try In 7 Days

Run the repo's 60M or 130M pretraining on C4 to reproduce Table 3 baselines.

Add momentum reset (every 200 updates) to your AdamW training and compare validation perplexity.
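As a minimal sketch of what a momentum reset could look like, the NumPy loop below implements plain Adam (weight decay omitted for brevity) and zeroes both moment estimates every `reset_every` steps. Restarting the bias-correction counter at each reset is an assumption of this sketch, as are the function name and the toy objective; the released code may handle these details differently.

```python
import numpy as np

def adam_with_moment_reset(grad_fn, w, steps, lr=0.02, beta1=0.9, beta2=0.999,
                           eps=1e-8, reset_every=200):
    """Plain Adam that zeroes both moments every `reset_every` steps."""
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    t = 0  # steps since the last reset; used for bias correction (assumption)
    for step in range(1, steps + 1):
        g = grad_fn(w)
        t += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        if step % reset_every == 0:
            # Momentum reset: discard the accumulated first and second moments.
            m = np.zeros_like(w)
            v = np.zeros_like(w)
            t = 0
    return w, m, v

# Toy objective ||w - target||^2, used only to exercise the loop.
target = np.array([1.0, -2.0, 3.0])
grad = lambda w: 2.0 * (w - target)
w, m, v = adam_with_moment_reset(grad, np.zeros(3), steps=2000)
print(np.round(w, 2))
```

In a framework optimizer the same idea would amount to clearing the per-parameter state on a schedule; the reset interval is a hyperparameter worth sweeping rather than fixing at 200.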

Apply weight refactorization to low-rank factor matrices every ~200 updates and measure convergence speed and memory use.
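One plausible reading of SVD-based refactorization is to keep the product W = BA unchanged while replacing B and A with balanced factors. The sketch below does that with NumPy: it takes a thin QR of each factor, runs an SVD on the small r x r core, and splits the singular values evenly between the two new factors. The function name and the exact balancing rule are assumptions, not the paper's confirmed procedure.

```python
import numpy as np

def refactorize(B, A):
    """Return balanced factors (B_new, A_new) with B_new @ A_new == B @ A."""
    # Thin QR of each factor so the SVD runs on an r x r core,
    # not on the full m x n product.
    Qb, Rb = np.linalg.qr(B)      # B   = Qb @ Rb, Qb is (m, r)
    Qa, Ra = np.linalg.qr(A.T)    # A.T = Qa @ Ra, Qa is (n, r)
    U, s, Vt = np.linalg.svd(Rb @ Ra.T)
    root = np.sqrt(s)
    B_new = (Qb @ U) * root                 # scale columns by sqrt(singular values)
    A_new = (root[:, None] * Vt) @ Qa.T     # scale rows by sqrt(singular values)
    return B_new, A_new

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 2)) * 10.0   # deliberately badly scaled factors
A = rng.normal(size=(2, 6)) * 0.1
B_new, A_new = refactorize(B, A)
print(np.allclose(B_new @ A_new, B @ A))
```

After this step both factors carry the same scale (B_new.T @ B_new equals A_new @ A_new.T), which is the kind of rebalancing that can make subsequent gradient steps better conditioned.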

Optimization Features

Infra Optimization

  • BF16 training
  • Activation compression/quantization (surveyed)

Model Optimization

  • Low-rank factorization (W = BA)
  • Sparse + low-rank (SLTrain)
  • LoRA
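To illustrate why the W = BA factorization saves parameters, the sketch below compares the parameter count of a dense layer with that of its rank-r factorization. The layer size and rank are illustrative assumptions, not values taken from the benchmark.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Parameters in a full d_out x d_in weight matrix."""
    return d_out * d_in

def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in W = B @ A with B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

# Hypothetical layer: 2048 x 2048 with rank 512.
full = dense_params(2048, 2048)
low = low_rank_params(2048, 2048, rank=512)
print(full, low, low / full)
```

The factorization only pays off when the rank is below d_out * d_in / (d_out + d_in); for a square layer that means less than half the width.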

System Optimization

  • Gradient projection (GaLore family)
  • Block-wise optimizer updates (BAdam, Adam-mini)
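The gradient-projection idea can be sketched as follows: project the full gradient onto a rank-r subspace, keep optimizer state only for the small projected matrix, and map the update back to full size. This is a simplified illustration under stated assumptions (projector taken from the gradient's top left singular vectors, no optimizer step shown, hypothetical function names), not the GaLore implementation.

```python
import numpy as np

def top_r_projector(grad, rank):
    """Orthonormal basis (m x r) for the top-r left singular subspace of grad."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]

def project(grad, P):
    """Compress an (m x n) gradient to (r x n); optimizer state lives at this size."""
    return P.T @ grad

def project_back(low_rank_update, P):
    """Map an (r x n) update back to the full (m x n) weight shape."""
    return P @ low_rank_update

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))        # stand-in for a layer gradient
P = top_r_projector(G, rank=4)
R = project(G, P)
residual = G - project_back(R, P)    # the part plain projection throws away
print(G.shape, R.shape, np.linalg.norm(residual) / np.linalg.norm(G))
```

The residual is the gradient component outside the subspace. Methods that add a high-rank term back are, in effect, trying to recover some of that discarded signal.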

Training Optimization

  • Weight refactorization (periodic SVD-based rebalance)
  • Momentum reset (zero moments every 200 steps)
  • Stable-SPAM techniques (gradient normalization, adaptive clipping)

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks run only up to 1B parameters; results may not transfer to multi-billion models.
  • Single dataset (C4) evaluated; domain shifts not tested.
  • Quantization and compression methods were surveyed but not included in the benchmark.
  • Hyperparameter search for 1B was limited due to compute constraints.

When Not To Use

  • When you can afford full-rank training and need the absolute best perplexity (full-rank Stable-SPAM was best).
  • When you need evidence beyond C4 perplexity (other domains or downstream fine-tuned tasks) and cannot re-evaluate the methods on your own data.

Failure Modes

  • Low-rank methods can show unstable training on larger models unless carefully initialized and tuned.
  • Projection-based optimizers (GaLore) discard the part of the gradient outside the projected subspace and may scale worse to larger models.
  • Momentum reset frequency and refactorization schedule are hyperparameters that can hurt instead of help if mis-set.

Core Entities

Models

  • LLaMA
  • LoRA
  • Low-Rank
  • SLTrain
  • GaLore
  • Fira
  • Stable-SPAM

Metrics

  • Perplexity
  • Memory (GB)
  • Parameter count (M)
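For readers comparing the reported numbers with their own training logs, perplexity is the exponential of the mean per-token negative log-likelihood (natural log). A minimal sketch, with a hypothetical helper name:

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that is uniformly unsure among 4 tokens has NLL = ln(4) per token.
ppl = perplexity([math.log(4)] * 10)
print(ppl)
```

Read the other way, a perplexity of 13.97 corresponds to a mean cross-entropy loss of about 2.64 nats per token.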

Datasets

  • C4

Benchmarks

  • Perplexity