Survey + benchmark of memory- and parameter-efficient LLM pretraining; two small tricks cut memory ~25% while closing the gap to full-rank

May 28, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides controlled benchmarks up to 1B models and clear numeric comparisons; methods are practical but need wider validation at larger scales and on other corpora.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

Links

Abstract / PDF / Code / Data

Why It Matters For Business

On the 1B test case, low-rank training plus two simple tricks (weight refactorization and momentum reset) uses roughly 20–25% less optimizer memory while reaching perplexity comparable to memory-efficient optimizers. That reduces hardware cost and makes larger pretraining runs feasible on cheaper GPUs.

Who Should Care

Summary TLDR

This paper surveys recent methods for memory- and parameter-efficient LLM pretraining, runs a controlled benchmark on LLaMA models up to 1B parameters on the C4 corpus, and proposes two simple improvements—weight refactorization and momentum reset—that improve low-rank training. Key takeaways: properly tuned full-rank training still gives the best perplexity; adding high-rank updates to low-rank schemes boosts performance; applying the two proposed techniques to low-rank training yields comparable perplexity to memory-efficient optimizers while using roughly 20–25% less optimizer memory on the 1B test case. Code is released.

Problem Statement

Pretraining LLMs uses huge memory for weights, activations and optimizer states. Parameter- and memory-efficient techniques (low-rank adapters, optimizer-state compression) work well for fine-tuning, but it's unclear if they can match full-model pretraining on billions of tokens without large quality loss.
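To make the optimizer-state part of that cost concrete, here is a rough back-of-the-envelope sketch. It assumes an Adam-style optimizer that keeps two moment values per trainable parameter in fp32; the layer shape and rank are hypothetical and are not taken from the paper.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Adam/AdamW keeps two moment estimates (m and v) per trainable parameter."""
    return 2 * num_params * bytes_per_value


def full_rank_params(m: int, n: int) -> int:
    """Trainable parameters of a dense m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Trainable parameters of W = B A with B (m x r) and A (r x n)."""
    return r * (m + n)


# Hypothetical layer shape and rank, chosen only for illustration.
m, n, r = 2048, 2048, 256

full = adam_state_bytes(full_rank_params(m, n))
low = adam_state_bytes(low_rank_params(m, n, r))

print(f"full-rank Adam state: {full / 2**20:.1f} MiB")
print(f"low-rank Adam state:  {low / 2**20:.1f} MiB")
print(f"ratio: {low / full:.2f}")
```

This counts optimizer state only; weights, gradients and activations add to the total and are not modeled here.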

Main Contribution

A compact survey of recent parameter- and memory-efficient pretraining methods (optimizer projections, low-rank and sparse+low-rank factorizations, compression/quantization).

A benchmark comparing Full-Rank, GaLore, Fira, LoRA, Low-Rank, and SLTrain on LLaMA variants (60M–1B) trained on C4, using extensive hyperparameter sweeps.

Key Findings

Full-rank pretraining with a proper optimizer and tuning achieves the best perplexity in the benchmark.

Numbers: 1B Stable-SPAM full-rank PPL 13.97 (Table 3)

Practical Use: If you can afford full-rank training with tuning, use it as the baseline; it still gives the best language-model quality in the evaluated settings.

Evidence Ref: Table 3

Adding high-rank components to low-rank schemes improves performance across sizes.

Numbers: 1B Fira PPL 15.10 vs GaLore PPL 15.57 (Table 3)

Practical Use: Prefer methods that restore full-rank updates (e.g., Fira, SLTrain) over pure low-rank projection when quality matters.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity | 13.97 | - | - | C4, 1B LLaMA | Stable-SPAM full-rank best PPL in Table 3 | Table 3
Perplexity | 15.10 | 13.97 (full-rank) | +1.13 | C4, 1B LLaMA | Fira optimizer-projection result in Table 3 | Table 3

What To Try In 7 Days

Run the repo's 60M or 130M pretraining on C4 to reproduce Table 3 baselines.

Add momentum reset (every 200 updates) to your AdamW training and compare validation perplexity.

Apply weight refactorization to low-rank factor matrices every ~200 updates and measure convergence speed and memory use.
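For the weight refactorization step, a minimal NumPy sketch is given below. It is one plausible reading of "SVD-based rebalance", not the paper's released code: the factors B and A are replaced by a balanced pair with the same product. The function name `refactorize` and the QR shortcut (which avoids forming the full m x n product) are assumptions made for this illustration.

```python
import numpy as np


def refactorize(B: np.ndarray, A: np.ndarray):
    """Return balanced factors (B_new, A_new) with B_new @ A_new == B @ A.

    B has shape (m, r) and A has shape (r, n), with r <= min(m, n).
    """
    # Thin QR of each factor, so the SVD runs on a small r x r core.
    Qb, Rb = np.linalg.qr(B)       # B = Qb @ Rb
    Qa, Ra = np.linalg.qr(A.T)     # A.T = Qa @ Ra, so A = Ra.T @ Qa.T
    U, s, Vt = np.linalg.svd(Rb @ Ra.T)
    sqrt_s = np.sqrt(s)
    # Split each singular value evenly between the two factors.
    B_new = (Qb @ U) * sqrt_s                # scales the columns
    A_new = (sqrt_s[:, None] * Vt) @ Qa.T    # scales the rows
    return B_new, A_new


# Toy usage with deliberately unbalanced factors.
rng = np.random.default_rng(0)
B = 100.0 * rng.standard_normal((16, 4))
A = 0.01 * rng.standard_normal((4, 12))
B_bal, A_bal = refactorize(B, A)
print(np.linalg.norm(B), np.linalg.norm(A))          # very different scales
print(np.linalg.norm(B_bal), np.linalg.norm(A_bal))  # equal Frobenius norms
```

In a training loop this would be called every ~200 updates, as suggested above. How to treat the optimizer state for B and A after the swap is left out of this sketch.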

Optimization Features

Infra Optimization
BF16 training; Activation compression/quantization (surveyed)
Model Optimization
Low-rank factorization (W = BA); Sparse + low-rank (SLTrain); LoRA
System Optimization
Gradient projection (GaLore family); Block-wise optimizer updates (BAdam, Adam-mini)
Training Optimization
Weight refactorization (periodic SVD-based rebalance); Momentum reset (zero moments every 200 steps); Stable-SPAM techniques (gradient normalization, adaptive clipping)
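Momentum reset, as listed above, can be sketched with a small hand-written AdamW. This is an illustrative NumPy toy, not the paper's implementation. The class name `AdamWWithReset` is hypothetical, and restarting the bias-correction counter together with the moments is an assumption of this sketch.

```python
import numpy as np


class AdamWWithReset:
    """Minimal AdamW for one parameter array, with periodic moment reset."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0, reset_every=200):
        self.lr, self.betas, self.eps = lr, betas, eps
        self.weight_decay = weight_decay
        self.reset_every = reset_every
        self.m = None      # first moment
        self.v = None      # second moment
        self.t = 0         # steps since the last reset (bias correction)
        self.updates = 0   # total updates taken

    def step(self, param, grad):
        if self.m is None:
            self.m = np.zeros_like(param)
            self.v = np.zeros_like(param)
        b1, b2 = self.betas
        self.t += 1
        self.updates += 1
        self.m = b1 * self.m + (1 - b1) * grad
        self.v = b2 * self.v + (1 - b2) * grad ** 2
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)
        # Decoupled weight decay, then the Adam update.
        param = param * (1 - self.lr * self.weight_decay)
        param = param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        if self.updates % self.reset_every == 0:
            # Assumption: restart bias correction together with the moments.
            self.m = np.zeros_like(param)
            self.v = np.zeros_like(param)
            self.t = 0
        return param


# Toy usage: minimize 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0])
opt = AdamWWithReset(lr=0.1, reset_every=200)
for _ in range(10):
    x = opt.step(x, grad=x)
print(x)
```

With a framework optimizer, the equivalent would be to zero the stored moment buffers on the same schedule; the exact state layout depends on the framework and is not shown here.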

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmarks run only up to 1B parameters; results may not transfer to multi-billion models.

Single dataset (C4) evaluated; domain shifts not tested.

When Not To Use

When you can afford full-rank training and need the absolute best perplexity (full-rank Stable-SPAM was best).

When your target domain differs from C4 or your workflow depends on downstream fine-tuning, unless you re-evaluate the method in that setting first.

Failure Modes

Low-rank methods can show unstable training on larger models unless carefully initialized and tuned.

Projection-based optimizers (GaLore) lose gradient information and may scale worse.
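The information loss in projection-based optimizers can be illustrated with a small NumPy sketch. It shows only the generic idea of projecting a gradient onto a rank-r subspace, in the spirit of the GaLore family. It is not GaLore's or Fira's actual algorithm, and the function name `project_gradient` is hypothetical.

```python
import numpy as np


def project_gradient(G: np.ndarray, r: int):
    """Split G into the part a rank-r projection keeps and the part it drops."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :r]           # m x r orthonormal basis of the top-r subspace
    low = P.T @ G          # r x n: the compressed gradient the optimizer sees
    kept = P @ low         # projected back to the full m x n shape
    residual = G - kept    # information a pure projection method discards
    return kept, residual


# Toy usage on a random "gradient".
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 6))
kept, residual = project_gradient(G, r=2)
print(np.linalg.norm(kept), np.linalg.norm(residual))
```

Methods that the digest describes as restoring full-rank updates would, in some form, reuse the residual term rather than drop it; how they do so is not reproduced here.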

Core Entities

Models

LLaMA, LoRA, Low-Rank, SLTrain, GaLore, Fira, Stable-SPAM

Metrics

Perplexity, Memory (GB), Parameter count (M)

Datasets

C4

Benchmarks

Perplexity