Overview
Experiments on multiple models and benchmarks plus a theoretical descent result support practical use, but stochastic convergence is not proven and some claims depend on block partitioning choices.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 70%
Novelty: 45%
Why It Matters For Business
BAdam lets teams do full-parameter finetuning of 8B+ LLMs on single 24GB GPUs, cutting infrastructure cost and widening access to higher-quality fine-tuned models.
Summary TLDR
BAdam is a block-coordinate-descent optimizer that runs Adam inside blocks so you only keep full-precision optimizer state for the active block. It cuts gradient/optimizer memory dramatically, enabling full-parameter finetuning of Llama-scale models on much smaller GPUs. Experiments show BAdam reduces memory and backward time vs Adam and LoRA, achieves similar or better downstream scores (MT-bench, math benchmarks), and keeps high-rank model updates. Code is on GitHub for PyTorch integration.
Problem Statement
Full-parameter finetuning with Adam in mixed precision needs large GPU memory, roughly 18GB per billion parameters once weights, gradients, and optimizer states are counted (about 144GB for an 8B model), so practitioners with limited GPUs often fall back on low-rank PEFT methods such as LoRA, which can limit performance. The paper asks: can we do true full-parameter finetuning with much less optimizer and gradient memory?
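The memory gap can be approximated with a back-of-the-envelope calculation. The sketch below is illustrative only and rests on assumptions not stated in this summary: FP16 weights cost 2 bytes per parameter; the FP32 master weights, gradient, and two Adam moments cost 16 bytes per parameter in total; blocks are equally sized; 1GB is 10^9 bytes; and activations are ignored. The function names, the 8e9 parameter count, and the choice of 32 blocks are hypothetical.

```python
# Rough memory estimate, ignoring activations and framework overhead.
# ASSUMPTIONS (not from the summary): 2 bytes/param for FP16 weights,
# 16 bytes/param for FP32 master weights + gradient + two Adam moments.
FP16_WEIGHT_BYTES = 2
FP32_STATE_BYTES = 16
BYTES_PER_GB = 1e9


def adam_memory_gb(n_params: float) -> float:
    """Adam keeps FP32 state for every parameter at once."""
    return n_params * (FP16_WEIGHT_BYTES + FP32_STATE_BYTES) / BYTES_PER_GB


def block_adam_memory_gb(n_params: float, n_blocks: int) -> float:
    """Block-wise Adam keeps FP32 state only for the active block."""
    active_params = n_params / n_blocks  # assumes equally sized blocks
    total_bytes = n_params * FP16_WEIGHT_BYTES + active_params * FP32_STATE_BYTES
    return total_bytes / BYTES_PER_GB


n_params = 8e9   # hypothetical round figure for an 8B model
n_blocks = 32    # hypothetical block count

print(f"Adam:       {adam_memory_gb(n_params):.1f} GB")
print(f"Block-wise: {block_adam_memory_gb(n_params, n_blocks):.1f} GB")
```

Under these assumptions the estimate for Adam lands close to the 144.8GB+ reported for Llama 3-8B, while the block-wise estimate comes out below the reported 23.5GB, which is expected because the measured total also includes activations and other overhead.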
Main Contribution
BAdam: a block coordinate descent (BCD) optimizer that runs K Adam steps on one parameter block at a time and clears optimizer states per block to save memory.
Memory and BP-time analysis showing BAdam stores FP32 optimizer state only for the active block and reduces backward computation for module-wise partitions.
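The block-wise update described above can be sketched on a toy problem. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the separable quadratic objective, the function names (`loss`, `block_grad`, `block_adam`), and all hyperparameter values are hypothetical. It shows the two ideas named in the contribution: K Adam steps on one block at a time, and fresh optimizer state for each block visit.

```python
import math


def loss(x, targets, curvs):
    """Toy objective: 0.5 * sum_i c_i * (x_i - t_i)^2."""
    return 0.5 * sum(c * (xi - t) ** 2 for xi, t, c in zip(x, targets, curvs))


def block_grad(x, targets, curvs, idxs):
    """Gradient restricted to the active block's coordinates only."""
    return [curvs[i] * (x[i] - targets[i]) for i in idxs]


def block_adam(x, targets, curvs, blocks, K=50, epochs=5, lr=0.05,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Cycle over blocks; run K Adam steps per block with fresh state."""
    x = list(x)  # do not mutate the caller's list
    for _ in range(epochs):
        for idxs in blocks:
            # Optimizer state exists only for the active block.
            m = [0.0] * len(idxs)
            v = [0.0] * len(idxs)
            for step in range(1, K + 1):
                g = block_grad(x, targets, curvs, idxs)
                for j, i in enumerate(idxs):
                    m[j] = beta1 * m[j] + (1 - beta1) * g[j]
                    v[j] = beta2 * v[j] + (1 - beta2) * g[j] ** 2
                    m_hat = m[j] / (1 - beta1 ** step)  # bias correction
                    v_hat = v[j] / (1 - beta2 ** step)
                    x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            # m and v go out of scope here, so no state carries over.
    return x


targets = [1.0, -2.0, 0.5, 3.0]
curvs = [1.0, 2.0, 0.5, 4.0]
x0 = [0.0, 0.0, 0.0, 0.0]
blocks = [[0, 1], [2, 3]]  # hypothetical two-block partition

x_final = block_adam(x0, targets, curvs, blocks)
print("initial loss:", loss(x0, targets, curvs))
print("final loss:  ", loss(x_final, targets, curvs))
```

In this sketch, `K` and the order of `blocks` are the two knobs that the failure modes listed later refer to, so varying them on a toy problem is a cheap way to build intuition before running a full finetuning job.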
Key Findings
BAdam reduces total GPU memory needed to finetune Llama 3-8B to ~23.5GB vs ~144.8GB+ for Adam.
BAdam roughly halves backward-pass time relative to LoRA and LOMO for Llama 3-8B (1.74 vs 3.20 hours per epoch against LoRA).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total GPU memory for Llama 3-8B finetuning | 23.5GB (BAdam) | 144.8GB+ (Adam) | ≈ -121.3GB | Alpaca-GPT4 finetuning | Table 2 reports BAdam 23.5GB vs Adam 144.8GB+ | Table 2 (Section 3.1) |
| Backward time per epoch | 1.74 hours (BAdam) | 3.20 hours (LoRA) | ≈ -1.46 hours | Llama 3-8B, Alpaca-GPT4 | Table 3 reports backward times averaged over 3 epochs | Table 3 (Section 3.1) |
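The absolute deltas in the table can also be read as relative reductions. The short sketch below recomputes them from the table values; the helper name `relative_reduction` is hypothetical. Because the Adam figure is reported as a lower bound (144.8GB+), the memory reduction computed here is also a lower bound.

```python
def relative_reduction(value: float, baseline: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# Values taken from the results table above.
memory_reduction = relative_reduction(23.5, 144.8)    # GB, BAdam vs Adam
backward_reduction = relative_reduction(1.74, 3.20)   # hours, BAdam vs LoRA

print(f"memory reduction:        {memory_reduction:.1%}")
print(f"backward-time reduction: {backward_reduction:.1%}")
```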
What To Try In 7 Days
Clone the BAdam repo and run the provided Llama-3-8B Alpaca-GPT4 script on a 24GB GPU to validate memory/time claims.
Replace LoRA in one instruction-tuning pipeline with BAdam and compare MT-bench and training time.
If you use mixed precision, test the consecutive-module block partition to speed backward passes.
Risks & Boundaries
Limitations
Theory covers deterministic gradients; stochastic convergence left for future work.
Best gains rely on module-consecutive block partition; arbitrary partitions may help less.
When Not To Use
If you already have large multi-GPU memory and standard Adam fits, the extra code complexity may not be worth it.
If your training regime requires optimizer state continuity across all parameters (e.g., some specialized adaptive schemes), clearing state per block conflicts with it.
Failure Modes
Choosing K too large may over-optimize a block and harm generalization or slow whole-model progress.
Poor block ordering may slow early convergence (ordering matters empirically).

