Block-wise Adam that lets you full-finetune 8B+ LLMs on a single 24GB GPU

April 3, 20248 min

Overview

Decision SnapshotNeeds Validation

Experiments on multiple models and benchmarks plus a theoretical descent result support practical use, but stochastic convergence is not proven and some claims depend on block partitioning choices.

Citations1

Evidence Strength0.80

Confidence0.90

Risk Signals11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 45%

Authors

Qijun Luo, Hengxu Yu, Xiao Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

BAdam lets teams do full-parameter finetuning of 8B+ LLMs on single 24GB GPUs, cutting infrastructure cost and widening access to higher-quality fine-tuned models.

Who Should Care

Summary TLDR

BAdam is a block-coordinate-descent optimizer that runs Adam inside blocks so you only keep full-precision optimizer state for the active block. It cuts gradient/optimizer memory dramatically, enabling full-parameter finetuning of Llama-scale models on much smaller GPUs. Experiments show BAdam reduces memory and backward time vs Adam and LoRA, achieves similar or better downstream scores (MT-bench, math benchmarks), and keeps high-rank model updates. Code is on GitHub for PyTorch integration.

Problem Statement

Full-parameter finetuning with Adam needs large GPU RAM (roughly 18× model size in GB for optimizer states), so practitioners with limited GPUs must choose low-rank PEFTs like LoRA that can limit performance. The paper asks: can we do true full-parameter finetuning with much less optimizer/gradient memory?

Main Contribution

BAdam: a block coordinate descent (BCD) optimizer that runs K Adam steps on one parameter block at a time and clears optimizer states per block to save memory.

Memory and BP-time analysis showing BAdam stores FP32 optimizer state only for the active block and reduces backward computation for module-wise partitions.

Key Findings

BAdam reduces total GPU memory needed to finetune Llama 3-8B to ~23.5GB vs ~144.8GB+ for Adam.

Numbers23.5GB (BAdam) vs 144.8GB+ (Adam); Table 2

Practical UseYou can full-parameter finetune an 8B model on one 24GB GPU instead of needing large multi-A100 setups.

Evidence RefTable 2 (Section 3.1)

BAdam cuts backward pass time roughly in half vs LoRA/LOMO for Llama 3-8B.

NumbersBackward per epoch: 1.74h (BAdam) vs 3.20h (LoRA) and 3.70h (LOMO); Table 3

Practical UseFewer GPU-hours for training loops when using module-wise blocks, improving throughput on constrained hardware.

Evidence RefTable 3 (Section 3.1)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Total GPU memory for Llama 3-8B finetuning23.5GB (BAdam)144.8GB+ (Adam)≈ -121.3GBAlpaca-GPT4 finetuningTable 2 reports BAdam 23.5GB vs Adam 144.8GB+Table 2 (Section 3.1)
Backward time per epoch1.74 hours (BAdam)3.20 hours (LoRA)≈ -1.46 hoursLlama 3-8B, Alpaca-GPT4Table 3 reports backward times averaged over 3 epochsTable 3 (Section 3.1)

What To Try In 7 Days

Clone the BAdam repo and run the provided Llama-3-8B Alpaca-GPT4 script on a 24GB GPU to validate memory/time claims.

Replace LoRA in one instruction-tuning pipeline with BAdam and compare MT-bench and training time.

If you use mixed precision, test the consecutive-module block partition to speed backward passes.

Optimization Features

Infra Optimization
Enables single-GPU finetuning of 8B models (24GB)Reduces optimizer/gradient memory vs Adam; avoids expensive CPU/GPU offload
Model Optimization
Full-parameter updates preserved (no low-rank constraint)Learned updates retain high effective rank similar to Adam
System Optimization
Mixed precision: global FP16 weights, FP32 for active blockSupports gradient accumulationModule-based partition reduces backward computation for shallow blocks
Training Optimization
Block coordinate descent: update one block at a time (D blocks)Run K inner Adam steps per active block (K is a new hyperparameter)Clears optimizer states after each block to save memory

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

Alpaca-GPT4 (public dataset)MathInstruct (public dataset)StarCoder-Python (public dataset)SuperGLUE (public dataset)

Risks & Boundaries

Limitations

Theory covers deterministic gradients; stochastic convergence left for future work.

Best gains rely on module-consecutive block partition; arbitrary partitions may help less.

When Not To Use

If you already have large multi-GPU memory and standard Adam fits, the extra code complexity may not be worth it.

When your training regime requires optimizer state continuity across all parameters (e.g., some specialized adaptive schemes).

Failure Modes

Choosing K too large may over-optimize a block and harm generalization or slow whole-model progress.

Poor block ordering may slow early convergence (ordering matters empirically).

Core Entities

Models

Llama 3-8BLlama 3-70BLlama 2-7BLlama 3.1-8B-InstructRoBERTa-large

Metrics

MT-bench scoreAccuracyGPU memory (GB)wall-clock time per epoch (hours)

Datasets

Alpaca-GPT4MathInstructStarCoder-PythonSuperGLUEGSM8KMATHAquaMMLU-MathSAT-MathNumGLUE

Benchmarks

MT-benchGSM8KAquaMMLU-MathSAT-MathMATHNumGLUESuperGLUE