Block-wise Adam that lets you full-finetune 8B+ LLMs on a single 24GB GPU

Overview

Decision Snapshot: Needs Validation

Experiments on multiple models and benchmarks, together with a theoretical descent result, support practical use. However, stochastic convergence is not proven, and some claims depend on the choice of block partition.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 45%

Authors

Qijun Luo, Hengxu Yu, Xiao Li

Why It Matters For Business

BAdam lets teams do full-parameter finetuning of 8B+ LLMs on single 24GB GPUs, cutting infrastructure cost and widening access to higher-quality fine-tuned models.

Who Should Care

CTO, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

BAdam is a block-coordinate-descent optimizer that runs Adam inside blocks so you only keep full-precision optimizer state for the active block. It cuts gradient/optimizer memory dramatically, enabling full-parameter finetuning of Llama-scale models on much smaller GPUs. Experiments show BAdam reduces memory and backward time vs Adam and LoRA, achieves similar or better downstream scores (MT-bench, math benchmarks), and keeps high-rank model updates. Code is on GitHub for PyTorch integration.

Problem Statement

Full-parameter finetuning with Adam needs large GPU RAM: under mixed-precision training, roughly 18GB per billion parameters for weights, gradients, and optimizer states combined. Practitioners with limited GPUs must therefore fall back on low-rank PEFT methods such as LoRA, which can limit performance. The paper asks: can we do true full-parameter finetuning with much less optimizer and gradient memory?
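The factor of roughly 18 can be reproduced with a back-of-the-envelope count. The sketch below assumes a common mixed-precision Adam layout (FP16 weights, an FP32 master copy, an FP32 gradient, and two FP32 moment buffers). The per-buffer breakdown and the name `adam_memory_gb` are illustrative assumptions, not taken from this summary, and activations are ignored.

```python
# Assumed bytes per parameter for mixed-precision Adam (illustrative breakdown).
ADAM_BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp32_master_weights": 4,
    "fp32_gradient": 4,
    "fp32_first_moment": 4,
    "fp32_second_moment": 4,
}


def adam_memory_gb(params_billion: float) -> float:
    """Estimate memory in GB (1 GB taken as 1e9 bytes), ignoring activations."""
    return params_billion * sum(ADAM_BYTES_PER_PARAM.values())


print(sum(ADAM_BYTES_PER_PARAM.values()))  # bytes per parameter
print(adam_memory_gb(8.0))                 # estimate for an 8B-parameter model
```

For an 8B-parameter model this estimate lands close to the 144.8GB+ reported for Adam in Table 2.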

Main Contribution

BAdam: a block coordinate descent (BCD) optimizer that runs K Adam steps on one parameter block at a time and clears optimizer states per block to save memory.

Memory and BP-time analysis showing BAdam stores FP32 optimizer state only for the active block and reduces backward computation for module-wise partitions.
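To see why keeping FP32 state only for the active block helps, here is a rough estimate. It assumes D equally sized blocks, FP16 weights for the whole model (2 bytes per parameter), and 16 bytes per parameter of FP32 state (master copy, gradient, two Adam moments) for the active block only. The equal-size assumption and the function name `badam_memory_gb` are simplifications introduced for illustration; real transformer blocks, embeddings, and activations will shift the numbers.

```python
def badam_memory_gb(params_billion: float, num_blocks: int) -> float:
    """Rough memory estimate in GB (1 GB = 1e9 bytes), ignoring activations.

    Assumes equally sized blocks, which real models only approximate.
    """
    fp16_weights = 2 * params_billion                    # whole model, FP16
    active_block_fp32 = 16 * params_billion / num_blocks  # FP32 state, one block
    return fp16_weights + active_block_fp32


for d in (1, 8, 32):
    print(d, badam_memory_gb(8.0, d))
```

With a single block the estimate collapses to the full mixed-precision Adam cost; with 32 blocks it drops to about 20GB for an 8B model, in the same range as the 23.5GB total reported in Table 2, which presumably also covers buffers this estimate leaves out.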

Key Findings

BAdam reduces total GPU memory needed to finetune Llama 3-8B to ~23.5GB vs ~144.8GB+ for Adam.

Numbers: 23.5GB (BAdam) vs 144.8GB+ (Adam); Table 2

Practical Use: You can full-parameter finetune an 8B model on one 24GB GPU instead of needing large multi-A100 setups.

Evidence Ref: Table 2 (Section 3.1)

BAdam cuts backward pass time roughly in half vs LoRA/LOMO for Llama 3-8B.

Numbers: Backward per epoch: 1.74h (BAdam) vs 3.20h (LoRA) and 3.70h (LOMO); Table 3

Practical Use: Fewer GPU-hours for training loops when using module-wise blocks, improving throughput on constrained hardware.

Evidence Ref: Table 3 (Section 3.1)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Total GPU memory for Llama 3-8B finetuning	23.5GB (BAdam)	144.8GB+ (Adam)	≈ -121.3GB	Alpaca-GPT4 finetuning	Table 2 reports BAdam 23.5GB vs Adam 144.8GB+	Table 2 (Section 3.1)
Backward time per epoch	1.74 hours (BAdam)	3.20 hours (LoRA)	≈ -1.46 hours	Llama 3-8B, Alpaca-GPT4	Table 3 reports backward times averaged over 3 epochs	Table 3 (Section 3.1)
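The deltas in the table can be checked, and turned into ratios, with a few lines of arithmetic. The values below are copied from the two rows above; the dictionary keys and the helper name `delta_and_ratio` are illustrative. Note that the Adam figure is reported as a lower bound (144.8GB+), so the memory saving is at least what is computed here.

```python
# (BAdam value, baseline value), copied from the results table above.
results = {
    "gpu_memory_gb": (23.5, 144.8),            # baseline: Adam (lower bound)
    "backward_hours_per_epoch": (1.74, 3.20),  # baseline: LoRA
}


def delta_and_ratio(value: float, baseline: float) -> tuple[float, float]:
    """Return (value - baseline, value / baseline)."""
    return value - baseline, value / baseline


for name, (value, baseline) in results.items():
    delta, ratio = delta_and_ratio(value, baseline)
    print(f"{name}: delta={delta:.2f}, ratio={ratio:.2f}")
```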

What To Try In 7 Days

Clone the BAdam repo and run the provided Llama-3-8B Alpaca-GPT4 script on a 24GB GPU to validate memory/time claims.

Replace LoRA in one instruction-tuning pipeline with BAdam and compare MT-bench and training time.

If you use mixed precision, test the consecutive-module block partition to speed backward passes.

Optimization Features

Infra Optimization

Enables single-GPU finetuning of 8B models (24GB)

Reduces optimizer/gradient memory vs Adam; avoids expensive CPU/GPU offload

Model Optimization

Full-parameter updates preserved (no low-rank constraint)

Learned updates retain high effective rank similar to Adam

System Optimization

Mixed precision: global FP16 weights, FP32 for active block

Supports gradient accumulation

Module-based partition reduces backward computation for blocks near the output, because backpropagation can stop once it reaches the active block
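A toy count makes the backward-time saving concrete. Backpropagation runs from the loss toward the input, so when only one block needs gradients it can stop at that block. The sketch assumes D consecutive blocks of equal backward cost, indexed from 0 at the input side, and counts how many blocks the backward pass traverses. The cost model and function names are illustrative assumptions and ignore the difference between activation-gradient and weight-gradient work.

```python
def backward_units(active_block: int, num_blocks: int) -> int:
    """Blocks traversed by backprop when `active_block` (0 = input side) is active."""
    return num_blocks - active_block


def average_backward_fraction(num_blocks: int) -> float:
    """Average backward cost over one full sweep, relative to a full backward pass."""
    total = sum(backward_units(b, num_blocks) for b in range(num_blocks))
    return total / (num_blocks * num_blocks)


print(backward_units(31, 32))         # last block: backprop through one block only
print(backward_units(0, 32))          # first block: full backward pass
print(average_backward_fraction(32))  # average over a sweep of 32 blocks
```

Under these assumptions the average works out to (D + 1) / (2D) of a full backward pass, which is consistent in spirit with the "roughly in half" backward-time finding.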

Training Optimization

Block coordinate descent: update one block at a time (D blocks)

Run K inner Adam steps per active block (K is a new hyperparameter)

Clears optimizer states after each block to save memory
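The three points above can be sketched as a toy loop in plain Python. This is not the repository's implementation: it works on a list of floats with a separable quadratic loss, and the names `grad` and `block_adam`, the loss, and all hyperparameter values are hypothetical choices for illustration. It only shows the control flow: pick one block, create fresh Adam state for it, take K Adam steps, then discard that state.

```python
import math


def grad(theta, block):
    # Gradient of the toy loss 0.5 * sum((t - 1)^2), computed only for the
    # coordinates in the active block (stands in for a partial backward pass).
    return {i: theta[i] - 1.0 for i in block}


def block_adam(theta, blocks, K=10, passes=20, lr=0.1,
               b1=0.9, b2=0.999, eps=1e-8):
    theta = list(theta)
    for _ in range(passes):
        for block in blocks:               # one active block at a time
            m = {i: 0.0 for i in block}    # fresh first moment, this block only
            v = {i: 0.0 for i in block}    # fresh second moment, this block only
            for k in range(1, K + 1):      # K inner Adam steps
                g = grad(theta, block)
                for i in block:
                    m[i] = b1 * m[i] + (1 - b1) * g[i]
                    v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
                    m_hat = m[i] / (1 - b1 ** k)   # bias correction
                    v_hat = v[i] / (1 - b2 ** k)
                    theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            # m and v are rebuilt for the next block, so no state is carried over
    return theta


start = [0.0] * 6
final = block_adam(start, blocks=[[0, 1], [2, 3], [4, 5]])
loss = 0.5 * sum((t - 1.0) ** 2 for t in final)
print(loss)
```

Because the Adam state is rebuilt for every block visit, only one block's moments exist at any time, which is where the memory saving comes from. The same restart also means K interacts with the learning rate, matching the note that K is a new hyperparameter to tune.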

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Ledzy/BAdam

Data URLs

Alpaca-GPT4 (public dataset)

MathInstruct (public dataset)

StarCoder-Python (public dataset)

SuperGLUE (public dataset)

Risks & Boundaries

Limitations

Theory covers deterministic gradients; stochastic convergence left for future work.

Best gains rely on module-consecutive block partition; arbitrary partitions may help less.

When Not To Use

If you already have large multi-GPU memory and standard Adam fits, the extra code complexity may not be worth it.

When your training regime requires optimizer state continuity across all parameters (e.g., some specialized adaptive schemes).

Failure Modes

Choosing K too large may over-optimize a block and harm generalization or slow whole-model progress.

Poor block ordering may slow early convergence (ordering matters empirically).

Core Entities

Models

Llama 3-8B, Llama 3-70B, Llama 2-7B, Llama 3.1-8B-Instruct, RoBERTa-large

Metrics

MT-bench score, Accuracy, GPU memory (GB), wall-clock time per epoch (hours)

Datasets

Alpaca-GPT4, MathInstruct, StarCoder-Python, SuperGLUE, GSM8K, MATH, Aqua, MMLU-Math, SAT-Math, NumGLUE

Benchmarks

MT-bench, GSM8K, Aqua, MMLU-Math, SAT-Math, MATH, NumGLUE, SuperGLUE

