Benchmarking zeroth-order (no-backprop) optimizers to cut LLM fine-tuning memory and explore practical trade-offs

Overview

Decision Snapshot: Needs Validation

The benchmark provides solid memory and accuracy numbers across models and PEFT schemes, but ZO accuracy is variable and sensitive to query budget, task alignment, and model/task scale.

Citations: 7

Evidence Strength: 0.60

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

Links

Abstract / PDF / Code

Why It Matters For Business

ZO methods let fine-tuning of multi-billion-parameter LLMs run with much lower peak memory, often on a single high-memory GPU, which reduces cloud costs and enables training in constrained environments. Expect accuracy or compute trade-offs on hard tasks.

Who Should Care

ML Engineer, CTO, Founder, Product Manager

Summary TLDR

This paper runs the first broad benchmark of zeroth-order (ZO) optimizers for fine-tuning large language models without back-propagation. ZO methods (which estimate gradients from loss values) sharply cut peak memory and can let multi-billion-parameter models fit on a single GPU, but they are slower and less accurate on harder tasks. The authors evaluate multiple ZO variants, show that prompt-style task alignment matters, introduce block-wise ZO, hybrid ZO–FO training and sparse perturbations to reduce variance, and publish code to reproduce results.
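To make "estimating gradients from loss values" concrete, here is a minimal sketch of a two-point randomized gradient estimate driving a plain ZO-SGD loop. It is an illustration under stated assumptions, not the benchmark's implementation: the toy quadratic `loss`, the function names, and the values of `lr`, `mu`, and `steps` are all hypothetical stand-ins chosen for this example.

```python
import random

def loss(w):
    # Toy stand-in for a fine-tuning loss (assumption): quadratic with minimum at (1, -2, 3).
    target = [1.0, -2.0, 3.0]
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def zo_gradient(loss_fn, w, mu, rng):
    """Two-point estimate: (L(w + mu*u) - L(w - mu*u)) / (2*mu) * u, with Gaussian u."""
    u = [rng.gauss(0.0, 1.0) for _ in w]
    w_plus = [wi + mu * ui for wi, ui in zip(w, u)]
    w_minus = [wi - mu * ui for wi, ui in zip(w, u)]
    scale = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]

def zo_sgd(loss_fn, w, lr=0.05, mu=1e-3, steps=500, seed=0):
    """Plain SGD driven only by forward loss evaluations (no back-propagation)."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = zo_gradient(loss_fn, w, mu, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w_final = zo_sgd(loss, [0.0, 0.0, 0.0])
print(loss([0.0, 0.0, 0.0]), loss(w_final))
```

Each step needs only two forward evaluations of the loss, which is why no activations have to be kept for a backward pass.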

Problem Statement

Back-propagation causes large activation memory when fine-tuning LLMs. Can zeroth-order (loss-difference) or other BP-free methods enable memory-efficient fine-tuning of large models while keeping acceptable accuracy? The paper benchmarks many ZO variants, tasks, models and PEFT schemes to answer this.

Main Contribution

First systematic benchmark of ZO/BP-free optimizers across 5 LLM families, 3–4 tasks, and 5 fine-tuning schemes.

Empirical findings: task alignment and forward-gradient baselines matter; ZO methods trade memory for variance and runtime.

Key Findings

ZO optimizers cut peak memory by a large margin vs standard FO optimizers.

Numbers: ZO-SGD 64 GB vs FO-SGD 148 GB (peak) on OPT-13B/MultiRC

Practical Use: Use ZO-SGD to fit full OPT-13B fine-tuning on 1×A100 instead of multiple GPUs; this reduces hardware cost when memory is the bottleneck.

Evidence Ref: Table 4

ZO fine-tuning loses accuracy on harder tasks compared to FO methods.

Numbers: WinoGrande: FO ~66.9–68.9% vs ZO ~62.6–64.0% (OPT-13B, LoRA)

Practical Use: Avoid pure ZO for complex reasoning tasks if top accuracy matters; consider hybrid ZO–FO or FO methods instead.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Peak memory	ZO-SGD 64 GB; FO-SGD 148 GB (OPT-13B, MultiRC)	FO-SGD	−84 GB	OPT-13B / MultiRC	Table 4: memory per optimizer	Table 4
Accuracy	FO-SGD 91.4%, ZO-Adam 89.8%, Forward-Grad 90.1%, ZO-SGD 89.4%	FO-SGD	ZO-SGD −2.0 pts vs FO	SST2 / Roberta-Large (Table 2)	Table 2: SST2 results across optimizers	Table 2
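
The deltas in the table can be re-derived from the quoted figures. The short sketch below only repeats that arithmetic; the variable names are illustrative and no new measurements are introduced.

```python
# Figures quoted in the table above (Table 4 for memory, Table 2 for SST2 accuracy).
fo_sgd_mem_gb = 148.0
zo_sgd_mem_gb = 64.0

saved_gb = fo_sgd_mem_gb - zo_sgd_mem_gb       # absolute peak-memory saving
saved_pct = 100.0 * saved_gb / fo_sgd_mem_gb   # relative saving vs FO-SGD

fo_sgd_acc = 91.4
other_acc = {"ZO-Adam": 89.8, "Forward-Grad": 90.1, "ZO-SGD": 89.4}
# Accuracy gap to FO-SGD in percentage points, rounded to one decimal.
accuracy_gap = {name: round(fo_sgd_acc - acc, 1) for name, acc in other_acc.items()}

print(saved_gb, round(saved_pct, 1), accuracy_gap)
```

On these numbers ZO-SGD removes a little more than half of the FO-SGD peak memory while giving up about two accuracy points on SST2.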

What To Try In 7 Days

Run a small ZO-SGD test on your LoRA adapters to verify single-GPU memory fit.

If accuracy drops, add prompt-style task alignment before ZO tuning.

Try block-wise ZO or ~20% perturbation sparsity to reduce ZO variance quickly.
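
The last suggestion can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes "20% sparsity" means that a random 20% of coordinates are left unperturbed, and it assumes block-wise ZO means one two-point estimate per parameter block. The function names, block boundaries, and `toy_loss` are hypothetical.

```python
import random

def sparse_perturbation(dim, sparsity, rng):
    """Gaussian direction with a random `sparsity` fraction of coordinates set to zero."""
    zeroed = set(rng.sample(range(dim), int(round(sparsity * dim))))
    return [0.0 if i in zeroed else rng.gauss(0.0, 1.0) for i in range(dim)]

def two_point_scale(loss_fn, w, u, mu):
    """Central loss difference along direction u."""
    plus = loss_fn([wi + mu * ui for wi, ui in zip(w, u)])
    minus = loss_fn([wi - mu * ui for wi, ui in zip(w, u)])
    return (plus - minus) / (2.0 * mu)

def blockwise_zo_gradient(loss_fn, w, blocks, mu, rng):
    """Estimate each block separately by perturbing only that block's coordinates."""
    grad = [0.0] * len(w)
    for start, end in blocks:
        u = [rng.gauss(0.0, 1.0) if start <= i < end else 0.0 for i in range(len(w))]
        scale = two_point_scale(loss_fn, w, u, mu)
        for i in range(start, end):
            grad[i] = scale * u[i]
    return grad

def toy_loss(w):
    # Hypothetical stand-in loss with gradient 2*w.
    return sum(wi * wi for wi in w)

rng = random.Random(0)
u_sparse = sparse_perturbation(10, 0.2, rng)   # 2 of 10 coordinates stay unperturbed
w = [1.0] * 10
g_block = blockwise_zo_gradient(toy_loss, w, [(0, 5), (5, 10)], 1e-3, rng)
```

Note the cost trade-off in this sketch: the block-wise estimate spends two forward passes per block, so lower variance is paid for with more loss evaluations per step.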

Optimization Features

Token Efficiency

Accuracy

Infra Optimization

single A100 feasible vs multi-GPU FO setups

Model Optimization

block-wise parameter updates

sparsity-induced perturbations

System Optimization

half-precision model loading (F16) for ZO

random-seed trick to avoid storing perturbation vectors
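
One common way to realize the random-seed trick is to regenerate the perturbation from a stored seed each time it is needed, so only the seed and a scalar are kept between passes. The sketch below is an assumption about how that can look, not code from the benchmark repository; `perturb_in_place`, `zo_step_seeded`, and `toy_loss` are hypothetical names.

```python
import random

def perturb_in_place(w, seed, scale):
    """Add scale * u to w in place; u is regenerated from `seed` instead of being stored."""
    rng = random.Random(seed)
    for i in range(len(w)):
        w[i] += scale * rng.gauss(0.0, 1.0)

def zo_step_seeded(loss_fn, w, lr, mu, seed):
    """One ZO-SGD step that keeps only the seed and a scalar, never the vector u."""
    perturb_in_place(w, seed, +mu)            # w + mu*u
    loss_plus = loss_fn(w)
    perturb_in_place(w, seed, -2.0 * mu)      # w - mu*u
    loss_minus = loss_fn(w)
    perturb_in_place(w, seed, +mu)            # back to w (up to rounding)
    projected_grad = (loss_plus - loss_minus) / (2.0 * mu)
    perturb_in_place(w, seed, -lr * projected_grad)   # w -= lr * projected_grad * u
    return projected_grad

def toy_loss(w):
    # Hypothetical stand-in loss with minimum at all ones.
    return sum((wi - 1.0) ** 2 for wi in w)

w = [0.0, 0.0, 0.0, 0.0]
initial = toy_loss(w)
for step in range(300):
    zo_step_seeded(toy_loss, w, lr=0.05, mu=1e-3, seed=step)
final = toy_loss(w)
```

The extra memory per step is then independent of the parameter count, at the price of regenerating the random numbers four times.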

Training Optimization

hybrid ZO–FO layer split

Forward-Grad (forward-mode AD) baseline

ZO variants: ZO-SGD, ZO-Adam, ZO-SGD-MMT, ZO-SGD-Cons, ZO-SGD-Sign
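
Several of the listed ZO variants differ only in how a gradient estimate `g` is turned into a parameter update. The sketch below shows textbook forms of the plain, sign-based, and momentum updates; it assumes these standard definitions apply to ZO-SGD, ZO-SGD-Sign, and ZO-SGD-MMT, and it does not cover ZO-Adam or ZO-SGD-Cons. All function names and the example values are hypothetical.

```python
def zo_sgd_update(w, g, lr):
    """Plain step along the negative gradient estimate."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

def zo_sign_update(w, g, lr):
    """Sign variant: use only the sign of each coordinate of the estimate."""
    sign = lambda x: (x > 0) - (x < 0)
    return [wi - lr * sign(gi) for wi, gi in zip(w, g)]

def zo_momentum_update(w, g, velocity, lr, beta=0.9):
    """Momentum variant: accumulate estimates in a running velocity, then step."""
    velocity = [beta * vi + gi for vi, gi in zip(velocity, g)]
    return [wi - lr * vi for wi, vi in zip(w, velocity)], velocity

w = [1.0, 2.0]
g = [0.5, -4.0]   # an example gradient estimate, e.g. from a two-point ZO query
w_sgd = zo_sgd_update(w, g, lr=0.1)
w_sign = zo_sign_update(w, g, lr=0.1)
w_mmt, v = zo_momentum_update(w, g, velocity=[0.0, 0.0], lr=0.1)
```

The sign update ignores the magnitude of a noisy estimate, while momentum averages estimates over steps; both are ways of damping ZO variance in the update rule itself.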

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/ZO-Bench/ZO-LLM

Risks & Boundaries

Limitations

High variance and slower convergence of ZO estimators at low query budgets.

ZO methods still lag FO on harder reasoning tasks (multi-sentence/commonsense).

When Not To Use

When peak memory is not a constraint and you need best possible accuracy on hard tasks.

When you cannot afford the extra runtime or large query counts to reduce ZO variance.

Failure Modes

Poor prompt/task alignment causes large ZO accuracy drops (~10%).

Using q=1 leads to noisy gradients and unstable tuning.
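
The effect of the query budget can be illustrated on a toy problem: averaging `q` independent two-point estimates shrinks the estimator's error. This is a sketch under assumptions (a quadratic `loss` with known gradient `2*w`, hypothetical function names, arbitrary dimension and trial count), not a reproduction of the paper's experiments.

```python
import math
import random

def loss(w):
    # Hypothetical stand-in loss; its true gradient is 2*w.
    return sum(wi * wi for wi in w)

def zo_gradient_q(loss_fn, w, q, mu, rng):
    """Average q two-point estimates, each along a fresh Gaussian direction."""
    d = len(w)
    est = [0.0] * d
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = loss_fn([wi + mu * ui for wi, ui in zip(w, u)])
        minus = loss_fn([wi - mu * ui for wi, ui in zip(w, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i in range(d):
            est[i] += scale * u[i] / q
    return est

def mean_error(w, q, trials, seed):
    """Mean Euclidean distance between the ZO estimate and the true gradient."""
    rng = random.Random(seed)
    true_grad = [2.0 * wi for wi in w]
    total = 0.0
    for _ in range(trials):
        est = zo_gradient_q(loss, w, q, 1e-3, rng)
        total += math.sqrt(sum((e - g) ** 2 for e, g in zip(est, true_grad)))
    return total / trials

w = [1.0] * 20
err_q1 = mean_error(w, q=1, trials=50, seed=0)
err_q32 = mean_error(w, q=32, trials=50, seed=0)
print(err_q1, err_q32)
```

Each extra query costs two more forward passes, so raising `q` trades runtime for a less noisy update direction.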

Core Entities

Models

Roberta-Large, OPT-1.3B, OPT-13B, LLaMA2-7B, Vicuna-7B, Mistral-7B

Metrics

Accuracy, peak memory (GB), GPU count, runtime per iteration (s), query budget (q)

Datasets

SST2, COPA, WinoGrande, MultiRC

Benchmarks

ZO-LLM benchmark (this paper)
