Benchmarking zeroth-order (no-backprop) optimizers to cut LLM fine-tuning memory and explore practical trade-offs

February 18, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark provides solid memory and accuracy numbers across models and PEFT schemes, but ZO accuracy is variable and sensitive to query budget, task alignment, and model/task scale.

Citations: 7

Evidence Strength: 0.60

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

Links

Abstract / PDF / Code

Why It Matters For Business

ZO methods let fine-tuning of multi-billion-parameter LLMs run with much lower peak memory, often on a single high-memory GPU, which reduces cloud costs and enables training in constrained environments. Expect accuracy or compute trade-offs on hard tasks, however.

Who Should Care

Summary TLDR

This paper runs the first broad benchmark of zeroth-order (ZO) optimizers for fine-tuning large language models without back-propagation. ZO methods (which estimate gradients from loss values) sharply cut peak memory and can let multi-billion-parameter models fit on a single GPU, but they are slower and less accurate on harder tasks. The authors evaluate multiple ZO variants, show that prompt-style task alignment matters, introduce block-wise ZO, hybrid ZO–FO training and sparse perturbations to reduce variance, and publish code to reproduce results.
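To make "estimate gradients from loss values" concrete, here is a minimal sketch of a two-point randomized gradient estimate driving plain SGD. It is an illustration under stated assumptions, not the benchmark's code: the toy quadratic `loss`, the function names, and the hyperparameters (`mu`, `lr`, `steps`) are all hypothetical, and a real run would evaluate an LLM's loss with two forward passes instead.

```python
import random

def loss(theta):
    # Toy quadratic with its minimum at (1, -2, 0.5); stands in for a model's loss.
    target = [1.0, -2.0, 0.5]
    return sum((t - s) ** 2 for t, s in zip(theta, target))

def zo_gradient(loss_fn, theta, mu=1e-3, rng=random):
    # Two-point estimate: perturb along a random Gaussian direction u and
    # scale u by the central finite difference of the loss. No backprop needed.
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
    minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
    scale = (plus - minus) / (2.0 * mu)
    return [scale * ui for ui in u]

def zo_sgd(loss_fn, theta, lr=0.02, steps=2000, seed=0):
    # Plain SGD where the gradient is replaced by the ZO estimate.
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = zo_gradient(loss_fn, theta, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

theta_final = zo_sgd(loss, [0.0, 0.0, 0.0])
```

Each step costs two loss evaluations and stores no activations for a backward pass, which is where the memory saving comes from. The price is a noisy gradient, which is why this toy needs many small steps.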

Problem Statement

Back-propagation causes large activation memory when fine-tuning LLMs. Can zeroth-order (loss-difference) or other BP-free methods enable memory-efficient fine-tuning of large models while keeping acceptable accuracy? The paper benchmarks many ZO variants, tasks, models and PEFT schemes to answer this.

Main Contribution

First systematic benchmark of ZO/BP-free optimizers across 5 LLM families, 3–4 tasks, and 5 fine-tuning schemes.

Empirical findings: task alignment and forward-gradient baselines matter; ZO methods trade memory for variance and runtime.

Key Findings

ZO optimizers cut peak memory by a large margin vs standard FO optimizers.

Numbers: ZO-SGD 64 GB vs FO-SGD 148 GB (peak) on OPT-13B/MultiRC

Practical Use: Use ZO-SGD to fit full OPT-13B fine-tuning on 1×A100 instead of multiple GPUs; reduces hardware cost when memory is the bottleneck.

Evidence Ref: Table 4

ZO fine-tuning loses accuracy on harder tasks compared to FO methods.

Numbers: WinoGrande: FO ~66.9–68.9% vs ZO ~62.6–64.0% (OPT-13B, LoRA)

Practical Use: Avoid pure ZO for complex reasoning tasks if top accuracy matters; consider hybrid ZO–FO or FO methods instead.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Peak memory | ZO-SGD 64 GB; FO-SGD 148 GB (OPT-13B, MultiRC) | FO-SGD | −84 GB | OPT-13B / MultiRC | Table 4: memory per optimizer | Table 4
Accuracy | FO-SGD 91.4%, ZO-Adam 89.8%, Forward-Grad 90.1%, ZO-SGD 89.4% | FO-SGD | ZO-SGD −2.0 pts vs FO | SST2 / RoBERTa-Large (Table 2) | Table 2: SST2 results across optimizers | Table 2
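
The deltas in the results above follow directly from the reported values. A short check, using only the numbers quoted in this summary (the variable names are illustrative):

```python
# Peak memory on OPT-13B / MultiRC, in GB, as reported above.
fo_sgd_gb = 148
zo_sgd_gb = 64
memory_delta_gb = zo_sgd_gb - fo_sgd_gb
memory_reduction_pct = round(100 * (fo_sgd_gb - zo_sgd_gb) / fo_sgd_gb, 1)

# SST2 accuracy on Roberta-Large, in percent, as reported above.
accuracy = {"FO-SGD": 91.4, "ZO-Adam": 89.8, "Forward-Grad": 90.1, "ZO-SGD": 89.4}
accuracy_delta_pts = {
    name: round(acc - accuracy["FO-SGD"], 1)
    for name, acc in accuracy.items()
    if name != "FO-SGD"
}
```

ZO-SGD's peak memory is roughly 57% lower than FO-SGD's in the memory setting. The accuracy gaps come from a different model and task, so the two rows should not be read as a single trade-off point.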

What To Try In 7 Days

Run a small ZO-SGD test on your LoRA adapters to verify single-GPU memory fit.

If accuracy drops, add prompt-style task alignment before ZO tuning.

Try block-wise ZO or ~20% perturbation sparsity to reduce ZO variance quickly.
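
The last suggestion above can be sketched on a flat parameter list. This is a rough illustration, not the paper's implementation: the function names are hypothetical, "sparsity" is assumed to mean the fraction of coordinates left unperturbed, and contiguous index ranges stand in for whatever blocks (for example, layers) a real model would use.

```python
import random

def sparse_direction(dim, sparsity=0.2, rng=random):
    # Gaussian direction with roughly `sparsity` of its coordinates zeroed,
    # so only the remaining coordinates are perturbed.
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n_zero = int(round(sparsity * dim))
    for i in rng.sample(range(dim), n_zero):
        u[i] = 0.0
    return u

def block_slices(dim, n_blocks):
    # Split indices 0..dim-1 into at most n_blocks contiguous (start, stop) ranges.
    size = -(-dim // n_blocks)  # ceiling division
    return [(start, min(start + size, dim)) for start in range(0, dim, size)]

def blockwise_zo_gradient(loss_fn, theta, n_blocks=2, mu=1e-3, rng=random):
    # Estimate the gradient one block at a time: perturb only that block,
    # take a two-point finite difference, and fill in that block's entries.
    grad = [0.0] * len(theta)
    for start, stop in block_slices(len(theta), n_blocks):
        u = [rng.gauss(0.0, 1.0) if start <= i < stop else 0.0
             for i in range(len(theta))]
        plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i in range(start, stop):
            grad[i] = scale * u[i]
    return grad
```

Both ideas shrink the dimension each random direction has to cover, which is the intuition for lower variance. In this sketch the block-wise variant pays for it with two loss evaluations per block instead of two per step.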

Optimization Features

Token Efficiency
Accuracy
Infra Optimization: single A100 feasible vs multi-GPU FO setups
Model Optimization: block-wise parameter updates; sparsity-induced perturbations
System Optimization: half-precision model loading (F16) for ZO; random-seed trick to avoid storing perturbation vectors
Training Optimization: hybrid ZO–FO layer split; Forward-Grad (forward-mode AD) baseline; ZO variants: ZO-SGD, ZO-Adam, ZO-SGD-MMT, ZO-SGD-Cons, ZO-SGD-Sign
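
The random-seed trick listed under System Optimization can be sketched as follows. The idea is to keep only an integer seed and regenerate the same perturbation whenever it is needed, rather than holding a second parameter-sized vector in memory. The function names are hypothetical, and Python's `random.Random` stands in for whatever seeded generator a real training framework would provide.

```python
import random

def perturb_in_place(theta, seed, scale):
    # Re-create the same Gaussian direction from `seed` and add scale * u to theta.
    # The direction itself is never stored.
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)

def zo_step_seeded(loss_fn, theta, seed, lr=0.05, mu=1e-3):
    perturb_in_place(theta, seed, +mu)         # theta + mu * u
    loss_plus = loss_fn(theta)
    perturb_in_place(theta, seed, -2.0 * mu)   # theta - mu * u
    loss_minus = loss_fn(theta)
    perturb_in_place(theta, seed, +mu)         # back to the original theta
    g = (loss_plus - loss_minus) / (2.0 * mu)  # scalar projected gradient
    perturb_in_place(theta, seed, -lr * g)     # update: theta -= lr * g * u
    return g
```

Only the scalar `g` and the seed survive between the loss evaluations and the update, so the extra memory beyond the parameters themselves is constant in this sketch.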

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

High variance and slower convergence of ZO estimators at low query budgets.

ZO methods still lag FO on harder reasoning tasks (multi-sentence/commonsense).
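
The effect of the query budget on estimator variance can be seen on a toy problem. The sketch below averages `q` independent two-point estimates and measures the squared error against the known gradient of a quadratic. Everything here (the toy loss, the function names, the trial count) is an assumption chosen for illustration, not a setting from the benchmark.

```python
import random

def loss(theta):
    # Toy loss whose exact gradient is 2 * theta.
    return sum(t * t for t in theta)

def zo_gradient_q(loss_fn, theta, q=1, mu=1e-3, rng=random):
    # Average q independent two-point estimates; each query costs two loss evaluations.
    grad = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += scale * ui / q
    return grad

def mean_squared_error(q, trials=300, seed=0):
    # Empirical squared error of the q-query estimate against the true gradient.
    rng = random.Random(seed)
    theta = [1.0, -1.0, 2.0, 0.5]
    true_grad = [2.0 * t for t in theta]
    total = 0.0
    for _ in range(trials):
        g = zo_gradient_q(loss, theta, q=q, rng=rng)
        total += sum((gi - ti) ** 2 for gi, ti in zip(g, true_grad))
    return total / trials

mse_q1 = mean_squared_error(1)
mse_q16 = mean_squared_error(16)
```

Averaging independent estimates reduces the error roughly in proportion to 1/q, but each extra query adds two forward passes, which is the runtime cost the limitation above refers to.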

When Not To Use

When peak memory is not a constraint and you need the best possible accuracy on hard tasks.

When you cannot afford the extra runtime or large query counts to reduce ZO variance.

Failure Modes

Poor prompt/task alignment causes large ZO accuracy drops (~10%).

Using q=1 leads to noisy gradients and unstable tuning.
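
As a rough illustration of the prompt/task alignment named in the first failure mode above, a classification example can be rewritten as text that a language model scores with its ordinary next-token loss. The template, the label words, and the `score_fn` argument below are hypothetical placeholders, not the benchmark's actual prompts or API.

```python
def align_sst2(sentence,
               template="{sentence} It was {label_word}.",
               label_words=("terrible", "great")):
    # Build one candidate string per class; index 0 = negative, 1 = positive.
    return [template.format(sentence=sentence, label_word=w) for w in label_words]

def predict(sentence, score_fn):
    # `score_fn` is assumed to return a language-model loss for a string
    # (lower means more likely). The predicted class is the lowest-loss candidate.
    candidates = align_sst2(sentence)
    losses = [score_fn(c) for c in candidates]
    return losses.index(min(losses))
```

With this framing the fine-tuning objective matches the pre-training objective, so the loss values that a ZO estimator differences are already informative at the start of tuning.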

Core Entities

Models

RoBERTa-Large, OPT-1.3B, OPT-13B, LLaMA2-7B, Vicuna-7B, Mistral-7B

Metrics

Accuracy, peak memory (GB), GPU count, runtime per iteration (s), query budget (q)

Datasets

SST2, COPA, WinoGrande, MultiRC

Benchmarks

ZO-LLM benchmark (this paper)