Benchmarking zeroth-order (no-backprop) optimizers to cut LLM fine-tuning memory and explore practical trade-offs

February 18, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

7

Authors

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, Zhangyang Wang, Sijia Liu, Tianlong Chen

Links

Abstract / PDF

Why It Matters For Business

ZO methods let fine-tuning of multi-billion-parameter LLMs run with much lower peak memory, often on a single high-memory GPU, which reduces cloud costs and enables training in constrained environments. Expect accuracy or compute trade-offs on hard tasks, though.

Summary TLDR

This paper runs the first broad benchmark of zeroth-order (ZO) optimizers for fine-tuning large language models without back-propagation. ZO methods (which estimate gradients from loss values) sharply cut peak memory and can let multi-billion-parameter models fit on a single GPU, but they are slower and less accurate on harder tasks. The authors evaluate multiple ZO variants, show that prompt-style task alignment matters, introduce block-wise ZO, hybrid ZO–FO training and sparse perturbations to reduce variance, and publish code to reproduce results.
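To make "estimate gradients from loss values" concrete, here is a minimal sketch of a two-point randomized gradient estimate driving a plain SGD loop. It is an illustration under assumptions, not the paper's code: the toy quadratic `loss`, the helper names `zo_gradient` and `zo_sgd`, and the values of `mu`, `lr` and `steps` are all hypothetical.

```python
import random

def loss(theta):
    # Toy quadratic with its minimum at (1, -2); stands in for an LLM loss.
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def zo_gradient(loss_fn, theta, mu=1e-3, q=1, rng=None):
    """Average q two-point estimates: (L(θ+μu) - L(θ-μu)) / (2μ) * u."""
    rng = rng if rng is not None else random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]  # random direction
        plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu)  # directional derivative estimate
        for i, ui in enumerate(u):
            grad[i] += scale * ui / q
    return grad

def zo_sgd(loss_fn, theta, steps=2000, lr=0.01, seed=0):
    # Only forward evaluations of loss_fn are used; no back-propagation.
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        g = zo_gradient(loss_fn, theta, rng=rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

theta_final = zo_sgd(loss, [0.0, 0.0])
```

Each step costs two loss evaluations per query and stores no activations for a backward pass, which is where the memory saving comes from.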

Problem Statement

Back-propagation causes large activation memory when fine-tuning LLMs. Can zeroth-order (loss-difference) or other BP-free methods enable memory-efficient fine-tuning of large models while keeping acceptable accuracy? The paper benchmarks many ZO variants, tasks, models and PEFT schemes to answer this.

Main Contribution

First systematic benchmark of ZO/BP-free optimizers across 5 LLM families, 3–4 tasks, and 5 fine-tuning schemes.

Empirical findings: task alignment and forward-gradient baselines matter; ZO methods trade memory for variance and runtime.

Three practical improvements: block-wise ZO, hybrid ZO–FO splitting, and sparsity-induced ZO gradient estimation.

Open-source code: https://github.com/ZO-Bench/ZO-LLM for reproducing experiments.
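The hybrid ZO–FO idea listed above can be sketched on a toy problem: one group of parameters is updated with a ZO estimate, the other with an exact gradient. This is a hedged illustration only. The index split, the analytic `fo_gradient` (standing in for back-propagation through part of the model) and all hyperparameters are assumptions; the paper's actual layer split is not reproduced here.

```python
import random

def loss(theta):
    # Toy loss with minimum at all-ones; stands in for a model loss.
    return sum((t - 1.0) ** 2 for t in theta)

def fo_gradient(theta, indices):
    # Analytic gradient, standing in for back-propagation on the FO part only.
    return {i: 2.0 * (theta[i] - 1.0) for i in indices}

def hybrid_step(theta, zo_idx, fo_idx, rng, mu=1e-3, lr=0.05):
    # ZO part: perturb only the ZO-managed coordinates.
    u = {i: rng.gauss(0.0, 1.0) for i in zo_idx}
    plus, minus = list(theta), list(theta)
    for i, ui in u.items():
        plus[i] += mu * ui
        minus[i] -= mu * ui
    scale = (loss(plus) - loss(minus)) / (2.0 * mu)
    # FO part: exact gradient for the remaining coordinates.
    fo_grad = fo_gradient(theta, fo_idx)
    updated = list(theta)
    for i, ui in u.items():
        updated[i] -= lr * scale * ui
    for i, gi in fo_grad.items():
        updated[i] -= lr * gi
    return updated

rng = random.Random(0)
theta = [0.0] * 6
for _ in range(500):
    theta = hybrid_step(theta, zo_idx=[0, 1, 2, 3], fo_idx=[4, 5], rng=rng)
final_loss = loss(theta)
```

In a real model, only the FO-managed layers would need stored activations, so the split trades memory against gradient quality.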

Key Findings

ZO optimizers cut peak memory by more than half (roughly 2.3x) versus standard FO optimizers in the reported setting.

Numbers: ZO-SGD 64 GB vs FO-SGD 148 GB (peak) on OPT-13B/MultiRC

ZO fine-tuning loses accuracy on harder tasks compared to FO methods.

Numbers: WinoGrande: FO ~66.9–68.9% vs ZO ~62.6–64.0% (OPT-13B, LoRA)

Aligning downstream tasks to pretraining format (prompt-based) strongly helps ZO methods.

Numbers: ZO-SGD on SST2 drops from 89.4% to 79.2% (−10.2 points) without alignment

Block-wise ZO and modest gradient sparsity improve accuracy vs vanilla ZO at similar query cost.

Numbers: Block ZO SST2 +2.86 pts (90.83→93.69); sparsity 20% SST2 +1.37 pts (90.83→92.20)

Forward-gradient (forward-mode AD) outperforms ZO when many queries are used.

Numbers: With q>500, Forward-Grad beats ZO-SGD by ~1–2% and approaches FO-SGD
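For readers unfamiliar with forward-gradient methods, the sketch below shows the core idea with a tiny dual-number implementation of forward-mode AD: one forward pass returns the exact directional derivative along a random direction `u`, and averaging over `q` directions gives a gradient estimate. The `Dual` class, the toy `loss` and the function names are illustrative assumptions; a real setup would rely on a framework's forward-mode AD instead.

```python
import random

class Dual:
    """A value together with its directional derivative (tangent)."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.value + o.value, self.tangent + o.tangent)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual._lift(other)
        return Dual(self.value - o.value, self.tangent - o.tangent)

    def __mul__(self, other):
        o = Dual._lift(other)
        # Product rule for the tangent.
        return Dual(self.value * o.value,
                    self.value * o.tangent + self.tangent * o.value)
    __rmul__ = __mul__

def loss(theta):
    # Toy loss; accepts Dual numbers. Gradient is [2(θ0-1)+θ1, θ0].
    return (theta[0] - 1.0) * (theta[0] - 1.0) + theta[0] * theta[1]

def directional_derivative(theta, u):
    # One forward pass yields the exact value of grad(L) · u.
    return loss([Dual(t, ui) for t, ui in zip(theta, u)]).tangent

def forward_gradient(theta, q, rng):
    grad = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        d = directional_derivative(theta, u)
        for i, ui in enumerate(u):
            grad[i] += d * ui / q
    return grad

estimate = forward_gradient([2.0, 3.0], q=20000, rng=random.Random(0))
```

Unlike the finite-difference ZO estimate, the directional derivative here carries no smoothing error, but the estimate still has variance from the random directions, which is why large `q` matters.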

Results

Peak memory

Value: ZO-SGD 64 GB; FO-SGD 148 GB (OPT-13B, MultiRC)

Baseline: FO-SGD

Accuracy (SST2)

Value: FO-SGD 91.4%, ZO-Adam 89.8%, Forward-Grad 90.1%, ZO-SGD 89.4%

Baseline: FO-SGD

Accuracy (WinoGrande)

Value: FO-Adam 68.9%, ZO-Adam 64.0%, ZO-SGD 62.6%

Baseline: FO-Adam

Block-wise ZO improvement

Value: SST2: 90.83% → 93.69% (p=26 blocks)

Baseline: MeZO (ZO-SGD q=1)

Sparsity-induced ZO

Value: SST2: 90.83% → 92.20% (20% sparsity); COPA: 73% → 75% (20% sparsity)

Baseline: vanilla ZO-SGD (0% sparsity)

Who Should Care

What To Try In 7 Days

Run a small ZO-SGD test on your LoRA adapters to verify single-GPU memory fit.

If accuracy drops, add prompt-style task alignment before ZO tuning.

Try block-wise ZO or ~20% perturbation sparsity to reduce ZO variance quickly.
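As a starting point for the block-wise experiment, here is a minimal sketch that perturbs one parameter block at a time and fills in that block's gradient estimate from its own loss difference. The block layout, the toy `loss` and the name `blockwise_zo_gradient` are hypothetical, and how forward passes are budgeted across blocks in the paper is not specified here, so treat the per-block pair of evaluations as an assumption.

```python
import random

TARGET = [1.0, -2.0, 3.0, -4.0]

def loss(theta):
    # Toy separable quadratic; stands in for a model loss.
    return sum((t - c) ** 2 for t, c in zip(theta, TARGET))

def blockwise_zo_gradient(loss_fn, theta, blocks, mu=1e-3, rng=None):
    """Estimate the gradient block by block; `blocks` is a list of index lists."""
    rng = rng if rng is not None else random.Random(0)
    grad = [0.0] * len(theta)
    for block in blocks:
        # Perturb only this block; other coordinates stay fixed.
        u = {i: rng.gauss(0.0, 1.0) for i in block}
        plus, minus = list(theta), list(theta)
        for i, ui in u.items():
            plus[i] += mu * ui
            minus[i] -= mu * ui
        scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
        for i, ui in u.items():
            grad[i] = scale * ui
    return grad

theta = [0.0, 0.0, 0.0, 0.0]
blocks = [[0, 1], [2, 3]]  # e.g. one block per layer in a real model
rng = random.Random(0)
for _ in range(1000):
    g = blockwise_zo_gradient(loss, theta, blocks, rng=rng)
    theta = [t - 0.05 * gi for t, gi in zip(theta, g)]
```

Because each block's estimate only mixes in noise from its own coordinates, the per-block variance is lower than when one perturbation covers every parameter at once.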

Optimization Features

Token Efficiency

  • Accuracy

Infra Optimization

  • single A100 feasible vs multi-GPU FO setups

Model Optimization

  • block-wise parameter updates
  • sparsity-induced perturbations
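A sparsity-induced perturbation can be sketched as a Gaussian direction with a fraction of its entries zeroed, so only the surviving coordinates are perturbed and updated in that step. The sketch assumes "20% sparsity" means roughly 20% of coordinates are masked out with an independent random mask; the paper's exact masking scheme may differ, and all names below are hypothetical.

```python
import random

def loss(theta):
    # Toy loss with minimum at all-ones; stands in for a model loss.
    return sum((t - 1.0) ** 2 for t in theta)

def sparse_perturbation(dim, sparsity, rng):
    """Gaussian direction with roughly a `sparsity` fraction of entries zeroed."""
    return [0.0 if rng.random() < sparsity else rng.gauss(0.0, 1.0)
            for _ in range(dim)]

def sparse_zo_gradient(loss_fn, theta, sparsity=0.2, mu=1e-3, rng=None):
    rng = rng if rng is not None else random.Random(0)
    u = sparse_perturbation(len(theta), sparsity, rng)
    plus = loss_fn([t + mu * ui for t, ui in zip(theta, u)])
    minus = loss_fn([t - mu * ui for t, ui in zip(theta, u)])
    scale = (plus - minus) / (2.0 * mu)
    # Masked coordinates get a zero gradient estimate for this step.
    return [scale * ui for ui in u], u

grad, direction = sparse_zo_gradient(loss, [0.0] * 1000, sparsity=0.2)
zero_fraction = sum(1 for ui in direction if ui == 0.0) / len(direction)
```

The forward-pass cost per step is unchanged; what changes is how many coordinates share the noise of a single scalar loss difference.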

System Optimization

  • half-precision model loading (F16) for ZO
  • random-seed trick to avoid storing perturbation vectors
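The random-seed trick can be illustrated as follows: instead of keeping the perturbation vector in memory, the step stores only a seed and regenerates the same random numbers each time the vector is needed. This is a sketch of the general idea with hypothetical helper names (`perturb_in_place`, `seeded_zo_step`) and a toy loss, not the benchmark's implementation.

```python
import random

def loss(theta):
    # Toy stand-in for a model loss: squared distance to the origin.
    return sum(t * t for t in theta)

def perturb_in_place(theta, seed, scale):
    """Add scale * u to theta, where u is regenerated from `seed`."""
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)

def seeded_zo_step(loss_fn, theta, seed, mu=1e-3, lr=0.05):
    perturb_in_place(theta, seed, +mu)          # theta + mu * u
    loss_plus = loss_fn(theta)
    perturb_in_place(theta, seed, -2.0 * mu)    # theta - mu * u
    loss_minus = loss_fn(theta)
    perturb_in_place(theta, seed, +mu)          # back to the original theta
    projected_grad = (loss_plus - loss_minus) / (2.0 * mu)
    # Update along the same regenerated direction; u is never stored.
    perturb_in_place(theta, seed, -lr * projected_grad)
    return projected_grad

theta = [1.0, -1.0, 0.5]
initial_loss = loss(theta)
for step in range(500):
    seeded_zo_step(loss, theta, seed=step)
final_loss = loss(theta)
```

The extra cost is regenerating the random numbers several times per step; the saving is one full parameter-sized buffer.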

Training Optimization

  • hybrid ZO–FO layer split
  • Forward-Grad (forward-mode AD) baseline
  • ZO variants: ZO-SGD, ZO-Adam, ZO-SGD-MMT, ZO-SGD-Cons, ZO-SGD-Sign
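The ZO variants listed above differ mainly in how a gradient estimate `g` is turned into a parameter update. The sketch below shows generic textbook-style versions of the plain, sign-based and momentum updates, plus a simplified accept-only-if-better rule as one possible reading of the conservative variant. These are assumptions for illustration and may not match the benchmark's exact definitions; ZO-Adam is omitted for brevity.

```python
def sign(x):
    # Returns -1, 0 or 1.
    return (x > 0) - (x < 0)

def sgd_update(theta, g, lr):
    # ZO-SGD style: step along the raw estimate.
    return [t - lr * gi for t, gi in zip(theta, g)]

def sign_update(theta, g, lr):
    # Sign variant: use only the sign of each coordinate of the estimate.
    return [t - lr * sign(gi) for t, gi in zip(theta, g)]

def momentum_update(theta, g, velocity, lr, beta=0.9):
    # Momentum variant: smooth noisy estimates with a running velocity.
    velocity = [beta * v + gi for v, gi in zip(velocity, g)]
    new_theta = [t - lr * v for t, v in zip(theta, velocity)]
    return new_theta, velocity

def conservative_update(loss_fn, theta, g, lr):
    # Assumed rule: keep the step only if it lowers the loss.
    candidate = sgd_update(theta, g, lr)
    return candidate if loss_fn(candidate) < loss_fn(theta) else list(theta)
```

Note that the momentum variant keeps an extra parameter-sized buffer, which is one way a more complex ZO method can eat into the memory saving.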

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • High variance and slower convergence of ZO estimators at low query budgets.
  • ZO methods still lag FO on harder reasoning tasks (multi-sentence/commonsense).
  • Forward-Grad needs forward-mode AD and is less black-box-friendly.
  • Higher query budgets reduce error but raise compute and runtime linearly.
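The query-budget trade-off in the limitations above can be checked on a toy problem: averaging `q` two-point estimates lowers the error of the gradient estimate, while the number of forward passes grows as `2 * q`. The loss, the test point and the trial count below are illustrative assumptions.

```python
import random

def loss(theta):
    # Toy quadratic; its true gradient is 2 * theta.
    return sum(t * t for t in theta)

def zo_estimate(theta, q, rng, mu=1e-3):
    grad = [0.0] * len(theta)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss([t - mu * ui for t, ui in zip(theta, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += scale * ui / q
    return grad

def mean_squared_error(q, trials=200, seed=0):
    # Average squared distance between the estimate and the true gradient.
    rng = random.Random(seed)
    theta = [1.0, -2.0, 0.5, 3.0]
    true_grad = [2.0 * t for t in theta]
    total = 0.0
    for _ in range(trials):
        est = zo_estimate(theta, q, rng)
        total += sum((e - g) ** 2 for e, g in zip(est, true_grad))
    return total / trials

errors = {q: mean_squared_error(q) for q in (1, 4, 16)}
forward_passes = {q: 2 * q for q in (1, 4, 16)}  # two loss evaluations per query
```

Comparing `errors` with `forward_passes` gives a quick feel for how much compute a given reduction in noise costs before running anything on a real model.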

When Not To Use

  • When peak memory is not a constraint and you need best possible accuracy on hard tasks.
  • When you cannot afford the extra runtime or large query counts to reduce ZO variance.
  • When the model or framework cannot support forward-mode AD and you want Forward-Grad benefits.

Failure Modes

  • Poor prompt/task alignment causes large ZO accuracy drops (about 10 percentage points on SST2).
  • Using q=1 leads to noisy gradients and unstable tuning.
  • ZO-Adam and complex ZO methods can lose the memory advantage if implemented without F16.
  • Over-sparsification (beyond roughly 70–90%) can degrade performance or make training unstable.

Core Entities

Models

  • RoBERTa-Large
  • OPT-1.3B
  • OPT-13B
  • LLaMA2-7B
  • Vicuna-7B
  • Mistral-7B

Metrics

  • Accuracy
  • peak memory (GB)
  • GPU count
  • runtime per iteration (s)
  • query budget (q)

Datasets

  • SST2
  • COPA
  • WinoGrande
  • MultiRC

Benchmarks

  • ZO-LLM benchmark (this paper)