Overview
The benchmark provides solid memory and accuracy numbers across models and PEFT schemes, but ZO accuracy is variable and sensitive to query budget, task alignment, and model/task scale.
Citations: 7
Evidence Strength: 0.60
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
ZO methods let multi-billion-parameter LLM fine-tuning run with much lower peak memory, often on a single high-memory GPU, which reduces cloud costs and enables training in constrained environments. Expect accuracy or compute trade-offs on hard tasks.
Summary TLDR
This paper runs the first broad benchmark of zeroth-order (ZO) optimizers for fine-tuning large language models without back-propagation. ZO methods (which estimate gradients from loss values) sharply cut peak memory and can let multi-billion-parameter models fit on a single GPU, but they are slower and less accurate on harder tasks. The authors evaluate multiple ZO variants, show that prompt-style task alignment matters, introduce block-wise ZO, hybrid ZO–FO training and sparse perturbations to reduce variance, and publish code to reproduce results.
Problem Statement
Back-propagation (BP) requires storing activations, which dominates memory when fine-tuning LLMs. Can zeroth-order (loss-difference) or other BP-free methods enable memory-efficient fine-tuning of large models while keeping acceptable accuracy? The paper benchmarks many ZO variants, tasks, models and PEFT schemes to answer this.
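To make the "loss-difference" idea concrete, here is a minimal sketch of a two-point randomized gradient estimate driving a plain SGD update. It is an illustration under stated assumptions, not the paper's implementation: the toy quadratic `loss`, the helper names `zo_grad` and `zo_sgd`, and the values of `mu`, `lr` and `steps` are all hypothetical choices.

```python
import random

def loss(x):
    # Toy stand-in for a model's training loss (assumption): 0.5 * ||x||^2.
    return 0.5 * sum(v * v for v in x)

def zo_grad(f, x, mu=1e-3, rng=random):
    # Two-point estimate: only two loss evaluations, no back-propagation.
    u = [rng.gauss(0.0, 1.0) for _ in x]              # random direction
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)     # directional slope
    return [scale * ui for ui in u]

def zo_sgd(f, x, lr=0.01, steps=2000, seed=0):
    # Plain SGD where the gradient is replaced by the ZO estimate.
    rng = random.Random(seed)
    for _ in range(steps):
        g = zo_grad(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

x0 = [1.0, -2.0, 0.5, 3.0]
x_final = zo_sgd(loss, x0)
print("initial loss:", loss(x0), "final loss:", loss(x_final))
```

Because each step needs only forward evaluations of the loss, no activations have to be kept for a backward pass, which is where the memory saving comes from.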
Main Contribution
First systematic benchmark of ZO/BP-free optimizers across 5 LLM families, 3–4 tasks, and 5 fine-tuning schemes.
Empirical findings: task alignment and forward-gradient baselines matter; ZO methods trade memory for variance and runtime.
Key Findings
ZO optimizers cut peak memory substantially vs standard FO optimizers (64 GB for ZO-SGD vs 148 GB for FO-SGD on OPT-13B / MultiRC).
ZO fine-tuning loses accuracy on harder tasks compared to FO methods.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Peak memory | ZO-SGD 64 GB; FO-SGD 148 GB (OPT-13B, MultiRC) | FO-SGD | −84 GB | OPT-13B / MultiRC | Table 4: memory per optimizer | Table 4 |
| Accuracy | FO-SGD 91.4%, ZO-Adam 89.8%, Forward-Grad 90.1%, ZO-SGD 89.4% | FO-SGD | ZO-SGD −2.0 pts vs FO | SST2 / RoBERTa-Large (Table 2) | Table 2: SST2 results across optimizers | Table 2 |
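The deltas in the table can be rechecked with a few lines of arithmetic. The snippet below uses only the values reported in the two rows above; the variable names are illustrative.

```python
# Peak memory row (OPT-13B / MultiRC), values in GB as reported above.
fo_mem_gb, zo_mem_gb = 148, 64
saved_gb = fo_mem_gb - zo_mem_gb
reduction_pct = 100 * saved_gb / fo_mem_gb

# Accuracy row (SST2), values in percent as reported above.
fo_acc = 91.4
others = {"ZO-Adam": 89.8, "Forward-Grad": 90.1, "ZO-SGD": 89.4}
gaps = {name: round(fo_acc - acc, 1) for name, acc in others.items()}

print(f"memory saved: {saved_gb} GB ({reduction_pct:.1f}% of FO-SGD)")
print("accuracy gap vs FO-SGD (pts):", gaps)
```

On these numbers, ZO-SGD uses a little under half the peak memory of FO-SGD while giving up 2.0 accuracy points on this task.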
What To Try In 7 Days
Run a small ZO-SGD test on your LoRA adapters to verify single-GPU memory fit.
If accuracy drops, add prompt-style task alignment before ZO tuning.
Try block-wise ZO or ~20% perturbation sparsity to reduce ZO variance quickly.
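As a rough sketch of the last suggestion, the helpers below build a sparse perturbation vector and a set of block-wise perturbations. This summary does not spell out the paper's exact definitions, so two assumptions are made here: "20% sparsity" is read as leaving about 20% of coordinates unperturbed, and "block-wise" is read as perturbing one contiguous parameter block at a time. The function names are hypothetical.

```python
import random

def sparse_perturbation(dim, sparsity=0.2, rng=random):
    # Assumed meaning: `sparsity` is the fraction of coordinates set to zero.
    keep = [rng.random() >= sparsity for _ in range(dim)]
    return [rng.gauss(0.0, 1.0) if k else 0.0 for k in keep]

def block_perturbations(dim, n_blocks, rng=random):
    # Assumed meaning: one direction per contiguous block, zero elsewhere.
    size = (dim + n_blocks - 1) // n_blocks
    directions = []
    for start in range(0, dim, size):
        u = [0.0] * dim
        for i in range(start, min(start + size, dim)):
            u[i] = rng.gauss(0.0, 1.0)
        directions.append(u)
    return directions

rng = random.Random(0)
u_sparse = sparse_perturbation(1000, sparsity=0.2, rng=rng)
zero_frac = sum(1 for v in u_sparse if v == 0.0) / len(u_sparse)
u_blocks = block_perturbations(10, n_blocks=3, rng=rng)
print("fraction of unperturbed coordinates:", zero_frac)
print("number of block directions:", len(u_blocks))
```

Either vector can stand in for the dense random direction in a two-point loss-difference estimate; the intent is that each estimate touches fewer parameters at once.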
Risks & Boundaries
Limitations
High variance and slower convergence of ZO estimators at low query budgets.
ZO methods still lag FO on harder reasoning tasks (multi-sentence/commonsense).
When Not To Use
When peak memory is not a constraint and you need best possible accuracy on hard tasks.
When you cannot afford the extra runtime or large query counts to reduce ZO variance.
Failure Modes
Poor prompt/task alignment causes large ZO accuracy drops (~10%).
Using q=1 leads to noisy gradients and unstable tuning.
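The effect of the query budget q can be seen on a toy problem by averaging q independent two-point estimates and measuring the squared error against the known gradient. This is a sketch under assumptions: the quadratic `loss`, the 20-dimensional point, and the helper names `zo_grad_q` and `mean_sq_error` are illustrative and not taken from the paper.

```python
import random

def loss(x):
    # Toy loss (assumption): 0.5 * ||x||^2, whose true gradient is x itself.
    return 0.5 * sum(v * v for v in x)

def zo_grad_q(f, x, q, mu=1e-3, rng=random):
    # Average q two-point estimates; costs 2 * q loss evaluations.
    d = len(x)
    avg = [0.0] * d
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = f([xi + mu * ui for xi, ui in zip(x, u)])
        minus = f([xi - mu * ui for xi, ui in zip(x, u)])
        scale = (plus - minus) / (2.0 * mu)
        for i in range(d):
            avg[i] += scale * u[i] / q
    return avg

def mean_sq_error(q, trials=200, seed=0):
    # Mean squared distance between the estimate and the true gradient.
    rng = random.Random(seed)
    x = [1.0] * 20
    total = 0.0
    for _ in range(trials):
        g = zo_grad_q(loss, x, q, rng=rng)
        total += sum((gi - xi) ** 2 for gi, xi in zip(g, x))
    return total / trials

err_q1 = mean_sq_error(q=1)
err_q16 = mean_sq_error(q=16)
print("mean squared error, q=1: ", err_q1)
print("mean squared error, q=16:", err_q16)
```

For this toy loss the error shrinks roughly in proportion to 1/q, but each extra query adds two forward passes, which is the runtime trade-off noted under "When Not To Use".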

