Prompt LLMs to propose hyperparameters and training code; they match or beat standard HPO early in search.

Overview

Decision Snapshot: Needs Validation

The experiments show consistent improvements in low-budget HPO on several benchmarks and tasks, but results depend on LLM version, prompt format, and reproducibility limits of closed models.

Citations: 7

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, Jimmy Ba

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can find better hyperparameters faster than random search in low-budget settings, speeding model iteration and cutting compute cost when trials are expensive.

Who Should Care

ML Engineer, Data Scientist, Product Manager, Engineering Lead, CTO, Founder

Summary TLDR

The authors prompt large language models (LLMs) such as GPT-4 to propose hyperparameter configurations (or source code that defines the model) in an iterative loop: propose → evaluate → report metric → propose. On standard HPOBench tasks and CIFAR-10 experiments, GPT-4 Turbo often outperforms random search and matches or beats Bayesian optimization in the low-budget regime (tens of evaluations). Code-generation by the LLM can also supply working model+optimizer code and gives strong initial configurations. The method is robust to modest measurement noise and benefits from compact prompts; limitations include reproducibility, cost, and possible hallucinations.

Problem Statement

Hyperparameter tuning is critical but hard when you have only a few trials or limited ML expertise. The paper asks: can general-purpose LLMs, prompted with problem descriptions and past results, recommend hyperparameters (or code) that perform well with small search budgets?

Main Contribution

An iterative LLM-driven HPO loop: prompt with problem + history, get hyperparameters, run, return metric, repeat.

Empirical evaluation on HPOBench (32 tasks) showing GPT-4 Turbo often beats random search with small budgets (10–30 evals).
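The propose, evaluate, report, repeat loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_hpo_loop`, `stub_propose`, and `toy_evaluate` are hypothetical names, the proposer is a stand-in for a real LLM call, and the objective is a toy function rather than a training run.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
History = List[Tuple[Config, float]]


def run_hpo_loop(
    propose: Callable[[str, History], Config],
    evaluate: Callable[[Config], float],
    problem: str,
    budget: int = 10,
) -> Tuple[Config, float, History]:
    """Propose -> evaluate -> report -> repeat, keeping the lowest loss."""
    history: History = []
    for _ in range(budget):
        config = propose(problem, history)  # an LLM call would go here
        loss = evaluate(config)             # e.g. train, then measure validation loss
        history.append((config, loss))      # reported back on the next round
    best_config, best_loss = min(history, key=lambda item: item[1])
    return best_config, best_loss, history


def stub_propose(problem: str, history: History) -> Config:
    """Hypothetical stand-in for an LLM: perturb the best config seen so far."""
    rng = random.Random(len(history))  # seeded so the sketch is repeatable
    if not history:
        return {"log10_lr": -3.0}
    best_config, _ = min(history, key=lambda item: item[1])
    return {"log10_lr": best_config["log10_lr"] + rng.uniform(-0.5, 0.5)}


def toy_evaluate(config: Config) -> float:
    """Toy objective with its minimum at log10_lr = -2 (not a real training run)."""
    return (config["log10_lr"] + 2.0) ** 2


best_config, best_loss, history = run_hpo_loop(
    stub_propose, toy_evaluate, problem="tune the learning rate", budget=10
)
```

In practice `propose` would wrap a chat-completion request that includes the problem description and the history, and `evaluate` would launch the actual training job.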

Key Findings

GPT-4 Turbo beats random search on HPOBench in the 10-evaluation setting.

Numbers: Beats random on 81.25% of tasks; median error change 13.70%; mean change 19.83% (Table 1).

Practical Use: If you have ~10 trials, try GPT-4 Turbo proposals first; they often find better configs than random search on these benchmarks.

Evidence Ref: Table 1 (HPOBench, 10 evaluations)

GPT-4 Turbo remains competitive at longer budgets (60–100 evaluations).

Numbers: At 60 iterations: beats random on 87.5% of tasks; median change 16.31%; mean change 22.76% (Table 5).

Practical Use: For moderately longer searches, LLMs can still be useful and comparable to Bayesian optimization; use LLMs as an alternative or to complement BO.

Evidence Ref: Table 5 (60 iterations)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Beats Random	81.25%	random search	—	HPOBench (32 tasks), 10 evals	GPT-4 Turbo beats random 81.25% of tasks after 10 evaluations.	Table 1
Median change vs random	13.70% reduction in validation error	random search	—	HPOBench, GPT-4 Turbo, 10 evals	Median change in validation error vs random search is 13.70%.	Table 1
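As a rough illustration of how summary numbers like those in the table could be computed from per-task results, the sketch below takes one best validation error per task for each method. The exact definition of "change" used in the paper is not given here, so the relative reduction `(random - llm) / random` is an assumption, as are the function name and the sample values. Note that 81.25% of 32 tasks corresponds to 26 tasks.

```python
from statistics import mean, median
from typing import Dict, Sequence


def summarize_vs_random(
    llm_errors: Sequence[float], random_errors: Sequence[float]
) -> Dict[str, float]:
    """Win rate and relative error reduction of one method against random search.

    Assumes one strictly positive best validation error per task, with both
    sequences listing the tasks in the same order.
    """
    if not llm_errors or len(llm_errors) != len(random_errors):
        raise ValueError("need two non-empty sequences of equal length")
    pairs = list(zip(llm_errors, random_errors))
    wins = sum(llm < rnd for llm, rnd in pairs)
    # Positive change means the LLM reached a lower error than random search.
    changes = [100.0 * (rnd - llm) / rnd for llm, rnd in pairs]
    return {
        "beats_random_pct": 100.0 * wins / len(pairs),
        "median_change_pct": median(changes),
        "mean_change_pct": mean(changes),
    }


# Made-up errors for four tasks, purely to exercise the function.
summary = summarize_vs_random(
    llm_errors=[0.10, 0.20, 0.30, 0.50],
    random_errors=[0.20, 0.25, 0.30, 0.40],
)
```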

What To Try In 7 Days

Run a 10-eval HPO pilot: use GPT-4 Turbo with temperature 0 and compressed prompts.

Try LLM code-generation for 3–5 initial trials to get working model+optimizer code.

Seed a Bayesian optimizer with the first 10 LLM proposals and compare convergence vs vanilla BO.

Optimization Features

Token Efficiency

compressed prompts reduce token cost
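One way to keep prompts compact is to resend only the search space and a one-line-per-trial history instead of the full conversation. The sketch below is an assumption about what such a prompt might look like; the function name and the exact wording are illustrative and are not the prompt format used in the paper.

```python
import json
from typing import Dict, List, Tuple


def build_compact_prompt(
    problem: str,
    search_space: Dict[str, Tuple[float, float]],
    history: List[Tuple[Dict[str, float], float]],
) -> str:
    """Format the task, bounds, and past trials as one short prompt."""
    lines = [problem, "Search space:"]
    for name, (low, high) in search_space.items():
        lines.append(f"- {name}: [{low}, {high}]")
    if history:
        lines.append("Previous trials (config -> validation loss):")
        for config, loss in history:
            # One line per trial keeps the token count roughly linear in trials.
            lines.append(f"{json.dumps(config, sort_keys=True)} -> {loss:.4f}")
    lines.append("Reply with the next config as a single JSON object.")
    return "\n".join(lines)


prompt = build_compact_prompt(
    problem="Minimize validation loss of a small MLP.",
    search_space={"lr": (1e-5, 1.0), "dropout": (0.0, 0.9)},
    history=[
        ({"lr": 0.01, "dropout": 0.1}, 0.4312),
        ({"lr": 0.001, "dropout": 0.3}, 0.3987),
    ],
)
```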

Training Optimization

code-as-hyperparameter, LLM-proposed optimizer choices

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/michaelrzhang/LLM-HyperOpt

Data URLs

https://arxiv.org/abs/2312.04528

https://www.kaggle.com/datasets/nagasai524/nyc-taxi-trip-records-from-jan-2023-to-jun-2023

HPOBench (Eggensperger et al., 2021)

Risks & Boundaries

Limitations

Possible dataset contamination in LLM training data; results may be inflated on known benchmarks.

Closed-model inference is not fully reproducible; even temperature 0 is not always deterministic.

When Not To Use

When strict reproducibility or deterministic outputs are required from opaque LLMs.

When you have large budgets and mature BO pipelines that already converge reliably.

Failure Modes

Hallucinated or invalid code leading to runtime errors.

Recommending poor or previously tried hyperparameters (re-exploration).
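A simple guard can catch some of these failure modes before a trial is spent. The sketch below assumes the LLM is asked to reply with a JSON object of numeric hyperparameters; `parse_proposal` and the bounds are hypothetical. It rejects unparseable replies, unknown or missing keys, out-of-range values, and exact repeats of earlier configs. It does not address errors in generated training code, which would need a separate check such as a short dry run.

```python
import json
from typing import Dict, List, Optional, Tuple

Bounds = Dict[str, Tuple[float, float]]


def parse_proposal(
    reply: str, bounds: Bounds, tried: List[Dict[str, float]]
) -> Optional[Dict[str, float]]:
    """Return a validated config, or None if the reply should be rejected."""
    try:
        raw = json.loads(reply)
    except json.JSONDecodeError:
        return None  # not valid JSON
    if not isinstance(raw, dict) or set(raw) != set(bounds):
        return None  # wrong shape, or missing/unknown hyperparameters
    config: Dict[str, float] = {}
    for name, (low, high) in bounds.items():
        value = raw[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None  # non-numeric value
        if not low <= value <= high:
            return None  # outside the allowed range (also rejects NaN)
        config[name] = float(value)
    if config in tried:
        return None  # exact repeat of an earlier trial
    return config


bounds = {"lr": (1e-5, 1.0), "dropout": (0.0, 0.9)}
accepted = parse_proposal('{"lr": 0.01, "dropout": 0.1}', bounds, tried=[])
repeated = parse_proposal('{"lr": 0.01, "dropout": 0.1}', bounds, tried=[accepted])
```

When a reply is rejected, the caller can re-prompt with a short note on what was wrong, or fall back to a random sample so the trial is not wasted.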

Core Entities

Models

GPT-4, GPT-4 Turbo (gpt-4-1106-preview), GPT-3.5-Turbo, Llama3-8B

Metrics

validation loss, accuracy, minimum test loss, beats random (%), median change vs random (%), mean change vs random (%), mean rank

Datasets

HPOBench (Eggensperger et al.); CIFAR-10; NYC Taxi (Kaggle); 2D toy functions (Rosenbrock, Branin, Ackley, Himmelblau, quadratics)

Benchmarks

HPOBench

