Prompt LLMs to propose hyperparameters and training code; they match or beat standard HPO early in search.

December 7, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The experiments show consistent improvements in low-budget HPO on several benchmarks and tasks, but results depend on LLM version, prompt format, and reproducibility limits of closed models.

Citations: 7

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, Jimmy Ba

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can find better hyperparameters faster than random search in low-budget settings, speeding model iteration and cutting compute cost when trials are expensive.

Who Should Care

Summary TLDR

The authors prompt large language models (LLMs) such as GPT-4 to propose hyperparameter configurations (or source code that defines the model) in an iterative loop: propose → evaluate → report metric → propose. On standard HPOBench tasks and CIFAR-10 experiments, GPT-4 Turbo often outperforms random search and matches or beats Bayesian optimization in the low-budget regime (tens of evaluations). Code-generation by the LLM can also supply working model+optimizer code and gives strong initial configurations. The method is robust to modest measurement noise and benefits from compact prompts; limitations include reproducibility, cost, and possible hallucinations.

Problem Statement

Hyperparameter tuning is critical but hard when you have only a few trials or limited ML expertise. The paper asks: can general-purpose LLMs, prompted with problem descriptions and past results, recommend hyperparameters (or code) that perform well with small search budgets?

Main Contribution

An iterative LLM-driven HPO loop: prompt with problem + history, get hyperparameters, run, return metric, repeat.

Empirical evaluation on HPOBench (32 tasks) showing GPT-4 Turbo often beats random search with small budgets (10–30 evals).
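
The propose, evaluate, report loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `llm_hpo_loop`, `stub_propose`, and `toy_validation_loss` are hypothetical names, the proposer is a local stand-in rather than a real LLM call, and the single `log10_lr` hyperparameter is chosen only for the demo.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
History = List[Tuple[Config, float]]


def llm_hpo_loop(
    propose: Callable[[History], Config],
    evaluate: Callable[[Config], float],
    budget: int = 10,
) -> Tuple[Config, float, History]:
    """Run propose -> evaluate -> report for `budget` rounds (lower metric is better)."""
    history: History = []
    for _ in range(budget):
        config = propose(history)         # the proposer sees every past (config, metric) pair
        metric = evaluate(config)         # e.g. validation loss from one training run
        history.append((config, metric))  # reported back to the proposer on the next round
    best_config, best_metric = min(history, key=lambda item: item[1])
    return best_config, best_metric, history


def stub_propose(history: History) -> Config:
    """Hypothetical stand-in for an LLM call: perturb the best config seen so far."""
    if not history:
        return {"log10_lr": -3.0}
    best, _ = min(history, key=lambda item: item[1])
    return {"log10_lr": best["log10_lr"] + random.uniform(-0.5, 0.5)}


def toy_validation_loss(config: Config) -> float:
    """Toy objective with its minimum at log10_lr = -2 (an assumption for the demo)."""
    return (config["log10_lr"] + 2.0) ** 2


random.seed(0)
best_config, best_metric, history = llm_hpo_loop(
    stub_propose, toy_validation_loss, budget=10
)
print(best_config, best_metric)
```

In a real pilot, `propose` would format the problem description and history into a prompt and parse the model's reply; keeping it as a plain callable makes it easy to swap the LLM for a random or Bayesian baseline under the same budget.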

Key Findings

GPT-4 Turbo beats random search on HPOBench in the 10-evaluation setting.

Numbers: Beats random on 81.25% of tasks; median error change vs random 13.70%; mean change vs random 19.83% (Table 1).

Practical Use: If you have about 10 trials, try GPT-4 Turbo proposals first; they often find better configs than random search on these benchmarks.

Evidence Ref: Table 1 (HPOBench, 10 evaluations)

GPT-4 Turbo remains competitive at longer budgets (60–100 evaluations).

Numbers: At 60 iterations, beats random on 87.5% of tasks; median change 16.31%; mean change 22.76% (Table 5).

Practical Use: For moderately longer searches, LLMs can still be useful and comparable to Bayesian optimization; use LLMs as an alternative or to complement BO.

Evidence Ref: Table 5 (60 iterations)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Beats Random | 81.25% | random search | - | HPOBench (32 tasks), 10 evals | GPT-4 Turbo beats random on 81.25% of tasks after 10 evaluations. | Table 1
Median change vs random | 13.70% reduction in validation error | random search | - | HPOBench, GPT-4 Turbo, 10 evals | Median change in validation error vs random search is 13.70%. | Table 1
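
For readers who want to reproduce this kind of summary on their own tasks, the sketch below computes a beats-random rate plus median and mean change from per-task errors. It rests on an assumption: the relative change is taken as `(random - llm) / random`, which may not match the paper's exact definition, and the error values are made up for illustration. As a sanity check on the reported figure, 81.25% of 32 tasks corresponds to 26 tasks.

```python
from statistics import mean, median
from typing import Dict, List


def compare_to_random(llm_err: List[float], random_err: List[float]) -> Dict[str, float]:
    """Summarize per-task errors of an LLM search against random search.

    Assumes one positive error per task for each method, in the same task order.
    Positive change means the LLM search reached a lower error than random search.
    """
    if len(llm_err) != len(random_err) or not llm_err:
        raise ValueError("need two non-empty error lists of equal length")
    # Assumed definition: percentage reduction in error relative to random search.
    changes = [100.0 * (r - l) / r for l, r in zip(llm_err, random_err)]
    wins = sum(l < r for l, r in zip(llm_err, random_err))  # ties do not count as wins
    return {
        "beats_random_pct": 100.0 * wins / len(llm_err),
        "median_change_pct": median(changes),
        "mean_change_pct": mean(changes),
    }


# Made-up validation errors for four hypothetical tasks.
llm = [0.10, 0.20, 0.30, 0.50]
rnd = [0.20, 0.25, 0.30, 0.40]
summary = compare_to_random(llm, rnd)
print(summary)
```

Reporting the median alongside the mean matters here because a few tasks with very large relative changes can dominate the mean.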

What To Try In 7 Days

Run a 10-eval HPO pilot: use GPT-4 Turbo with temperature 0 and compressed prompts.

Try LLM code-generation for 3–5 initial trials to get working model+optimizer code.

Seed a Bayesian optimizer with the first 10 LLM proposals and compare convergence vs vanilla BO.
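
The seeding experiment in the last item can be prototyped against any optimizer that exposes an ask/tell interface. The sketch below is an illustration under stated assumptions: `RandomAskTell` is a toy stand-in that samples uniformly and is not a Bayesian optimizer, `llm_seeds` holds made-up points rather than real LLM proposals, and seed evaluations are counted against the budget. To run the actual comparison, replace `RandomAskTell` with a real BO implementation that offers equivalent `ask` and `tell` methods.

```python
import random
from typing import Callable, List, Protocol, Sequence, Tuple

Point = List[float]


class AskTellOptimizer(Protocol):
    """Minimal interface assumed for the optimizer being warm-started."""

    def ask(self) -> Point: ...
    def tell(self, x: Point, y: float) -> None: ...


class RandomAskTell:
    """Toy stand-in (uniform sampling) so the sketch runs without extra packages."""

    def __init__(self, bounds: Sequence[Tuple[float, float]], seed: int = 0) -> None:
        self.bounds = list(bounds)
        self.rng = random.Random(seed)
        self.observations: List[Tuple[Point, float]] = []

    def ask(self) -> Point:
        return [self.rng.uniform(lo, hi) for lo, hi in self.bounds]

    def tell(self, x: Point, y: float) -> None:
        self.observations.append((list(x), y))


def run_search(
    optimizer: AskTellOptimizer,
    objective: Callable[[Point], float],
    budget: int,
    seeds: Sequence[Point] = (),
) -> List[float]:
    """Evaluate seed points first, then optimizer proposals; return the best-so-far curve."""
    curve: List[float] = []
    best = float("inf")
    for i in range(budget):
        x = list(seeds[i]) if i < len(seeds) else optimizer.ask()
        y = objective(x)
        optimizer.tell(x, y)  # seeds are reported too, so the optimizer can learn from them
        best = min(best, y)
        curve.append(best)
    return curve


def sphere(x: Point) -> float:
    """Toy objective with its minimum at the origin (an assumption for the demo)."""
    return sum(v * v for v in x)


bounds = [(-5.0, 5.0), (-5.0, 5.0)]
llm_seeds = [[0.5, -0.5], [0.2, 0.1], [1.0, 1.0]]  # made-up stand-ins for LLM proposals

seeded_opt = RandomAskTell(bounds, seed=0)
seeded_curve = run_search(seeded_opt, sphere, budget=10, seeds=llm_seeds)
vanilla_curve = run_search(RandomAskTell(bounds, seed=0), sphere, budget=10)
print(seeded_curve[-1], vanilla_curve[-1])
```

Comparing the two best-so-far curves at equal budget, ideally averaged over several random seeds, gives the convergence comparison the item asks for.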

Optimization Features

Token Efficiency
compressed prompts reduce token cost
Training Optimization
code-as-hyperparameter, LLM-proposed optimizer choices

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Possible dataset contamination in LLM training data; results may be inflated on known benchmarks.

Closed-model inference is not fully reproducible; temperature 0 is not always deterministic.

When Not To Use

When strict reproducibility or deterministic outputs are required from opaque LLMs.

When you have large budgets and mature BO pipelines that already converge reliably.

Failure Modes

Hallucinated or invalid code leading to runtime errors.

Recommending poor or previously tried hyperparameters (re-exploration).
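
One way to contain malformed replies and repeated configurations is to validate every hyperparameter proposal before spending a trial on it. The sketch below is a minimal guard under stated assumptions: the LLM is asked to reply with a flat JSON object of numeric hyperparameters, and `parse_proposal` and the example bounds are hypothetical. It does not cover LLM-generated training code, which would still need sandboxed execution and error handling.

```python
import json
from typing import Dict, List, Optional, Tuple

Bounds = Dict[str, Tuple[float, float]]


def parse_proposal(
    raw: str, bounds: Bounds, tried: List[Dict[str, float]]
) -> Optional[Dict[str, float]]:
    """Return a validated config, or None if the reply is malformed,
    has the wrong keys, is out of range, or repeats an earlier trial."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # reply was not valid JSON
    if not isinstance(data, dict) or set(data) != set(bounds):
        return None  # missing or unexpected hyperparameter names
    config: Dict[str, float] = {}
    for name, (lo, hi) in bounds.items():
        value = data[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None  # non-numeric value
        if not lo <= value <= hi:
            return None  # outside the allowed range (also rejects NaN)
        config[name] = float(value)
    if config in tried:
        return None  # exact repeat of a configuration that was already evaluated
    return config


bounds = {"learning_rate": (1e-5, 1.0), "dropout": (0.0, 0.9)}
tried = [{"learning_rate": 0.01, "dropout": 0.1}]

ok = parse_proposal('{"learning_rate": 0.003, "dropout": 0.2}', bounds, tried)
repeat = parse_proposal('{"learning_rate": 0.01, "dropout": 0.1}', bounds, tried)
out_of_range = parse_proposal('{"learning_rate": 5, "dropout": 0.2}', bounds, tried)
malformed = parse_proposal("try lr=0.003", bounds, tried)
print(ok, repeat, out_of_range, malformed)
```

When the guard returns `None`, a caller can re-prompt the model with the rejection reason or fall back to a random sample, so a bad reply does not consume an expensive training run.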

Core Entities

Models

GPT-4, GPT-4 Turbo (gpt-4-1106-preview), GPT-3.5-Turbo, Llama3-8B

Metrics

validation loss, accuracy, minimum test loss, beats random (%), median change vs random (%), mean change vs random (%), mean rank

Datasets

HPOBench (Eggensperger et al.); CIFAR-10; NYC Taxi (Kaggle); 2D toy functions (Rosenbrock, Branin, Ackley, Himmelblau, quadratics)

Benchmarks

HPOBench