Prompt LLMs to propose hyperparameters and training code; they match or beat standard HPO early in search.

December 7, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

7

Authors

Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, Jimmy Ba

Links

Abstract / PDF

Why It Matters For Business

LLMs can find better hyperparameters faster than random search in low-budget settings, speeding model iteration and cutting compute cost when trials are expensive.

Summary TLDR

The authors prompt large language models (LLMs) such as GPT-4 to propose hyperparameter configurations (or source code that defines the model) in an iterative loop: propose → evaluate → report metric → propose. On standard HPOBench tasks and CIFAR-10 experiments, GPT-4 Turbo often outperforms random search and matches or beats Bayesian optimization in the low-budget regime (tens of evaluations). Code-generation by the LLM can also supply working model+optimizer code and gives strong initial configurations. The method is robust to modest measurement noise and benefits from compact prompts; limitations include reproducibility, cost, and possible hallucinations.

Problem Statement

Hyperparameter tuning is critical but hard when you have only a few trials or limited ML expertise. The paper asks: can general-purpose LLMs, prompted with problem descriptions and past results, recommend hyperparameters (or code) that perform well with small search budgets?

Main Contribution

An iterative LLM-driven HPO loop: prompt with problem + history, get hyperparameters, run, return metric, repeat.

Empirical evaluation on HPOBench (32 tasks) showing GPT-4 Turbo often beats random search with small budgets (10–30 evals).

Show LLMs can generate model+optimizer code; treating code as a hyperparameter gives strong initial settings with very small budgets (5 evals).

Study prompting choices: chat vs compressed history, optional chain-of-thought (CoT), and sensitivity to prompt detail and noisy metrics.

Demonstrate hybrid use: use LLM proposals to seed Bayesian optimization, improving performance in many tasks.

Key Findings

GPT-4 Turbo beats random search on HPOBench in the 10-evaluation setting.

NumbersBeats random 81.25%; median error change 13.70%; mean change 19.83% (Table 1).

GPT-4 Turbo remains competitive at longer budgets (60–100 evaluations).

NumbersAt 60 iters: beats random 87.5%; median change 16.31%; mean 22.76% (Table 5).

LLM code generation yields strong initial models in very small budgets (5 evals).

NumbersMin test loss (5 evals): code gen 2.754e-4 ±9.241e-5 vs random 3.757e-3 ±1.172e-3 (Table 4).

Chain-of-thought (CoT) gives mixed but sometimes helpful gains and useful explanations.

NumbersGPT-4 Turbo with/without CoT both beat random 81.25%; median change ~13.7–15.6% (Table 2).

LLM proposals are fairly robust to noisy evaluation metrics.

NumbersPerformance similar when validation metrics were multiplied by uniform noise in (0.9,1.1) (CIFAR-10 ablation).

LLMs can seed Bayesian optimization effectively.

NumbersUsing GPT-4 Turbo for the first 10 steps improved or matched BO on 65.6% of tasks (32 total).

Results

Beats Random

Value81.25%

Baselinerandom search

Median change vs random

Value13.70% reduction in validation error

Baselinerandom search

Min test loss (5 evals)

Value2.754e-4 ± 9.241e-5

Baselinerandom search 3.757e-3 ± 1.172e-3

Beats Random (longer trajectory)

Value87.50%

Baselinerandom search

Who Should Care

What To Try In 7 Days

Run a 10-eval HPO pilot: use GPT-4 Turbo with temperature 0 and compressed prompts.

Try LLM code-generation for 3–5 initial trials to get working model+optimizer code.

Seed a Bayesian optimizer with the first 10 LLM proposals and compare convergence vs vanilla BO.

Optimization Features

Token Efficiency

  • compressed prompts reduce token cost

Training Optimization

  • code-as-hyperparameter
  • LLM-proposed optimizer choices

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Possible dataset contamination in LLM training data; results may be inflated on known benchmarks.
  • Closed-model inference is not fully reproducible; temperature 0 not always deterministic.
  • Cost can grow with iterations for closed LLMs; model choice affects performance.
  • LLMs can hallucinate or suggest inappropriate regularization or architectures.
  • Performance can vary with random seed and prompt wording.

When Not To Use

  • When strict reproducibility or deterministic outputs are required from opaque LLMs.
  • When you have large budgets and mature BO pipelines that already converge reliably.
  • If you must avoid any risk of model-suggested code or security concerns from generated code.

Failure Modes

  • Hallucinated or invalid code leading to runtime errors.
  • Recommending poor or previously tried hyperparameters (re-exploration).
  • Overconfidence in rationale; explanations may not reflect causal effects.
  • Performance variation across LLM versions or random seeds.

Core Entities

Models

  • GPT-4
  • GPT-4 Turbo (gpt-4-1106-preview)
  • GPT-3.5-Turbo
  • Llama3-8B

Metrics

  • validation loss
  • Accuracy
  • minimum test loss
  • beats random (%)
  • median change vs random (%)
  • mean change vs random (%)
  • mean rank

Datasets

  • HPOBench (Eggensperger et al.)
  • CIFAR-10
  • NYC Taxi (Kaggle)
  • 2D toy functions (Rosenbrock, Branin, Ackley, Himmelblau, quadratics)

Benchmarks

  • HPOBench