Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
7
Why It Matters For Business
LLMs can find better hyperparameters faster than random search in low-budget settings, speeding model iteration and cutting compute cost when trials are expensive.
Summary TLDR
The authors prompt large language models (LLMs) such as GPT-4 to propose hyperparameter configurations (or source code that defines the model) in an iterative loop: propose → evaluate → report metric → propose. On standard HPOBench tasks and CIFAR-10 experiments, GPT-4 Turbo often outperforms random search and matches or beats Bayesian optimization in the low-budget regime (tens of evaluations). Code-generation by the LLM can also supply working model+optimizer code and gives strong initial configurations. The method is robust to modest measurement noise and benefits from compact prompts; limitations include reproducibility, cost, and possible hallucinations.
Problem Statement
Hyperparameter tuning is critical but hard when you have only a few trials or limited ML expertise. The paper asks: can general-purpose LLMs, prompted with problem descriptions and past results, recommend hyperparameters (or code) that perform well with small search budgets?
Main Contribution
An iterative LLM-driven HPO loop: prompt with problem + history, get hyperparameters, run, return metric, repeat.
Empirical evaluation on HPOBench (32 tasks) showing GPT-4 Turbo often beats random search with small budgets (10–30 evals).
Show LLMs can generate model+optimizer code; treating code as a hyperparameter gives strong initial settings with very small budgets (5 evals).
Study prompting choices: chat vs compressed history, optional chain-of-thought (CoT), and sensitivity to prompt detail and noisy metrics.
Demonstrate hybrid use: use LLM proposals to seed Bayesian optimization, improving performance in many tasks.
Key Findings
GPT-4 Turbo beats random search on HPOBench in the 10-evaluation setting.
GPT-4 Turbo remains competitive at longer budgets (60–100 evaluations).
LLM code generation yields strong initial models in very small budgets (5 evals).
Chain-of-thought (CoT) gives mixed but sometimes helpful gains and useful explanations.
LLM proposals are fairly robust to noisy evaluation metrics.
LLMs can seed Bayesian optimization effectively.
Results
Beats Random
Median change vs random
Min test loss (5 evals)
Beats Random (longer trajectory)
Who Should Care
What To Try In 7 Days
Run a 10-eval HPO pilot: use GPT-4 Turbo with temperature 0 and compressed prompts.
Try LLM code-generation for 3–5 initial trials to get working model+optimizer code.
Seed a Bayesian optimizer with the first 10 LLM proposals and compare convergence vs vanilla BO.
Optimization Features
Token Efficiency
- compressed prompts reduce token cost
Training Optimization
- code-as-hyperparameter
- LLM-proposed optimizer choices
Reproducibility
Data Urls
- https://arxiv.org/abs/2312.04528
- https://www.kaggle.com/datasets/nagasai524/nyc-taxi-trip-records-from-jan-2023-to-jun-2023
- HPOBench (Eggensperger et al., 2021)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Possible dataset contamination in LLM training data; results may be inflated on known benchmarks.
- Closed-model inference is not fully reproducible; temperature 0 not always deterministic.
- Cost can grow with iterations for closed LLMs; model choice affects performance.
- LLMs can hallucinate or suggest inappropriate regularization or architectures.
- Performance can vary with random seed and prompt wording.
When Not To Use
- When strict reproducibility or deterministic outputs are required from opaque LLMs.
- When you have large budgets and mature BO pipelines that already converge reliably.
- If you must avoid any risk of model-suggested code or security concerns from generated code.
Failure Modes
- Hallucinated or invalid code leading to runtime errors.
- Recommending poor or previously tried hyperparameters (re-exploration).
- Overconfidence in rationale; explanations may not reflect causal effects.
- Performance variation across LLM versions or random seeds.
Core Entities
Models
- GPT-4
- GPT-4 Turbo (gpt-4-1106-preview)
- GPT-3.5-Turbo
- Llama3-8B
Metrics
- validation loss
- Accuracy
- minimum test loss
- beats random (%)
- median change vs random (%)
- mean change vs random (%)
- mean rank
Datasets
- HPOBench (Eggensperger et al.)
- CIFAR-10
- NYC Taxi (Kaggle)
- 2D toy functions (Rosenbrock, Branin, Ackley, Himmelblau, quadratics)
Benchmarks
- HPOBench

