Use an LLM (GPT-3.5) to warmstart, model, and sample for Bayesian optimization; improves early-stage hyperparameter tuning

February 6, 2024 (8 min read)

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

11

Authors

Tennison Liu, Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar

Why It Matters For Business

LLAMBO can reduce the number of expensive evaluations in hyperparameter tuning by using an LLM for initial guesses, early surrogates, and targeted sampling. The trade-off is higher per-iteration compute and API cost in exchange for fewer total experiments.

Summary TLDR

LLAMBO integrates a large language model (GPT-3.5) into the Bayesian optimization (BO) loop. The paper shows three practical uses: zero-shot warmstarting (suggesting initial configs), an LLM-based surrogate (predicting scores and uncertainty from few examples), and an LLM conditional sampler (generating candidates for a target objective value). Across 74 tasks (Bayesmark, HPOBench, private and synthetic data), LLAMBO speeds up early search and often lowers regret versus standard BO tools, at the cost of higher per-iteration compute and somewhat weaker uncertainty calibration than Gaussian processes.

Problem Statement

Bayesian optimization struggles when observations are very sparse. Surrogates and samplers need strong priors or many samples to find good regions fast. Can general-purpose LLMs, using in-context learning (ICL) and prompts, supply priors and few-shot generalization to improve BO components without finetuning?

Main Contribution

Introduce LLAMBO: a modular pipeline that uses an LLM via prompts to warmstart, act as a surrogate, and generate candidate points for BO.

Design prompt formats and ICL recipes for three BO components: zero-shot warmstarting, discriminative/generative surrogate modeling, and target-conditioned candidate sampling.

Systematic empirical study across 74 HPO tasks (Bayesmark, HPOBench, private and synthetic) showing sample-efficiency gains, especially early in search.

Open-source code for reproducing experiments: two GitHub repositories provided.
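
To make the prompt formats above concrete, here is a minimal sketch of how a zero-shot warmstart prompt and a few-shot surrogate prompt might be assembled. The wording of the prompts, the function names, and the "name is value" serialization are illustrative assumptions, not the templates from the paper's repositories.

```python
from typing import Dict, List, Tuple

Config = Dict[str, float]
Space = Dict[str, Tuple[float, float]]  # hypothetical: name -> (low, high)


def serialize_config(config: Config) -> str:
    """Render one configuration as 'name is value' pairs in a stable key order."""
    return ", ".join(f"{name} is {value:g}" for name, value in sorted(config.items()))


def warmstart_prompt(task: str, space: Space, n: int) -> str:
    """Zero-shot prompt: describe the task and search space, then ask for n configs."""
    bounds = "\n".join(
        f"- {name}: between {low:g} and {high:g}"
        for name, (low, high) in sorted(space.items())
    )
    return (
        f"You are tuning hyperparameters for: {task}\n"
        f"Search space:\n{bounds}\n"
        f"Recommend {n} diverse configurations, one per line."
    )


def surrogate_prompt(observed: List[Tuple[Config, float]], query: Config) -> str:
    """Few-shot prompt: observed (config, score) pairs, then the query left open."""
    shots = "\n".join(
        f"Hyperparameter configuration: {serialize_config(cfg)}\nPerformance: {score:.4f}"
        for cfg, score in observed
    )
    return (
        f"{shots}\n"
        f"Hyperparameter configuration: {serialize_config(query)}\nPerformance:"
    )
```

The surrogate prompt ends on an open "Performance:" line so that the model's completion can be parsed as the predicted score. Any real template would also need task metadata and output-format instructions.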

Key Findings

Zero-shot LLM warmstarting beats random initializations for HPO tasks.

Numbers: evaluated over 25 trials with 5 init points; improvement visible for trials < 5

LLM-based discriminative surrogate improves prediction accuracy in few-shot regimes.

Numbers: tested at n={5,10,20,30}; largest gains at n=5

LLM sampler that conditions on a target value finds higher-quality candidates, especially early.

Numbers: LLAMBO has the lowest average regret and the best regret at n=5; the best-regret results were observed at α ≈ 0.01

End-to-end LLAMBO yields the best tuning performance across public and private HPO tasks used.

Numbers: evaluated on 50 HPO tasks (25 Bayesmark public + private/synthetic); averaged over 5 seeds and 25 trials

Results

End-to-end tuning (average regret)

Value: LLAMBO has the lowest average regret on public and private/synthetic HPO tasks

Baseline: GP-DKL, SKOpt (GP), Optuna (TPE), SMAC3

Surrogate prediction (NRMSE & R^2)

Value: LLAMBO outperforms baselines in prediction accuracy

Baseline: GP, SMAC

Surrogate uncertainty (calibration)

Value: LLAMBO has worse calibration than GP but similar to SMAC

Baseline: GP (best calibration)

Candidate sampling quality (regret of sampled points)

Value: LLAMBO sampler has the lowest average regret and best regret, especially at n=5

Baseline: TPE (Ind/Multi), Random

Who Should Care

What To Try In 7 Days

Prompt your LLM (GPT-3.5) for 5–10 warmstart hyperparameter configs and use them instead of random starts.

Use the LLM-based discriminative surrogate for early rounds (when you have fewer than 10 runs) to guide the search.

Try the LLM conditional sampler with α ∈ {-0.2, -0.1, 0.01} and pick the α that balances diversity and improvement.
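
The role of α in the conditional sampler can be illustrated with a small calculation. This sketch assumes a minimized objective and a target of the form s_target = s_min − α·(s_max − s_min), which is our reading of how the target value is set; the function name is hypothetical.

```python
from typing import Sequence


def target_score(scores: Sequence[float], alpha: float) -> float:
    """Target objective value to condition the sampler on (lower is better).

    Assumed form: s_target = s_min - alpha * (s_max - s_min).
    alpha > 0 asks for a value better than the best seen so far;
    alpha < 0 asks for a value inside the observed range.
    """
    if not scores:
        raise ValueError("need at least one observed score")
    s_min, s_max = min(scores), max(scores)
    return s_min - alpha * (s_max - s_min)


# Hypothetical validation errors from four earlier runs.
observed = [0.30, 0.25, 0.40, 0.20]
targets = {alpha: target_score(observed, alpha) for alpha in (-0.2, -0.1, 0.01)}
```

Under this assumption, a negative α requests candidates near the current best without extrapolating, while a small positive α requests a slight improvement over anything observed.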

Agent Features

Memory

  • In-context learning (few-shot examples in prompts)

Tool Use

  • LLM for warmstarting
  • LLM as surrogate (discriminative & generative)
  • LLM for conditional candidate sampling

Frameworks

  • LLAMBO modular design (components can be integrated separately)

Architectures

  • LLM (GPT-3.5) used as a model-in-the-loop component

Optimization Features

Inference Optimization

  • Monte Carlo repeated LLM calls (K=10) with prompt shuffling to estimate uncertainty
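
The Monte Carlo procedure above can be sketched as follows. The `predict_once` callable stands in for a single LLM call and is a hypothetical interface; here it is replaced by a deterministic stub that is deliberately sensitive to example order, so that the effect of shuffling is visible without any API access.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, float]  # (serialized config, observed score)
Predictor = Callable[[Sequence[Example], str], float]


def mc_predict(
    predict_once: Predictor,
    examples: Sequence[Example],
    query: str,
    k: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Call the predictor k times with reshuffled few-shot examples.

    Returns the mean prediction and the sample standard deviation,
    which serves as a rough uncertainty estimate.
    """
    rng = random.Random(seed)
    preds: List[float] = []
    for _ in range(k):
        shuffled = list(examples)
        rng.shuffle(shuffled)  # a new example order for every call
        preds.append(predict_once(shuffled, query))
    spread = statistics.stdev(preds) if k > 1 else 0.0
    return statistics.mean(preds), spread


def recency_biased_stub(examples: Sequence[Example], query: str) -> float:
    """Stand-in for an LLM: weights later examples more, so order matters."""
    weights = range(1, len(examples) + 1)
    total = sum(w * score for w, (_, score) in zip(weights, examples))
    return total / sum(weights)
```

With a real model, `predict_once` would build the prompt from the shuffled examples, call the LLM, and parse a number from the completion.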

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher per-iteration compute and API latency than standard BO; wall-clock time is dominated by LLM calls.
  • Uncertainty calibration is weaker than Gaussian processes at low sample counts.
  • Experiments are on relatively low-dimensional HPO tasks; performance on high-dimensional search spaces is untested.
  • Results depend on the choice of LLM (paper uses GPT-3.5); other LLMs may behave differently.

When Not To Use

  • When per-query runtime or API cost is the primary constraint and you cannot afford LLM calls.
  • On very high-dimensional search spaces where prompt-based ICL may not capture structure.
  • If strict probabilistic uncertainty quantification is essential for decision making.

Failure Modes

  • Prompt-order sensitivity: model predictions change with example ordering, hurting calibration unless shuffled.
  • Majority-label bias for generative surrogate when the good/bad split τ is unbalanced.
  • Over-confident MC estimates if naive sampling is used without prompt shuffling.
  • LLM may propose invalid or out-of-range hyperparameters if instructions are incomplete, which lowers the acceptance rate.
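
One way to guard against the last failure mode is to validate every proposed configuration before evaluating it and to track the acceptance rate. The sketch below assumes a purely numeric search space with box bounds; the `Space` type and function names are illustrative, not part of LLAMBO.

```python
from typing import Dict, List, Tuple

Space = Dict[str, Tuple[float, float]]  # hypothetical: name -> (low, high)


def is_valid(config: Dict[str, object], space: Space) -> bool:
    """Accept a config only if it has exactly the expected keys and in-range numbers."""
    if set(config) != set(space):
        return False
    for name, (low, high) in space.items():
        value = config[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not low <= value <= high:  # NaN fails this comparison as well
            return False
    return True


def filter_proposals(
    proposals: List[Dict[str, object]], space: Space
) -> Tuple[List[Dict[str, object]], float]:
    """Return the valid proposals and the fraction that passed validation."""
    valid = [cfg for cfg in proposals if is_valid(cfg, space)]
    rate = len(valid) / len(proposals) if proposals else 0.0
    return valid, rate
```

A falling acceptance rate is a useful signal that the prompt's instructions about bounds or output format need tightening.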

Core Entities

Models

  • gpt-3.5-turbo
  • Gaussian Process
  • SMAC (RandomForest surrogate)
  • TPE
  • LLAMBO (this work)

Metrics

  • normalized regret
  • NRMSE
  • R^2
  • log predictive density (LPD)
  • coverage
  • sharpness
  • generalized variance
  • log-likelihood
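
Two of the metrics above can be computed in a few lines. The definitions here are common conventions and are assumptions on our part: NRMSE as RMSE divided by the range of the true values, and normalized regret as the gap between the best observed score and the best achievable score, scaled by the task's score range (lower is better).

```python
import math
from typing import Sequence


def nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error divided by the range of the true values."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    span = max(y_true) - min(y_true)
    if span == 0:
        raise ValueError("true values have zero range")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / span


def normalized_regret(observed: Sequence[float], best: float, worst: float) -> float:
    """Gap between the best observed score and the optimum, scaled to [0, 1].

    Assumes a minimized objective; `best` and `worst` are the task's
    best and worst achievable scores.
    """
    if not observed or worst == best:
        raise ValueError("need observations and a non-zero score range")
    return (min(observed) - best) / (worst - best)
```

A normalized regret of 0 means the search has found the best achievable score, which makes the value comparable across tasks with different score scales.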

Datasets

  • Bayesmark
  • HPOBench
  • SEER (private)
  • MAGGIC (private)
  • CUTRACT (private)
  • Rosenbrock (synthetic)
  • Griewank (synthetic)
  • KTablet (synthetic)

Benchmarks

  • Bayesmark
  • HPOBench

Context Entities

Models

  • prompt templates for warmstarting
  • ICL (in-context learning) few-shot prompts
  • discriminative surrogate via MC/shuffle
  • generative surrogate as binary classifier
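
For the generative surrogate treated as a binary classifier, observations must first be labeled good or bad. The sketch below assumes τ is a quantile of the observed scores with lower scores being better, which is an assumption on our part; the function name is hypothetical. It also shows how a small τ produces the class imbalance behind the majority-label bias noted under Failure Modes.

```python
import math
from typing import List, Sequence


def label_by_quantile(scores: Sequence[float], tau: float) -> List[int]:
    """Label a score 1 (good) if it is among the best tau fraction, else 0.

    Assumes lower scores are better. Ties at the threshold are labeled good.
    """
    if not scores:
        raise ValueError("need at least one score")
    if not 0 < tau < 1:
        raise ValueError("tau must lie strictly between 0 and 1")
    n_good = max(1, math.floor(tau * len(scores)))
    threshold = sorted(scores)[n_good - 1]
    return [1 if s <= threshold else 0 for s in scores]


# Hypothetical validation errors from eight earlier runs.
scores = [0.8, 0.1, 0.7, 0.2, 0.6, 0.3, 0.5, 0.4]
labels = label_by_quantile(scores, tau=0.25)
good_fraction = sum(labels) / len(labels)
```

With τ = 0.25 only a quarter of the few-shot examples carry the "good" label, so a classifier prompted with them can lean toward predicting "bad" regardless of the query.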