Use an LLM (GPT-3.5) to warmstart, model, and sample for Bayesian optimization; improves early-stage hyperparameter tuning

February 6, 2024 | 8 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent early-stage gains across public and private HPO tasks, but relies on LLM API calls (higher runtime/cost) and shows weaker uncertainty calibration than GPs.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

Tennison Liu, Nicolás Astorga, Nabeel Seedat, Mihaela van der Schaar

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLAMBO can reduce the number of expensive evaluations in hyperparameter tuning by using an LLM for initial guesses, early surrogates, and targeted sampling. The trade-off is higher per-iteration compute and API cost in exchange for fewer total experiments.

Summary TLDR

LLAMBO integrates a large language model (GPT-3.5) into the Bayesian optimization (BO) loop. The paper shows three practical uses: zero‑shot warmstarting (suggest initial configs), an LLM-based surrogate (predict scores and uncertainty from few examples), and an LLM conditional sampler (generate candidates for a target objective). Across 74 tasks (Bayesmark, HPOBench, private and synthetic data), LLAMBO speeds up early search and often lowers regret versus standard BO tools, at the cost of higher per-iteration compute and somewhat weaker uncertainty calibration than Gaussian processes.

Problem Statement

Bayesian optimization struggles when observations are very sparse. Surrogates and samplers need strong priors or many samples to find good regions fast. Can general-purpose LLMs, using in-context learning (ICL) and prompts, supply priors and few-shot generalization to improve BO components without finetuning?

Main Contribution

Introduce LLAMBO: a modular pipeline that uses an LLM via prompts to warmstart, act as a surrogate, and generate candidate points for BO.

Design prompt formats and ICL recipes for three BO components: zero-shot warmstarting, discriminative/generative surrogate modeling, and target-conditioned candidate sampling.
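
The modular structure described above can be sketched as a BO loop with three pluggable callables. This is a minimal illustration, not the authors' implementation: `bo_loop`, `random_warmstart`, `nearest_neighbor_surrogate`, and `perturb_best_sampler` are hypothetical names, the stand-ins make no LLM calls, and the lower-confidence-bound acquisition is an assumption chosen for brevity.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[float, float]]  # (config, score); 1-D configs for brevity


def bo_loop(
    objective: Callable[[float], float],
    warmstart: Callable[[int], List[float]],
    surrogate: Callable[[History, float], Tuple[float, float]],
    sampler: Callable[[History, int], List[float]],
    n_init: int = 5,
    n_iter: int = 10,
    n_cand: int = 8,
    kappa: float = 1.0,
) -> History:
    """Minimize `objective` with swappable warmstart, surrogate, and sampler."""
    history: History = [(x, objective(x)) for x in warmstart(n_init)]
    for _ in range(n_iter):
        candidates = sampler(history, n_cand)

        def lcb(x: float) -> float:
            # Assumed acquisition: lower confidence bound (mean - kappa * std).
            mean, std = surrogate(history, x)
            return mean - kappa * std

        x_next = min(candidates, key=lcb)
        history.append((x_next, objective(x_next)))
    return history


rng = random.Random(0)


def random_warmstart(n: int) -> List[float]:
    # Stand-in for the LLM warmstart: uniform random starts.
    return [rng.uniform(-2.0, 2.0) for _ in range(n)]


def nearest_neighbor_surrogate(history: History, x: float) -> Tuple[float, float]:
    # Stand-in for the LLM surrogate: score of the closest observed point,
    # with uncertainty growing with distance to that point.
    x0, y0 = min(history, key=lambda h: abs(h[0] - x))
    return y0, abs(x0 - x)


def perturb_best_sampler(history: History, n: int) -> List[float]:
    # Stand-in for the LLM sampler: Gaussian perturbations of the best config.
    best_x = min(history, key=lambda h: h[1])[0]
    return [best_x + rng.gauss(0.0, 0.5) for _ in range(n)]


history = bo_loop(
    lambda x: (x - 1.0) ** 2,
    random_warmstart,
    nearest_neighbor_surrogate,
    perturb_best_sampler,
)
best_x, best_y = min(history, key=lambda h: h[1])
print(f"best config {best_x:.3f} with score {best_y:.4f}")
```

Because each component is just a callable, any one of the stand-ins could be replaced by an LLM-backed version without touching the other two.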

Key Findings

Zero-shot LLM warmstarting beats random initializations for HPO tasks.

Numbers: evaluated over 25 trials with 5 init points; improvement visible for trials < 5

Practical Use: Use an LLM prompt to generate 5–20 warmstart configurations to accelerate early BO progress instead of pure random or Sobol starts.

Evidence Ref: Section 4; Figure 2 (Warmstarting)
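
A minimal sketch of the warmstarting step might look like the following. The prompt wording, the JSON reply format, the search space, and the names `build_warmstart_prompt`, `parse_configs`, `warmstart`, and `fake_llm` are all assumptions made for illustration; the paper's actual prompt templates are not reproduced here. The `llm` argument is any function that maps a prompt string to a reply string, so a real API client can be swapped in.

```python
import json
from typing import Callable, Dict, List, Tuple

SearchSpace = Dict[str, Tuple[float, float]]  # name -> (low, high), hypothetical


def build_warmstart_prompt(task: str, space: SearchSpace, n: int) -> str:
    """Describe the task and ranges, then ask for n configs as JSON."""
    ranges = "\n".join(f"- {name}: [{lo}, {hi}]" for name, (lo, hi) in space.items())
    return (
        f"You are helping tune hyperparameters for: {task}.\n"
        f"Hyperparameters and ranges:\n{ranges}\n"
        f"Suggest {n} diverse starting configurations as a JSON list of objects."
    )


def parse_configs(reply: str, space: SearchSpace) -> List[Dict[str, float]]:
    """Keep only well-formed configs and clip each value into its range."""
    try:
        raw = json.loads(reply)
    except json.JSONDecodeError:
        return []
    configs: List[Dict[str, float]] = []
    for item in raw if isinstance(raw, list) else []:
        if not isinstance(item, dict) or set(item) != set(space):
            continue  # drop configs with missing or unexpected keys
        try:
            cfg = {k: min(max(float(item[k]), lo), hi) for k, (lo, hi) in space.items()}
        except (TypeError, ValueError):
            continue  # drop configs with non-numeric values
        configs.append(cfg)
    return configs


def warmstart(
    llm: Callable[[str], str], task: str, space: SearchSpace, n: int = 5
) -> List[Dict[str, float]]:
    return parse_configs(llm(build_warmstart_prompt(task, space, n)), space)[:n]


def fake_llm(prompt: str) -> str:
    # Canned reply standing in for a real model call.
    return (
        '[{"learning_rate": 0.01, "max_depth": 4},'
        ' {"learning_rate": 5.0, "max_depth": 8},'
        ' {"max_depth": 3}]'
    )


space = {"learning_rate": (0.0001, 1.0), "max_depth": (1.0, 10.0)}
configs = warmstart(fake_llm, "gradient boosting on tabular data", space, n=5)
```

Validating and clipping the reply matters in practice because nothing forces a language model to respect the stated ranges or return every key.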

LLM-based discriminative surrogate improves prediction accuracy in few-shot regimes.

Numbers: tested at n={5,10,20,30}; largest gains at n=5

Practical Use: Use the LLM surrogate when you have <10 observations to get better mean predictions; combine later with a GP for calibration if needed.

Evidence Ref: Section 5.1; Figure 3 (NRMSE, R^2)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| End-to-end tuning (average regret) | LLAMBO lowest average regret on public and private/synthetic HPO tasks | GP-DKL, SKOpt (GP), Optuna (TPE), SMAC3 | improvement concentrated in early trials (n < 10) | Bayesmark public + private/synthetic (50 tasks) | Section 7; Figure 7 | Figure 7 |
| Surrogate prediction (NRMSE & R^2) | LLAMBO outperforms baselines in prediction accuracy | GP, SMAC | largest gains at n = 5 observations | evaluated over tasks with n in {5,10,20,30} | Section 5.1; Figure 3 (Top) | Figure 3 |

What To Try In 7 Days

Prompt your LLM (GPT-3.5) for 5–10 warmstart hyperparameter configs and use them instead of random starts.

Use the LLM-based discriminative surrogate for early rounds (when you have <10 runs) to guide the search.

Try the LLM conditional sampler with α ∈ {-0.2, -0.1, 0.01} and pick the α that balances diversity and improvement.
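
This summary does not spell out how α maps to a target score. The sketch below assumes a minimization setting in which the target is placed relative to the observed score range, target = best − α × (worst − best), so a positive α asks for something better than anything seen so far and a negative α asks for a score inside the observed range. The function name `target_score` and the example scores are hypothetical.

```python
from typing import Dict, Sequence


def target_score(scores: Sequence[float], alpha: float) -> float:
    """Assumed rule: target = best - alpha * (worst - best), lower is better."""
    best, worst = min(scores), max(scores)
    return best - alpha * (worst - best)


observed = [0.30, 0.25, 0.40]  # made-up validation losses
targets: Dict[float, float] = {a: target_score(observed, a) for a in (-0.2, -0.1, 0.01)}

for alpha, target in targets.items():
    print(f"alpha={alpha:+.2f} -> ask the sampler for configs scoring about {target:.4f}")
```

Under this assumption, comparing candidates generated for each α shows how far the sampler can be pushed toward improvement before its suggestions lose diversity or plausibility.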

Agent Features

Memory
In-context learning (few-shot examples in prompts)
Tool Use
LLM for warmstarting
LLM as surrogate (discriminative & generative)
LLM for conditional candidate sampling
Frameworks
LLAMBO modular design (components can be integrated separately)
Architectures
LLM (GPT-3.5) used as a model-in-the-loop component

Optimization Features

Inference Optimization
Monte Carlo repeated LLM calls (K=10) with prompt shuffling to estimate uncertainty
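
The repeated-call idea can be sketched as follows: shuffle the few-shot examples, query the model K times, and treat the spread of the answers as an uncertainty estimate. This is an illustration under assumptions, not the paper's code. `mc_predict` and `order_sensitive_stub` are hypothetical names, and the stub replaces a real LLM call with a toy function that overweights the last example so that ordering visibly matters.

```python
import random
import statistics
from typing import Callable, List, Tuple

Example = Tuple[str, float]  # (serialized config, observed score)


def mc_predict(
    llm_predict: Callable[[List[Example], str], float],
    examples: List[Example],
    query: str,
    k: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, std) over k predictions with shuffled example order."""
    rng = random.Random(seed)
    preds = []
    for _ in range(k):
        shuffled = examples[:]  # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        preds.append(llm_predict(shuffled, query))
    std = statistics.stdev(preds) if k > 1 else 0.0
    return statistics.fmean(preds), std


def order_sensitive_stub(examples: List[Example], query: str) -> float:
    # Toy predictor: overweights the last in-context example.
    mean_score = sum(score for _, score in examples) / len(examples)
    return 0.7 * examples[-1][1] + 0.3 * mean_score


examples = [("lr=0.1", 0.1), ("lr=0.01", 0.5), ("lr=0.001", 0.9)]
mean, std = mc_predict(order_sensitive_stub, examples, "lr=0.05", k=10)
print(f"predicted score {mean:.3f} +/- {std:.3f}")
```

Each of the K predictions is a separate model call, which is one reason per-iteration cost is higher than with a conventional surrogate.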

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/uber/bayesmark
HPOBench (referenced in paper)

Risks & Boundaries

Limitations

Higher per-iteration compute and API latency than standard BO; wall-clock time is dominated by LLM calls.

Uncertainty calibration is weaker than Gaussian processes at low sample counts.

When Not To Use

When per-query runtime or API cost is the primary constraint and you cannot afford LLM calls.

On very high-dimensional search spaces where prompt-based ICL may not capture structure.

Failure Modes

Prompt-order sensitivity: model predictions change with example ordering, hurting calibration unless shuffled.

Majority-label bias for generative surrogate when the good/bad split τ is unbalanced.
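
One way to check for the label imbalance behind the second failure mode is to build the good/bad labels and inspect the class fractions before prompting. The sketch assumes τ is the fraction of observations labeled "good" and that lower scores are better; `label_by_quantile`, `minority_fraction`, and the scores are hypothetical.

```python
from typing import List, Sequence


def label_by_quantile(scores: Sequence[float], tau: float) -> List[int]:
    """Label the best tau-fraction (lowest scores) as good (1), the rest bad (0)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must be strictly between 0 and 1")
    n_good = max(1, round(tau * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    good = set(order[:n_good])
    return [1 if i in good else 0 for i in range(len(scores))]


def minority_fraction(labels: Sequence[int]) -> float:
    """Share of the rarer class; small values signal an unbalanced prompt."""
    positive = sum(labels) / len(labels)
    return min(positive, 1.0 - positive)


scores = [0.31, 0.12, 0.45, 0.27, 0.50, 0.22, 0.38, 0.41, 0.19, 0.35]
labels = label_by_quantile(scores, tau=0.2)
print(labels, minority_fraction(labels))
```

With few observations and a small τ, only one or two examples carry the "good" label, which is the situation in which a classifier prompted in context may default to the majority class.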

Core Entities

Models

gpt-3.5-turbo, Gaussian Process, SMAC (RandomForest surrogate), TPE, LLAMBO (this work)

Metrics

normalized regret, NRMSE, R^2, log predictive density (LPD), coverage, sharpness, generalized variance, log-likelihood
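
Two of these metrics can be computed as in the sketch below, which uses common definitions that may differ in detail from the paper's. Normalized regret is taken as the best-so-far score rescaled by reference best and worst scores for the task (assumed known, minimization), and NRMSE is RMSE divided by the range of the true values. The function names and example numbers are hypothetical.

```python
import math
from typing import List, Sequence


def normalized_regret(scores: Sequence[float], y_best: float, y_worst: float) -> List[float]:
    """Per-trial regret of the best-so-far score, scaled to [0, 1] (minimization)."""
    out: List[float] = []
    best_so_far = math.inf
    for y in scores:
        best_so_far = min(best_so_far, y)
        out.append((best_so_far - y_best) / (y_worst - y_best))
    return out


def nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE normalized by the range of the true values (one common convention)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


regret_curve = normalized_regret([0.8, 0.5, 0.6, 0.2], y_best=0.0, y_worst=1.0)
surrogate_error = nrmse([0.0, 2.0], [1.0, 1.0])
print(regret_curve, surrogate_error)
```

Plotting the regret curve against trial number is a direct way to see whether gains are concentrated in the early trials.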

Datasets

Bayesmark, HPOBench, SEER (private), MAGGIC (private), CUTRACT (private), Rosenbrock (synthetic), Griewank (synthetic), KTablet (synthetic)

Benchmarks

Bayesmark, HPOBench

Context Entities

Models

prompt templates for warmstarting, ICL (in-context learning) few-shot prompts, discriminative surrogate via MC/shuffle, generative surrogate as binary classifier