Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.4
Citation Count
11
Why It Matters For Business
LLAMBO can reduce the number of expensive evaluations in hyperparameter tuning by using an LLM for initial guesses, early surrogates, and targeted sampling. The trade-off is higher per-iteration compute and API cost in exchange for fewer total experiments.
Summary TLDR
LLAMBO integrates a large language model (GPT-3.5) into the Bayesian optimization (BO) loop. The paper shows three practical uses: zero-shot warmstarting (suggesting initial configs), an LLM-based surrogate (predicting scores and uncertainty from a few examples), and an LLM conditional sampler (generating candidates for a target objective value). Across 74 tasks (Bayesmark, HPOBench, private and synthetic data), LLAMBO speeds up early search and often lowers regret compared with standard BO tools, at the cost of higher per-iteration compute and somewhat weaker uncertainty calibration than Gaussian processes.
Problem Statement
Bayesian optimization struggles when observations are very sparse. Surrogates and samplers need strong priors or many samples to find good regions fast. Can general-purpose LLMs, using in-context learning (ICL) and prompts, supply priors and few-shot generalization to improve BO components without finetuning?
Main Contribution
Introduce LLAMBO: a modular pipeline that uses an LLM via prompts to warmstart, act as a surrogate, and generate candidate points for BO.
Design prompt formats and ICL recipes for three BO components: zero-shot warmstarting, discriminative/generative surrogate modeling, and target-conditioned candidate sampling.
Systematic empirical study across 74 HPO tasks (Bayesmark, HPOBench, private and synthetic) showing sample-efficiency gains, especially early in search.
Open-source code for reproducing experiments: two GitHub repositories provided.
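The modular design in the list above can be pictured as a loop with three pluggable callables. The sketch below is illustrative only and is not the authors' code: the names `bo_loop`, `warmstart`, `sampler`, and `surrogate` are hypothetical, the stand-ins are plain Python functions rather than LLM calls, and picking the candidate with the highest predicted score is a simplification of an acquisition function.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
History = List[Tuple[Config, float]]


def bo_loop(
    objective: Callable[[Config], float],
    warmstart: Callable[[int], List[Config]],
    sampler: Callable[[History, int], List[Config]],
    surrogate: Callable[[History, Config], float],
    n_init: int = 5,
    n_iter: int = 10,
    n_candidates: int = 8,
) -> History:
    """Maximize `objective`; each component can be swapped independently."""
    # 1. Warmstart: in LLAMBO this role is played by a zero-shot LLM prompt.
    history: History = [(c, objective(c)) for c in warmstart(n_init)]
    for _ in range(n_iter):
        # 2. Propose candidates conditioned on everything observed so far.
        candidates = sampler(history, n_candidates)
        # 3. Score candidates with the surrogate and keep the most promising.
        best = max(candidates, key=lambda c: surrogate(history, c))
        # 4. Spend one expensive evaluation on the chosen candidate.
        history.append((best, objective(best)))
    return history


# Hypothetical stand-ins so the sketch runs without any LLM.
rng = random.Random(0)


def random_warmstart(n: int) -> List[Config]:
    return [{"lr": rng.uniform(0.0, 1.0)} for _ in range(n)]


def perturb_best_sampler(history: History, n: int) -> List[Config]:
    best_cfg, _ = max(history, key=lambda h: h[1])
    return [
        {"lr": min(1.0, max(0.0, best_cfg["lr"] + rng.gauss(0, 0.1)))}
        for _ in range(n)
    ]


def nearest_neighbor_surrogate(history: History, cfg: Config) -> float:
    # Predict the score of the closest already-evaluated config.
    return min(history, key=lambda h: abs(h[0]["lr"] - cfg["lr"]))[1]


def toy_objective(cfg: Config) -> float:
    return -((cfg["lr"] - 0.3) ** 2)  # peak at lr = 0.3


history = bo_loop(
    toy_objective, random_warmstart, perturb_best_sampler, nearest_neighbor_surrogate
)
```

Because each role is passed in as a callable, any one of them could be replaced by an LLM-backed version while the others stay conventional, which is the sense in which the components can be adopted separately.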
Key Findings
Zero-shot LLM warmstarting beats random initializations for HPO tasks.
LLM-based discriminative surrogate improves prediction accuracy in few-shot regimes.
LLM sampler that conditions on a target value finds higher-quality candidates, especially early.
End-to-end LLAMBO yields the best tuning performance across the public and private HPO tasks evaluated.
Results
End-to-end tuning (average regret)
Surrogate prediction (NRMSE & R2)
Surrogate uncertainty (calibration)
Candidate sampling quality (regret of sampled points)
Who Should Care
What To Try In 7 Days
Prompt your LLM (GPT-3.5) for 5–10 warmstart hyperparameter configs and use them instead of random starts.
Use the LLM-based discriminative surrogate in early rounds (when you have fewer than 10 completed runs) to guide the search.
Try the LLM conditional sampler with α ∈ {-0.2, -0.1, 0.01} and pick the α that balances diversity and improvement.
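To make the role of α concrete, the sketch below turns a set of observed scores into the target value that the conditional sampler would be asked to hit. It assumes a minimization objective and the rule y' = s_min − α·(s_max − s_min), with s_min and s_max the best and worst observed scores. That rule is not stated in this summary and is an assumption to verify against the paper; `target_score` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def target_score(observed: Sequence[float], alpha: float) -> float:
    """Target for the sampler under the ASSUMED rule s_min - alpha * (s_max - s_min).

    Assumes lower scores are better (e.g. validation loss).
    """
    if not observed:
        raise ValueError("need at least one observed score")
    s_min, s_max = min(observed), max(observed)
    return s_min - alpha * (s_max - s_min)


# Illustrative losses, not results from the paper.
observed_losses = [0.40, 0.25, 0.30, 0.20]
targets: Dict[float, float] = {
    alpha: target_score(observed_losses, alpha) for alpha in (-0.2, -0.1, 0.01)
}
# Under this rule, alpha = -0.2 gives roughly 0.24 (inside the observed range),
# while alpha = 0.01 gives roughly 0.198 (slightly better than the best seen).
```

Under this assumed rule, negative α requests a score the optimizer has already bracketed, which tends to keep proposals plausible, while positive α requests an improvement beyond the current best.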
Agent Features
Memory
- In-context learning (few-shot examples in prompts)
Tool Use
- LLM for warmstarting
- LLM as surrogate (discriminative & generative)
- LLM for conditional candidate sampling
Frameworks
- LLAMBO modular design (components can be integrated separately)
Architectures
- LLM (GPT-3.5) used as a model-in-the-loop component
Optimization Features
Inference Optimization
- Monte Carlo repeated LLM calls (K=10) with prompt shuffling to estimate uncertainty
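A minimal sketch of the repeated-call idea, assuming the LLM is wrapped in a function that takes the few-shot examples plus a query and returns one numeric prediction. `predict_once` and `order_sensitive_stub` are hypothetical stand-ins, not an actual LLM client; the stub is deliberately sensitive to example order so that shuffling produces spread.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, float]  # (serialized config, observed score)


def mc_predict(
    predict_once: Callable[[Sequence[Example], str], float],
    examples: Sequence[Example],
    query: str,
    k: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, sample std) over k predictions with reshuffled examples."""
    if k < 2:
        raise ValueError("k must be at least 2 to estimate a spread")
    rng = random.Random(seed)
    preds: List[float] = []
    for _ in range(k):
        shuffled = list(examples)
        rng.shuffle(shuffled)  # a fresh example ordering for every call
        preds.append(predict_once(shuffled, query))
    return statistics.mean(preds), statistics.stdev(preds)


def order_sensitive_stub(examples: Sequence[Example], query: str) -> float:
    # Stand-in for an LLM call: later examples get more weight,
    # mimicking a recency effect in the prompt.
    weights = range(1, len(examples) + 1)
    return sum(w * s for w, (_, s) in zip(weights, examples)) / sum(weights)


examples: List[Example] = [
    ("lr=0.1", 0.62),
    ("lr=0.03", 0.71),
    ("lr=0.3", 0.55),
    ("lr=0.01", 0.80),
]
mean_score, spread = mc_predict(order_sensitive_stub, examples, "lr=0.02")
```

The mean serves as the point prediction and the spread as a rough uncertainty estimate. Each of the K calls is a separate LLM request, which is where the extra per-iteration cost comes from.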
Reproducibility
Data Urls
- https://github.com/uber/bayesmark
- HPOBench (referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher per-iteration compute and API latency than standard BO; wall-clock time is dominated by LLM calls.
- Uncertainty calibration is weaker than Gaussian processes at low sample counts.
- Experiments are on relatively low-dimensional HPO tasks; performance on high-dimensional search spaces is untested.
- Results depend on the choice of LLM (paper uses GPT-3.5); other LLMs may behave differently.
When Not To Use
- When per-query runtime or API cost is the primary constraint and you cannot afford LLM calls.
- On very high-dimensional search spaces where prompt-based ICL may not capture structure.
- If strict probabilistic uncertainty quantification is essential for decision making.
Failure Modes
- Prompt-order sensitivity: model predictions change with example ordering, hurting calibration unless shuffled.
- Majority-label bias for generative surrogate when the good/bad split τ is unbalanced.
- Over-confident MC estimates if naive sampling is used without prompt shuffling.
- LLM may propose invalid or out-of-range hyperparameters if instructions are incomplete (lower acceptance rate without proper instructions).
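One way to guard against the last failure mode is to validate every proposed configuration against the search space before evaluating it, and to track the acceptance rate. The sketch below is an assumption about how such a guard could look, not part of LLAMBO; `is_valid`, `filter_proposals`, and the example bounds are hypothetical.

```python
from typing import Dict, List, Tuple

Bounds = Dict[str, Tuple[float, float]]


def is_valid(config: Dict[str, object], bounds: Bounds) -> bool:
    """Accept a config only if it has exactly the expected keys, all in range."""
    if set(config) != set(bounds):
        return False
    for name, (low, high) in bounds.items():
        value = config[name]
        # Reject non-numeric values (bool is excluded even though it is an int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not low <= value <= high:  # also rejects NaN
            return False
    return True


def filter_proposals(
    proposals: List[Dict[str, object]], bounds: Bounds
) -> Tuple[List[Dict[str, object]], float]:
    """Return the valid proposals and the fraction that passed."""
    valid = [c for c in proposals if is_valid(c, bounds)]
    rate = len(valid) / len(proposals) if proposals else 0.0
    return valid, rate


# Illustrative search space and parsed LLM outputs.
bounds: Bounds = {"learning_rate": (1e-5, 1.0), "max_depth": (1, 15)}
proposals: List[Dict[str, object]] = [
    {"learning_rate": 0.05, "max_depth": 6},       # valid
    {"learning_rate": 3.0, "max_depth": 6},        # out of range
    {"learning_rate": 0.05},                       # missing key
    {"learning_rate": 0.05, "max_depth": "deep"},  # wrong type
]
valid, acceptance_rate = filter_proposals(proposals, bounds)
```

A falling acceptance rate is a useful signal that the prompt's instructions about ranges and types need tightening.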
Core Entities
Models
- gpt-3.5-turbo
- Gaussian Process
- SMAC (RandomForest surrogate)
- TPE
- LLAMBO (this work)
Metrics
- normalized regret
- NRMSE
- R^2
- log predictive density (LPD)
- coverage
- sharpness
- generalized variance
- log-likelihood
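For readers unfamiliar with the first three metrics, the sketch below shows one common way to compute them. The exact definitions used in the paper are not given in this summary, so the choices here are assumptions: NRMSE is normalized by the range of the true values, and normalized regret is the gap to the best achievable score divided by the gap between the worst and best achievable scores.

```python
import math
from typing import Sequence


def nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE divided by the range of y_true (one common convention)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return rmse / (max(y_true) - min(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def normalized_regret(
    best_found: float, best_possible: float, worst_possible: float
) -> float:
    """0 means the optimum was found, 1 means no better than the worst score.

    The ratio works for minimization and maximization alike, because the
    numerator and denominator share the same sign.
    """
    return (best_found - best_possible) / (worst_possible - best_possible)


# Illustrative numbers, not results from the paper.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.5, 0.7, 0.9]
surrogate_nrmse = nrmse(y_true, y_pred)
surrogate_r2 = r2(y_true, y_pred)
regret = normalized_regret(best_found=0.25, best_possible=0.20, worst_possible=0.40)
```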
Datasets
- Bayesmark
- HPOBench
- SEER (private)
- MAGGIC (private)
- CUTRACT (private)
- Rosenbrock (synthetic)
- Griewank (synthetic)
- KTablet (synthetic)
Benchmarks
- Bayesmark
- HPOBench
Context Entities
Models
- prompt templates for warmstarting
- ICL (in-context learning) few-shot prompts
- discriminative surrogate via MC/shuffle
- generative surrogate as binary classifier

