Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
HbBoPs reduces the number of expensive LLM API calls needed to find a good static prompt, cutting cost and time for model-driven features that rely on single-prompt deployments.
Summary TLDR
HbBoPs is a practical algorithm for selecting a single, high-performing prompt for black-box LLMs (API-only access). It learns a surrogate model that embeds instructions and few-shot exemplars separately (structural-aware deep-kernel Gaussian Process) and plugs that surrogate into Hyperband (multi-fidelity scheduler over validation instances). Across 10 benchmarks and 3 LLMs, HbBoPs finds better prompts with fewer LLM calls than full-fidelity Bayesian optimization and bandit baselines. Key wins: better anytime performance under tight budgets and robust behavior across common encoder choices.
Problem Statement
Selecting one prompt for a black-box LLM is hard because instructions and exemplars form a combinatorial space, the model gives no gradients, and each prompt evaluation requires many costly API calls on validation instances. The problem is to find the best prompt using as few LLM calls as possible while avoiding noisy decisions from small validation subsamples.
Main Contribution
A structural-aware deep-kernel Gaussian Process that embeds instruction and exemplar components separately and learns a joint low-dimensional latent representation aligned with prompt performance.
The use of Hyperband as a multi-fidelity scheduler where fidelity = number of validation instances, enabling early termination of poor prompts and fewer LLM calls.
HbBoPs: replace Hyperband's random proposals with a Bayesian-Optimization (Expected Improvement) proposal from the trained DK-GP, yielding both sample- and query-efficiency.
Extensive benchmarks on 10 tasks and 3 LLMs showing stronger anytime and final performance than several state-of-the-art baselines, plus ablations and encoder-sensitivity analysis.
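The Expected Improvement proposal named in the third contribution can be illustrated with the standard closed-form EI for a Gaussian posterior, written here for error minimization. This is a minimal sketch, not the authors' code: the function names, the `candidates` mapping, and the assumption that the surrogate returns a posterior mean and standard deviation for each candidate prompt are illustrative.

```python
import math


def expected_improvement(mu, sigma, best_error):
    """Closed-form EI for minimization under a Gaussian posterior.

    mu, sigma: assumed surrogate posterior mean and std of a prompt's error.
    best_error: lowest validation error observed so far.
    """
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(best_error - mu, 0.0)
    z = (best_error - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_error - mu) * cdf + sigma * pdf


def propose(candidates, best_error):
    """Return the id of the candidate prompt with the highest EI.

    candidates: hypothetical dict mapping prompt id -> (mu, sigma).
    """
    return max(
        candidates,
        key=lambda k: expected_improvement(*candidates[k], best_error),
    )
```

In a Hyperband loop, a call such as `propose(candidates, best_error)` would stand in for drawing the next prompt uniformly at random.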
Key Findings
On average, HbBoPs achieved the lowest normalized test error of all compared methods.
HbBoPs improves early (anytime) performance under tight budgets.
Both design components matter: the structural-aware DK-GP and Hyperband together add large gains over vanilla BO.
Performance is robust to choice of off-the-shelf encoder.
Results
normalized test error (avg across tasks & LLMs)
median relative improvement over TRIPLE-SH
encoder sensitivity (normalized test error)
Who Should Care
What To Try In 7 Days
Run HbBoPs on one production task using your usual validation split and compare the best prompt after N LLM calls to your current prompt.
Replace random/full evaluations with a Hyperband schedule and cache prompt-instance outputs to cut repeated LLM calls.
Start with a standard encoder (BERT/MPNet) and the structural prompt split (instruction vs exemplar) before investing in encoder tuning.
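The caching suggestion above can be sketched as a small wrapper that stores one result per (prompt, validation instance) pair, so that re-evaluating a surviving prompt at a higher fidelity only pays for the new instances. This is an assumption-laden sketch: `CachedEvaluator` and the `call_llm_and_score` callback are hypothetical names, and the callback is assumed to return a per-instance loss such as 0.0 or 1.0.

```python
class CachedEvaluator:
    """Caches per-(prompt, instance) losses to avoid repeated LLM calls."""

    def __init__(self, call_llm_and_score):
        # Hypothetical callback: (prompt, instance_id) -> loss as a float.
        self._score = call_llm_and_score
        self._cache = {}
        self.llm_calls = 0  # counts only real (uncached) calls

    def error(self, prompt, instance_ids):
        """Mean loss of `prompt` over `instance_ids`, reusing cached results."""
        losses = []
        for iid in instance_ids:
            key = (prompt, iid)
            if key not in self._cache:
                self._cache[key] = self._score(prompt, iid)
                self.llm_calls += 1
            losses.append(self._cache[key])
        return sum(losses) / len(losses)
```

Evaluating a prompt on instances `[0, 1]` and later on `[0, 1, 2, 3]` costs four calls in total rather than six. The saving holds only if the cached output is still representative, which is why unstable LLM outputs weaken it.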
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on pre-trained encoder embeddings; embeddings add compute overhead and potential blind spots.
- Experiments focus on prompts composed of instruction + few-shot exemplar; other prompt parts (formatting, output guidance) are not evaluated.
- Multi-fidelity benefits assume caching and relatively stable LLM outputs; high stochasticity reduces caching gains.
- Validation-set size matters: using very small validation subsets can produce noisy decisions despite Hyperband.
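To make the validation-size caveat concrete, the binomial standard error gives a rough sense of how noisy an error estimate is at a given fidelity. This sketch assumes a 0/1 loss per instance and independent instances. Neither assumption comes from the summary, and the function name is illustrative.

```python
import math


def error_standard_error(error_rate, n_instances):
    """Binomial standard error of an error rate estimated on n_instances.

    Assumes a 0/1 loss and independent validation instances.
    """
    return math.sqrt(error_rate * (1.0 - error_rate) / n_instances)
```

Under these assumptions, a prompt with a true error of 0.3 has a standard error of about 0.145 on 10 instances and about 0.046 on 100, so two prompts whose errors differ by a few points cannot be ranked reliably at the lowest fidelities.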
When Not To Use
- When you have white-box access and can use gradient-based prompt optimization (different tooling and goals).
- When the available validation set is extremely small (e.g., single-digit instances), because estimates are too noisy.
- If you cannot cache model outputs or if LLM outputs are highly non-deterministic at your operating temperature.
Failure Modes
- If validation subsamples are too small, Hyperband may still discard promising prompts due to noise unless bracketing is tuned.
- Poor-quality embeddings that do not separate prompt effects can weaken the DK-GP surrogate.
- Misconfigured Hyperband parameters (b_min, η) can hurt early-stage exploration or lead to wasted budget.
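The effect of `b_min` and `η` is easier to see by printing the schedule they induce. The sketch below follows the usual Hyperband bracket construction, adapted so that the budget is a number of validation instances. It is a simplified variant under stated assumptions, not the paper's exact schedule: `hyperband_brackets`, `n_valid`, and the integer `eta` are illustrative names and choices.

```python
import math


def hyperband_brackets(n_valid, b_min, eta):
    """List Hyperband brackets as [(num_prompts, num_instances), ...] stages.

    n_valid: total validation instances (maximum fidelity).
    b_min:   instances used at the lowest fidelity.
    eta:     integer halving rate; roughly 1/eta of prompts survive a stage.
    """
    # Largest s such that b_min * eta**s still fits in the validation set.
    s_max = 0
    while b_min * eta ** (s_max + 1) <= n_valid:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # More aggressive brackets (large s) start with more prompts.
        n_prompts = math.ceil((s_max + 1) / (s + 1) * eta ** s)
        stages = []
        for i in range(s + 1):
            survivors = n_prompts // eta ** i
            budget = b_min * eta ** (s_max - s + i)
            stages.append((survivors, budget))
        brackets.append(stages)
    return brackets
```

With `n_valid=80`, `b_min=10`, and `eta=2`, the first bracket is `[(8, 10), (4, 20), (2, 40), (1, 80)]` and the last is `[(4, 80)]`. Raising `eta` or lowering `b_min` discards prompts earlier on fewer instances, which is where the noise-driven failure mode above comes from.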
Core Entities
Models
- Claude 3 Haiku
- Llama 3 8B Instruct
- Mistral 7B Instruct
- BERT
- MPNet
- DistilRoBERTa
Metrics
- normalized validation error
- normalized test error
- total LLM calls (cost metric)
Datasets
- GSM8K
- AI2 ARC
- BIG-bench BBII subset (antonyms, larger animal, negation, second word letter, sentiment, object cou
Benchmarks
- 10 prompt-selection benchmarks (GSM8K, AI2 ARC, 8 BBII tasks)

