Combine a structure-aware GP with Hyperband to find good prompts with far fewer API calls

Overview

Decision Snapshot: Ready For Pilot

The method is practical (it works with API-only LLMs), was evaluated on 10 benchmarks and 3 LLMs, and shows consistent gains; the main added costs are embedding computation and GP training.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra

Links

Abstract / PDF

Why It Matters For Business

HbBoPs reduces the number of expensive LLM API calls needed to find a good static prompt, cutting cost and time for model-driven features that rely on single-prompt deployments.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

HbBoPs is a practical algorithm for selecting a single, high-performing prompt for black-box LLMs (API-only access). It learns a surrogate model that embeds instructions and few-shot exemplars separately (structural-aware deep-kernel Gaussian Process) and plugs that surrogate into Hyperband (multi-fidelity scheduler over validation instances). Across 10 benchmarks and 3 LLMs, HbBoPs finds better prompts with fewer LLM calls than full-fidelity Bayesian optimization and bandit baselines. Key wins: better anytime performance under tight budgets and robust behavior across common encoder choices.
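The surrogate described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the network sizes, the random untrained weights, the RBF kernel, and the placeholder data are all assumptions chosen for illustration, and a real deep-kernel GP would train the networks jointly with the GP (typically by maximizing the marginal likelihood). The sketch only shows the structural idea: instruction and exemplar embeddings pass through separate branches, are merged into a low-dimensional latent, and a GP kernel operates on that latent.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden, out_dim):
    """Return a tiny two-layer ReLU network with random, untrained weights."""
    w1 = rng.normal(0, 1 / np.sqrt(in_dim), (in_dim, hidden))
    w2 = rng.normal(0, 1 / np.sqrt(hidden), (hidden, out_dim))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

EMB = 16                   # assumed embedding size (real encoders are larger)
f_instr = mlp(EMB, 32, 8)  # branch for the instruction embedding
f_exem = mlp(EMB, 32, 8)   # separate branch for the exemplar embedding
f_joint = mlp(16, 32, 4)   # joint projection to a low-dimensional latent

def latent(instr_emb, exem_emb):
    """Embed the two prompt parts separately, then combine them."""
    parts = np.concatenate([f_instr(instr_emb), f_exem(exem_emb)], axis=1)
    return f_joint(parts)

def rbf(a, b, lengthscale=1.0):
    """RBF kernel evaluated on latent features."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(z_train, y_train, z_test, noise=1e-2):
    """Standard GP regression posterior mean and variance (constant mean)."""
    y_bar = y_train.mean()
    k = rbf(z_train, z_train) + noise * np.eye(len(z_train))
    k_s = rbf(z_test, z_train)
    mean = y_bar + k_s @ np.linalg.solve(k, y_train - y_bar)
    var = 1.0 - np.einsum("ij,ji->i", k_s, np.linalg.solve(k, k_s.T))
    return mean, var

# Placeholder data: random vectors stand in for encoder embeddings,
# random numbers stand in for observed validation errors.
instr = rng.normal(size=(20, EMB))
exem = rng.normal(size=(20, EMB))
errors = rng.uniform(0.1, 0.6, size=20)

z = latent(instr, exem)
mean, var = gp_posterior(z[:15], errors[:15], z[15:])
print(mean.shape, var.shape)
```

In practice the posterior mean and variance would feed an acquisition function that proposes which prompt to evaluate next.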

Problem Statement

Selecting one prompt for a black-box LLM is hard because instructions and exemplars form a combinatorial space, the model gives no gradients, and each prompt evaluation requires many costly API calls on validation instances. The problem is to find the best prompt using as few LLM calls as possible while avoiding noisy decisions from small validation subsamples.

Main Contribution

A structural-aware deep-kernel Gaussian Process that embeds instruction and exemplar components separately and learns a joint low-dimensional latent representation aligned with prompt performance.

The use of Hyperband as a multi-fidelity scheduler where fidelity = number of validation instances, enabling early termination of poor prompts and fewer LLM calls.
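To make the multi-fidelity idea concrete, the following sketch computes a standard Hyperband schedule in which the resource is the number of validation instances. The values `min_instances=10`, `max_instances=80`, and `eta=2` are illustrative assumptions, not the paper's settings, and `naive_llm_calls` is a hypothetical helper that counts calls without any reuse of earlier evaluations.

```python
import math

def hyperband_schedule(min_instances, max_instances, eta=2):
    """Return Hyperband brackets as lists of (num_prompts, num_instances) stages."""
    ratio = max_instances / min_instances
    s_max = int(math.floor(math.log(ratio) / math.log(eta) + 1e-9))
    brackets = []
    for s in range(s_max, -1, -1):
        # Number of prompts that start this bracket.
        n = math.ceil((s_max + 1) / (s + 1) * eta**s)
        stages = []
        for i in range(s + 1):
            n_i = n // eta**i  # prompts still alive at stage i
            r_i = min(max_instances, min_instances * eta ** (s_max - s + i))
            stages.append((n_i, r_i))
        brackets.append(stages)
    return brackets

def naive_llm_calls(brackets):
    """Count LLM calls if every stage re-evaluates its prompts from scratch."""
    return sum(n * r for stages in brackets for n, r in stages)

schedule = hyperband_schedule(min_instances=10, max_instances=80, eta=2)
for stages in schedule:
    print(stages)

total_prompts = sum(stages[0][0] for stages in schedule)
print("Hyperband calls:", naive_llm_calls(schedule))
print("Full-fidelity calls for the same prompts:", total_prompts * 80)
```

Each bracket trades breadth for depth: the first one starts many prompts on few instances and keeps only the best fraction at each stage, while the last one evaluates a few prompts on the full validation set.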

Key Findings

On average HbBoPs produced the lowest normalized test error across methods.

Numbers: Avg normalized test error 0.150 vs HDBO 0.185 (Section 5.1)

Practical Use: If you must pick one static prompt under a query budget, HbBoPs yields better final prompts than standard full-fidelity BO baselines on the evaluated tasks.

Evidence Ref: Section 5.1, Figure 1

HbBoPs improves early (anytime) performance under tight budgets.

Numbers: At 0.25 of budget: ≈35% better than HDBO and 24% better than TRIPLE-SH (Section 5.1)

Practical Use: When you have limited API budget, run HbBoPs to get stronger prompts faster instead of waiting for full-fidelity methods.

Evidence Ref: Section 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
normalized test error (avg across tasks & LLMs)	0.150 (HbBoPs)	HDBO 0.185; RS 0.214	-0.035 vs HDBO	averaged over 10 benchmarks and 3 LLMs	Section 5.1, Figure 1	Section 5.1
median relative improvement over TRIPLE-SH	validation +0.121 ; test +0.066 (Claude 3 Haiku, 0.25 budget)	TRIPLE-SH	validation +0.121 ; test +0.066	Table 2 per LLM	Table 2 (Section 5.2)	Table 2
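The Delta column in the first row is a plain difference of the reported averages. The short check below recomputes it and also derives the corresponding relative reduction; the relative figures are derived here from the three reported averages and are not stated in the source.

```python
def compare(method_error, baseline_error):
    """Return (absolute delta, relative reduction) of method vs. baseline."""
    delta = method_error - baseline_error
    return delta, -delta / baseline_error

HBBOPS = 0.150  # reported average normalized test error
BASELINES = {"HDBO": 0.185, "RS": 0.214}

for name, base in BASELINES.items():
    delta, rel = compare(HBBOPS, base)
    print(f"{name}: delta={delta:+.3f}, relative reduction={rel:.1%}")
# HDBO: delta=-0.035, relative reduction=18.9%
# RS: delta=-0.064, relative reduction=29.9%
```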

What To Try In 7 Days

Run HbBoPs on one production task using your usual validation split and compare the best prompt after N LLM calls to your current prompt.

Replace random/full evaluations with a Hyperband schedule and cache prompt-instance outputs to cut repeated LLM calls.

Start with a standard encoder (BERT/MPNet) and the structural prompt split (instruction vs exemplar) before investing in encoder tuning.
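The caching suggestion above can be sketched as a small memoizing wrapper. Everything here is hypothetical scaffolding: `CachedEvaluator` and `fake_score` are illustrative names, and `fake_score` stands in for a real function that would send one prompt-instance pair to the LLM and return a 0/1 loss.

```python
from typing import Callable, Dict, List, Tuple

class CachedEvaluator:
    """Memoize per-(prompt, instance) losses so repeats cost no LLM calls."""

    def __init__(self, score_fn: Callable[[str, int], float]):
        self.score_fn = score_fn  # assumed: one LLM call per invocation
        self.cache: Dict[Tuple[str, int], float] = {}
        self.llm_calls = 0

    def error(self, prompt: str, instance_ids: List[int]) -> float:
        """Mean loss of `prompt` on the given validation instances."""
        losses = []
        for idx in instance_ids:
            key = (prompt, idx)
            if key not in self.cache:
                self.cache[key] = self.score_fn(prompt, idx)
                self.llm_calls += 1
            losses.append(self.cache[key])
        return sum(losses) / len(losses)

def fake_score(prompt: str, idx: int) -> float:
    """Deterministic stand-in for an LLM call that returns a 0/1 loss."""
    return float((len(prompt) + idx) % 3 == 0)

ev = CachedEvaluator(fake_score)
ev.error("Solve step by step.", list(range(10)))  # 10 new calls
ev.error("Solve step by step.", list(range(20)))  # only 10 more calls
print(ev.llm_calls)  # 20
```

The cache pays off when a prompt that survives a Hyperband stage is re-evaluated on a larger instance set that contains the earlier one, since only the new instances trigger calls.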

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relies on pre-trained encoder embeddings; embeddings add compute overhead and potential blind spots.

Experiments focus on prompts composed of instruction + few-shot exemplar; other prompt parts (formatting, output guidance) are not evaluated.

When Not To Use

When you have white-box access and can use gradient-based prompt optimization (different tooling and goals).

When the available validation set is extremely small (e.g., single-digit instances), because estimates are too noisy.
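A rough way to see why very small validation sets are a problem is to look at the standard error of an estimated error rate. The sketch below assumes per-instance losses are independent 0/1 outcomes, which is a simplification; the true error rate of 0.3 and the instance counts are illustrative values, not figures from the paper.

```python
import math

def standard_error(error_rate, n_instances):
    """Standard error of a mean of n independent 0/1 losses."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n_instances)

for n in (5, 10, 40, 160):
    se = standard_error(0.3, n)
    print(f"n={n:>3}: estimated error 0.30 +/- {se:.3f}")
```

Because the standard error shrinks only with the square root of the instance count, two prompts whose true error rates differ by a few points are hard to tell apart on a handful of instances.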

Failure Modes

If validation subsamples are too small, Hyperband may still discard promising prompts due to noise unless bracketing is tuned.

Poor-quality embeddings that do not separate prompt effects can weaken the DK-GP surrogate.

Core Entities

Models

Claude 3 Haiku, LLAMA3 8B Instruct, Mistral 7B Instruct, BERT, MPNet, DistilRoBERTa

Metrics

normalized validation error, normalized test error, total LLM calls (cost metric)

Datasets

GSM8K, AI2 ARC, BIG-bench BBII subset (antonyms, larger animal, negation, second word letter, sentiment, object cou

Benchmarks

10 prompt-selection benchmarks (GSM8K, AI2 ARC, 8 BBII tasks)
