Combine a structure-aware GP with Hyperband to find good prompts with far fewer API calls

December 10, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical (works with API-only LLMs), evaluated on many tasks and LLMs, and shows consistent gains; costs are mainly extra embedding computation and GP training.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra

Links

Abstract / PDF

Why It Matters For Business

HbBoPs reduces the number of expensive LLM API calls needed to find a good static prompt, cutting cost and time for model-driven features that rely on single-prompt deployments.

Who Should Care

Summary TLDR

HbBoPs is a practical algorithm for selecting a single, high-performing prompt for black-box LLMs (API-only access). It learns a surrogate model that embeds instructions and few-shot exemplars separately (structural-aware deep-kernel Gaussian Process) and plugs that surrogate into Hyperband (multi-fidelity scheduler over validation instances). Across 10 benchmarks and 3 LLMs, HbBoPs finds better prompts with fewer LLM calls than full-fidelity Bayesian optimization and bandit baselines. Key wins: better anytime performance under tight budgets and robust behavior across common encoder choices.
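To make the "embed separately, then join" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class name, layer sizes, random untrained weights, toy embeddings, and the RBF kernel are all assumptions chosen for illustration. In the method as summarized above, the weights would be learned together with the GP so that the latent space aligns with prompt performance; that training step is omitted here.

```python
import math
import random


def relu_layer(x, weights):
    """Apply a weight matrix (list of rows) to vector x, followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def random_matrix(rows, cols, rng):
    """Hypothetical initialization; real weights would be trained, not random."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class StructuredFeatureExtractor:
    """Separate branches for instruction and exemplar embeddings, then a joint map."""

    def __init__(self, embed_dim, hidden_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.w_instruction = random_matrix(hidden_dim, embed_dim, rng)
        self.w_exemplar = random_matrix(hidden_dim, embed_dim, rng)
        self.w_joint = random_matrix(latent_dim, 2 * hidden_dim, rng)

    def latent(self, instruction_emb, exemplar_emb):
        # Each prompt component goes through its own branch before being joined.
        hidden = (relu_layer(instruction_emb, self.w_instruction)
                  + relu_layer(exemplar_emb, self.w_exemplar))
        # Joint linear map to a low-dimensional latent vector.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w_joint]


def rbf_kernel(z1, z2, lengthscale=1.0):
    """GP covariance between two prompts, computed on their latent vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


# Toy 4-dimensional "encoder outputs": same instruction, different exemplars.
extractor = StructuredFeatureExtractor(embed_dim=4, hidden_dim=3, latent_dim=2)
prompt_a = ([0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2])
prompt_b = ([0.1, 0.2, 0.3, 0.4], [0.9, 0.7, 0.3, 0.1])
z_a = extractor.latent(*prompt_a)
z_b = extractor.latent(*prompt_b)
k_ab = rbf_kernel(z_a, z_b)
```

A GP surrogate would use kernel values such as k_ab to predict the error of prompts that have not been evaluated yet, based on those that have.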

Problem Statement

Selecting one prompt for a black-box LLM is hard because instructions and exemplars form a combinatorial space, the model gives no gradients, and each prompt evaluation requires many costly API calls on validation instances. The problem is to find the best prompt using as few LLM calls as possible while avoiding noisy decisions from small validation subsamples.
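The noise issue can be illustrated with a short calculation. The sketch below assumes a 0/1 error per validation instance, independent sampling of instances, and a hypothetical true error rate of 0.3; none of these values come from the paper.

```python
import math


def error_standard_error(error_rate, n_instances):
    """Standard error of a mean 0/1 error estimated on n independent instances."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n_instances)


# Hypothetical true error rate of 0.3, evaluated on growing validation subsamples.
for n in (5, 20, 80, 320):
    print(n, round(error_standard_error(0.3, n), 3))
```

Under these assumptions, quadrupling the number of instances only halves the standard error, so two prompts whose true errors differ by a few points are hard to tell apart on a handful of instances.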

Main Contribution

A structural-aware deep-kernel Gaussian Process that embeds instruction and exemplar components separately and learns a joint low-dimensional latent representation aligned with prompt performance.

The use of Hyperband as a multi-fidelity scheduler where fidelity = number of validation instances, enabling early termination of poor prompts and fewer LLM calls.
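The scheduling idea in the second contribution can be sketched as follows. This is an illustration using the standard Hyperband bracket formulas with a reduction factor of 3, not the authors' code: the function names, the 10/90 instance limits, and the evaluate callback are hypothetical. In HbBoPs the candidate prompts would be proposed by the surrogate; here they are simply passed in, and each stage is assumed to re-evaluate survivors from scratch.

```python
import math


def hyperband_brackets(max_instances, min_instances, eta=3):
    """Return brackets; each bracket is a list of (n_prompts, n_instances) stages."""
    s_max = 0
    while min_instances * eta ** (s_max + 1) <= max_instances:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        # More aggressive brackets start with more prompts on fewer instances.
        n_prompts = math.ceil((s_max + 1) / (s + 1) * eta ** s)
        start = max_instances // eta ** s
        stages = [(n_prompts // eta ** i, min(max_instances, start * eta ** i))
                  for i in range(s + 1)]
        brackets.append(stages)
    return brackets


def run_bracket(prompts, stages, evaluate):
    """Successive halving: keep the lowest-error prompts at each stage."""
    survivors = list(prompts)[: stages[0][0]]
    for idx, (_, n_instances) in enumerate(stages):
        ranked = sorted(survivors, key=lambda p: evaluate(p, n_instances))
        n_keep = stages[idx + 1][0] if idx + 1 < len(stages) else 1
        survivors = ranked[:n_keep]
    return survivors[0]


def total_calls(brackets):
    """LLM calls if every stage evaluates each surviving prompt on all its instances."""
    return sum(n * r for stages in brackets for n, r in stages)


schedule = hyperband_brackets(max_instances=90, min_instances=10, eta=3)

# Hypothetical evaluator: prompt "p3" has error 0.3 regardless of subsample size.
toy_prompts = [f"p{i}" for i in range(9)]
best = run_bracket(toy_prompts, schedule[0], lambda p, n: int(p[1:]) / 10)
```

With these toy settings the schedule covers 17 prompts across three brackets, and only a few of them ever reach all 90 validation instances; the rest are dropped after 10 or 30.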

Key Findings

On average HbBoPs produced the lowest normalized test error across methods.

Numbers: Average normalized test error 0.150 vs HDBO 0.185 (Section 5.1)

Practical Use: If you must pick one static prompt under a query budget, HbBoPs yields better final prompts than standard full-fidelity BO baselines on the evaluated tasks.

Evidence Ref: Section 5.1, Figure 1

HbBoPs improves early (anytime) performance under tight budgets.

Numbers: At 0.25 of budget: ≈35% better than HDBO and 24% better than TRIPLE-SH (Section 5.1)

Practical Use: When you have limited API budget, run HbBoPs to get stronger prompts faster instead of waiting for full-fidelity methods.

Evidence Ref: Section 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
normalized test error (avg across tasks & LLMs) | 0.150 (HbBoPs) | HDBO 0.185; RS 0.214 | -0.035 vs HDBO | averaged over 10 benchmarks and 3 LLMs | Section 5.1, Figure 1 | Section 5.1
median relative improvement over TRIPLE-SH | validation +0.121; test +0.066 (Claude 3 Haiku, 0.25 budget) | TRIPLE-SH | validation +0.121; test +0.066 | Table 2 per LLM | Table 2 (Section 5.2) | Table 2
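The delta in the first row is a plain difference of the two averages. The short check below recomputes it and also derives the corresponding relative reductions; the relative figures are computed here from the reported averages and are not numbers stated in the source. The helper name is hypothetical.

```python
def deltas(method_error, baseline_error):
    """Return (absolute, relative) change of method error versus a baseline."""
    absolute = method_error - baseline_error
    relative = absolute / baseline_error
    return absolute, relative


# Reported averages: HbBoPs 0.150, HDBO 0.185, RS 0.214.
abs_hdbo, rel_hdbo = deltas(0.150, 0.185)
abs_rs, rel_rs = deltas(0.150, 0.214)
print(f"vs HDBO: {abs_hdbo:+.3f} absolute, {rel_hdbo:+.1%} relative")
print(f"vs RS:   {abs_rs:+.3f} absolute, {rel_rs:+.1%} relative")
```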

What To Try In 7 Days

Run HbBoPs on one production task using your usual validation split and compare the best prompt after N LLM calls to your current prompt.

Replace random/full evaluations with a Hyperband schedule and cache prompt-instance outputs to cut repeated LLM calls.

Start with a standard encoder (BERT/MPNet) and the structural prompt split (instruction vs exemplar) before investing in encoder tuning.
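The caching suggestion above can be sketched as a small wrapper around the LLM call. Everything here is hypothetical scaffolding: the class, the fake_llm_is_correct stand-in, and the assumptions that LLM outputs are deterministic for a given prompt and instance and that larger validation subsets contain the smaller ones.

```python
class CachedEvaluator:
    """Caches per-(prompt, instance) results so repeated evaluations cost nothing."""

    def __init__(self, call_llm_is_correct):
        # call_llm_is_correct(prompt, instance) -> bool; stands in for a real API call.
        self._call = call_llm_is_correct
        self._cache = {}
        self.llm_calls = 0

    def error(self, prompt, instances):
        """Error rate of a prompt on the given instances, calling the LLM only on cache misses."""
        wrong = 0
        for instance in instances:
            key = (prompt, instance)
            if key not in self._cache:
                self._cache[key] = self._call(prompt, instance)
                self.llm_calls += 1
            if not self._cache[key]:
                wrong += 1
        return wrong / len(instances)


def fake_llm_is_correct(prompt, instance):
    """Toy stand-in for an LLM call plus answer check; not a real API."""
    return (len(prompt) + instance) % 3 != 0


evaluator = CachedEvaluator(fake_llm_is_correct)
validation_ids = list(range(30))
err_small = evaluator.error("p", validation_ids[:10])  # 10 new calls
err_full = evaluator.error("p", validation_ids)        # only 20 additional calls
```

When a prompt survives to a higher fidelity, only the instances it has not yet seen trigger new calls, so a promoted prompt costs the difference between stages rather than the full stage size.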

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Relies on pre-trained encoder embeddings; embeddings add compute overhead and potential blind spots.

Experiments focus on prompts composed of instruction + few-shot exemplar; other prompt parts (formatting, output guidance) are not evaluated.

When Not To Use

When you have white-box access and can use gradient-based prompt optimization (different tooling and goals).

When the available validation set is extremely small (e.g., single-digit instances), because estimates are too noisy.

Failure Modes

If validation subsamples are too small, Hyperband may still discard promising prompts due to noise unless bracketing is tuned.

Poor-quality embeddings that do not separate prompt effects can weaken the DK-GP surrogate.

Core Entities

Models

Claude 3 Haiku, LLAMA3 8B Instruct, Mistral 7B Instruct, BERT, MPNet, DistilRoBERTa

Metrics

normalized validation error, normalized test error, total LLM calls (cost metric)

Datasets

GSM8K, AI2 ARC, BIG-bench BBII subset (antonyms, larger animal, negation, second word letter, sentiment, object cou

Benchmarks

10 prompt-selection benchmarks (GSM8K, AI2 ARC, 8 BBII tasks)