Combine a structure-aware GP with Hyperband to find good prompts with far fewer API calls

December 10, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra

Links

Abstract / PDF

Why It Matters For Business

HbBoPs reduces the number of expensive LLM API calls needed to find a good static prompt, cutting cost and time for model-driven features that rely on single-prompt deployments.

Summary TLDR

HbBoPs is a practical algorithm for selecting a single, high-performing prompt for black-box LLMs (API-only access). It learns a surrogate model that embeds instructions and few-shot exemplars separately (structural-aware deep-kernel Gaussian Process) and plugs that surrogate into Hyperband (multi-fidelity scheduler over validation instances). Across 10 benchmarks and 3 LLMs, HbBoPs finds better prompts with fewer LLM calls than full-fidelity Bayesian optimization and bandit baselines. Key wins: better anytime performance under tight budgets and robust behavior across common encoder choices.

Problem Statement

Selecting one prompt for a black-box LLM is hard because instructions and exemplars form a combinatorial space, the model gives no gradients, and each prompt evaluation requires many costly API calls on validation instances. The problem is to find the best prompt using as few LLM calls as possible while avoiding noisy decisions from small validation subsamples.

Main Contribution

A structural-aware deep-kernel Gaussian Process that embeds instruction and exemplar components separately and learns a joint low-dimensional latent representation aligned with prompt performance.

The use of Hyperband as a multi-fidelity scheduler where fidelity = number of validation instances, enabling early termination of poor prompts and fewer LLM calls.

HbBoPs: replace Hyperband's random proposals with a Bayesian-Optimization (Expected Improvement) proposal from the trained DK-GP, yielding both sample- and query-efficiency.

Extensive benchmarks on 10 tasks and 3 LLMs showing stronger anytime and final performance than several state-of-the-art baselines, plus ablations and encoder-sensitivity analysis.
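The multi-fidelity idea above, where fidelity is the number of validation instances, can be illustrated with a single successive-halving bracket, the building block that Hyperband repeats with different starting budgets. This is a minimal sketch and not the authors' implementation: `toy_evaluate`, `TRUE_ERROR`, and the default values of `b_min` and `eta` are hypothetical stand-ins, and the real method would choose the starting prompts with the DK-GP surrogate rather than evaluate a fixed pool.

```python
import random

def successive_halving(prompts, evaluate, n_instances, b_min=10, eta=2.0):
    """Keep the best 1/eta of the prompts each round while multiplying
    the number of validation instances per prompt by eta."""
    survivors = list(prompts)
    budget = b_min
    total_calls = 0  # counted without caching; a cache would reuse earlier instances
    while True:
        budget = min(budget, n_instances)
        # Score each surviving prompt on `budget` validation instances.
        scores = {p: evaluate(p, budget) for p in survivors}
        total_calls += budget * len(survivors)
        if len(survivors) == 1 or budget == n_instances:
            break
        keep = max(1, int(len(survivors) / eta))
        survivors = sorted(survivors, key=scores.get)[:keep]  # lower error is better
        budget = int(budget * eta)
    best = min(survivors, key=scores.get)
    return best, total_calls

# Hypothetical ground truth: prompt_0 has the lowest error.
TRUE_ERROR = {f"prompt_{i}": 0.1 + 0.05 * i for i in range(8)}

def toy_evaluate(prompt, n):
    """Stand-in for n LLM calls: true error plus noise that shrinks with n."""
    rng = random.Random(f"{prompt}-{n}")  # string seed keeps the toy deterministic
    return TRUE_ERROR[prompt] + rng.uniform(-0.5, 0.5) / n

best, calls = successive_halving(list(TRUE_ERROR), toy_evaluate, n_instances=100)
print(best, calls)
```

In this toy setup the bracket spends 320 simulated calls, compared with 800 for evaluating all eight prompts on all 100 instances. The savings come entirely from discarding weak prompts at low fidelity, which is also why noisy low-fidelity scores are the main risk.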

Key Findings

On average HbBoPs produced the lowest normalized test error across methods.

Numbers: Avg normalized test error 0.150 vs HDBO 0.185 (Section 5.1)

HbBoPs improves early (anytime) performance under tight budgets.

Numbers: At 0.25 of budget: ≈35% better than HDBO and 24% better than TRIPLE-SH (Section 5.1)

Both design pieces matter: structural-aware DK-GP + Hyperband add large gains over vanilla BO.

Numbers: Ablations show HbBoPs improves over vanilla BO by ~66% (0.5 budget) and ~67% (1.0 budget) in normalized validation error (Sec.

Performance is robust to choice of off-the-shelf encoder.

Numbers: Similar test error across encoders (BERT 0.150, MPNet 0.158, DistilRoBERTa 0.150 at full budget) (Table 3).

Results

normalized test error (avg across tasks & LLMs)

Value: 0.150 (HbBoPs)

Baseline: HDBO 0.185; RS 0.214

median relative improvement over TRIPLE-SH

Value: validation +0.121; test +0.066 (Claude 3 Haiku, 0.25 budget)

Baseline: TRIPLE-SH

encoder sensitivity (normalized test error)

Value: BERT 0.150, MPNet 0.158, DistilRoBERTa 0.150 (full budget)

Baseline

Who Should Care

What To Try In 7 Days

Run HbBoPs on one production task using your usual validation split and compare the best prompt after N LLM calls to your current prompt.

Replace random/full evaluations with a Hyperband schedule and cache prompt-instance outputs to cut repeated LLM calls.

Start with a standard encoder (BERT/MPNet) and the structural prompt split (instruction vs exemplar) before investing in encoder tuning.
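The caching suggestion above can be sketched as a small memoizing wrapper: once a prompt has been scored on the first k validation instances, raising its fidelity only pays for the new instances. This is an illustrative sketch under stated assumptions. `CachedScorer`, `call_llm`, and `fake_llm` are hypothetical names, and the cache is only valid if the LLM output for a given prompt and instance is stable (for example, at temperature 0).

```python
from typing import Callable, Dict, Tuple

class CachedScorer:
    """Memoize per-(prompt, instance) losses so repeated or higher-fidelity
    evaluations of the same prompt do not repeat LLM calls."""

    def __init__(self, call_llm: Callable[[str, int], float]):
        # Hypothetical callable: runs the LLM on one validation instance
        # and returns a loss (e.g. 0.0 for correct, 1.0 for incorrect).
        self.call_llm = call_llm
        self.cache: Dict[Tuple[str, int], float] = {}
        self.n_calls = 0  # number of real (non-cached) LLM calls

    def loss(self, prompt: str, instance_id: int) -> float:
        key = (prompt, instance_id)
        if key not in self.cache:
            self.cache[key] = self.call_llm(prompt, instance_id)
            self.n_calls += 1
        return self.cache[key]

    def mean_error(self, prompt: str, n_instances: int) -> float:
        """Average loss over the first n_instances validation instances."""
        return sum(self.loss(prompt, i) for i in range(n_instances)) / n_instances

# Toy stand-in for an LLM call, used only to make the sketch runnable.
def fake_llm(prompt: str, instance_id: int) -> float:
    return float((len(prompt) + instance_id) % 3 == 0)

scorer = CachedScorer(fake_llm)
scorer.mean_error("Answer step by step.", 10)  # 10 real calls
scorer.mean_error("Answer step by step.", 20)  # only 10 new calls
print(scorer.n_calls)
```

Evaluating instances in a fixed order, as done here, is what lets a higher fidelity reuse everything computed at a lower one.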

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on pre-trained encoder embeddings; embeddings add compute overhead and potential blind spots.
  • Experiments focus on prompts composed of instruction + few-shot exemplar; other prompt parts (formatting, output guidance) are not evaluated.
  • Multi-fidelity benefits assume caching and relatively stable LLM outputs; high stochasticity reduces caching gains.
  • Validation-set size matters: using very small validation subsets can produce noisy decisions despite Hyperband.

When Not To Use

  • When you have white-box access and can use gradient-based prompt optimization (different tooling and goals).
  • When the available validation set is extremely small (e.g., single-digit instances), because estimates are too noisy.
  • If you cannot cache model outputs or if LLM outputs are highly non-deterministic at your operating temperature.

Failure Modes

  • If validation subsamples are too small, Hyperband may still discard promising prompts due to noise unless bracketing is tuned.
  • Poor-quality embeddings that do not separate prompt effects can weaken the DK-GP surrogate.
  • Misconfigured Hyperband parameters (b_min, η) can hurt early-stage exploration or lead to wasted budget.
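
To see why the Hyperband parameters matter, it can help to print the schedule they imply before spending any budget. The sketch below follows the standard Hyperband bracket formulas (number of configurations and per-configuration resource in each stage); whether HbBoPs uses exactly this schedule is an assumption, and `hyperband_brackets` and `b_max` (the full validation-set size) are hypothetical names.

```python
import math

def hyperband_brackets(b_min, b_max, eta=2.0):
    """Return, per bracket, a list of (n_prompts, n_validation_instances)
    stages following the standard Hyperband schedule."""
    # Largest s such that b_min * eta**s still fits within b_max.
    s_max = 0
    while b_min * eta ** (s_max + 1) <= b_max:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        # Initial number of prompts in this bracket.
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))
        stages = []
        for i in range(s + 1):
            n_i = int(n / eta ** i)            # prompts kept at stage i
            b_i = int(b_max / eta ** (s - i))  # instances per prompt at stage i
            stages.append((n_i, b_i))
        brackets.append(stages)
    return brackets

for stages in hyperband_brackets(b_min=10, b_max=80, eta=2.0):
    print(stages)
```

With `b_min=10`, `b_max=80`, and `eta=2`, the most aggressive bracket starts 8 prompts on 10 instances each, while the most conservative one evaluates 4 prompts on all 80 instances. A smaller `b_min` or larger `eta` shifts more budget toward early, noisier eliminations.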

Core Entities

Models

  • Claude 3 Haiku
  • Llama 3 8B Instruct
  • Mistral 7B Instruct
  • BERT
  • MPNet
  • DistilRoBERTa

Metrics

  • normalized validation error
  • normalized test error
  • total LLM calls (cost metric)

Datasets

  • GSM8K
  • AI2 ARC
  • BIG-bench BBII subset (antonyms, larger animal, negation, second word letter, sentiment, object cou

Benchmarks

  • 10 prompt-selection benchmarks (GSM8K, AI2 ARC, 8 BBII tasks)