Use a feature-based prompt space plus a Knowledge-Gradient policy to find strong prompts in 30 or fewer costly LLM evaluations

January 7, 20257 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.65

Citation Count

0

Authors

Shuyang Wang, Somayeh Moazeni, Diego Klabjan

Links

Abstract / PDF

Why It Matters For Business

SOPL finds better human-readable prompts with far fewer costly LLM evaluations by modeling prompt features and choosing experiments adaptively, lowering API costs and time for deploying LLM-based features.

Summary TLDR

The authors present SOPL, a feature-based, Bayesian sequential learning framework for automated prompt engineering that uses the Knowledge-Gradient (KG) policy to pick which prompts to evaluate next. SOPL models prompts as interpretable feature vectors (template, examples, roles, paraphrase, tone), learns correlations across features with Bayesian regression, and computes KG decisions via mixed-integer conic optimization. On 13 challenging instruction-induction tasks evaluated with GPT-3.5, SOPL with KG finds higher-quality prompts within 30 or fewer evaluations and is especially advantageous when model outputs are highly sensitive to prompt choices.

Problem Statement

Manual prompt design is slow and brittle because LLM outputs vary widely with small prompt changes. Existing automated methods either search a fixed candidate set or require many evaluations. Real applications often allow only a small number of expensive evaluations. The paper asks: how to efficiently find high-quality, human-readable prompts from a large constrained feature space when evaluation budget is limited?

Main Contribution

A feature-based, interpretable prompt representation capturing template, demonstrations, roles, paraphrasing, and tone to expand the search space beyond enumerated candidates.

A Bayesian regression model that shares information across prompts and supports correlated beliefs and priors for feature effects.

Applying the forward-looking Knowledge-Gradient (KG) policy, computed via mixed-integer conic optimization, to select prompts sequentially under limited evaluation budgets; empirical results show KG outperforms baselines on instruction-induction tasks.

Key Findings

SOPL using KG achieves the highest average test accuracy across 13 challenging tasks.

NumbersSOPL-KG mean test score 0.6281 vs EvoPrompt 0.5900 (Table 2).

SOPL-KG shows a consistent relative improvement over EvoPrompt and other baselines.

Numbers6.47% higher average test score vs EvoPrompt and 11.99% vs TRIPLE (Table 2).

KG retains advantages with fewer evaluations and supports early stopping.

NumbersWith N=20 iterations SOPL-KG mean 0.6174; with early stop τ=10 test score 0.6060 using ~16.9 steps (Tables 3 and 4).

KG advantage grows when LLM outputs are sensitive to prompt features.

NumbersCorrelation between coefficient of variation and KG advantage: ρ=0.8901 vs TS, ρ=0.7789 vs Greedy (Section 6.2).

Enriching prompt representation with multiple features improves final prompts over searching only one feature.

NumbersRemoving features reduces average test score for hard tasks orthography_starts_with and rhymes (Figure 7, Section 6.3).

Results

Average test score (13 tasks)

Value0.6281

BaselineEvoPrompt 0.5900

SOPL-KG mean test score (N=20)

Value0.6174

BaselineSOPL-KG (N=30) 0.6281

SOPL-KG mean test score (N=10)

Value0.5800

BaselineEvoPrompt (N=10) 0.5625

Early stopping performance (τ=10)

Value0.6060 (mean)

BaselineSOPL-KG N=30 mean 0.6281

Who Should Care

What To Try In 7 Days

Define 4–6 interpretable prompt features for one task (template, examples, role, paraphrase, tone).

Build a small validation set and run a basic Bayesian update + greedy/TS baseline to measure variance across prompts.

Run SOPL-KG (or TS if KG solver unavailable) for N≈20–30 evaluations and compare final test accuracy vs current prompts.

Reproducibility

Data Urls

  • Instruction Induction dataset (Honovich et al., 2022)

Data Available

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Evaluated only on instruction-induction tasks; other task types may behave differently.
  • Experiments use GPT-3.5 only, so transfer to other LLMs is untested.
  • Requires designing discrete interpretable features; poor feature choices limit performance.
  • KG decisions require solving mixed-integer conic problems, which adds computational overhead.

When Not To Use

  • If prompt landscape is flat (LLM insensitive), greedy search suffices and KG adds overhead.
  • When you can run thousands of cheap evaluations, simpler population methods may match performance.
  • If you require white-box gradient methods (you have model weights) for token-level tuning.

Failure Modes

  • Overfitting validation set: best validation prompt may not generalize if validation is small or unrepresentative.
  • Poor priors or missing features can mislead Bayesian updates and waste evaluations.
  • Optimization solver may be slow on very large or highly constrained feature spaces.

Core Entities

Models

  • GPT-3.5

Metrics

  • Accuracy

Datasets

  • Instruction Induction dataset (Honovich et al., 2022)

Benchmarks

  • EvoPrompt
  • TRIPLE
  • Thompson Sampling
  • Greedy