Use a feature-based prompt space plus a Knowledge-Gradient policy to find strong prompts in 30 or fewer costly LLM evaluations

Overview

Decision Snapshot: Needs Validation

The method is well validated on 13 tough instruction-induction tasks with repeated trials; gains are empirical and task-dependent, and results rely on GPT-3.5 evaluations and specific feature choices.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 55%

Authors

Shuyang Wang, Somayeh Moazeni, Diego Klabjan

Why It Matters For Business

SOPL finds better human-readable prompts with far fewer costly LLM evaluations by modeling prompt features and choosing experiments adaptively, lowering API costs and time for deploying LLM-based features.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors present SOPL, a feature-based, Bayesian sequential learning framework for automated prompt engineering that uses the Knowledge-Gradient (KG) policy to pick which prompts to evaluate next. SOPL models prompts as interpretable feature vectors (template, examples, roles, paraphrase, tone), learns correlations across features with Bayesian regression, and computes KG decisions via mixed-integer conic optimization. On 13 challenging instruction-induction tasks evaluated with GPT-3.5, SOPL with KG finds higher-quality prompts within 30 or fewer evaluations and is especially advantageous when model outputs are highly sensitive to prompt choices.
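To make the KG decision concrete, the following is a minimal sketch for a small, fully enumerated candidate set under a Gaussian belief on linear feature weights. It is a simplification under stated assumptions, not the paper's procedure: the paper computes KG decisions via mixed-integer conic optimization, whereas this sketch enumerates candidates and estimates the expectation with antithetic Monte Carlo samples. The names `kg_values`, `X`, `mu`, `Sigma`, and `noise_var` are illustrative and do not come from the paper.

```python
import numpy as np

def kg_values(X, mu, Sigma, noise_var, n_samples=2000, seed=0):
    """Estimate the Knowledge-Gradient value of evaluating each candidate row of X.

    Assumed belief: theta ~ N(mu, Sigma); evaluating x returns x @ theta + noise,
    with noise ~ N(0, noise_var).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    z = np.concatenate([z, -z])            # antithetic pairs, so the sample mean of z is 0
    a = X @ mu                             # current predicted score of every candidate
    best_now = a.max()
    XS = X @ Sigma                         # row j holds x_j^T Sigma
    kg = np.empty(len(X))
    for i, x in enumerate(X):
        pred_var = XS[i] @ x + noise_var   # predictive variance of an observation at x
        b = XS @ x / np.sqrt(pred_var)     # shift in each candidate's posterior mean per unit z
        # Expected best posterior mean after one evaluation at x, minus the current best.
        future_best = np.max(a[:, None] + b[:, None] * z[None, :], axis=0)
        kg[i] = future_best.mean() - best_now
    return kg

# Toy example: two binary features, one-hot encoded (hypothetical candidates).
X = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1]], dtype=float)
mu, Sigma = np.zeros(4), np.eye(4)
kg = kg_values(X, mu, Sigma, noise_var=0.05)
next_idx = int(np.argmax(kg))              # candidate to evaluate next
```

The candidate with the largest KG value is the one whose evaluation is expected to raise the best predicted score the most, which is different from simply evaluating the prompt that currently looks best.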

Problem Statement

Manual prompt design is slow and brittle because LLM outputs vary widely with small prompt changes. Existing automated methods either search a fixed candidate set or require many evaluations. Real applications often allow only a small number of expensive evaluations. The paper asks: how to efficiently find high-quality, human-readable prompts from a large constrained feature space when evaluation budget is limited?

Main Contribution

A feature-based, interpretable prompt representation capturing template, demonstrations, roles, paraphrasing, and tone to expand the search space beyond enumerated candidates.

A Bayesian regression model that shares information across prompts and supports correlated beliefs and priors for feature effects.
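The two contributions above can be illustrated together with a short sketch: prompts are one-hot encoded from categorical features, and a conjugate Bayesian linear regression update (known noise variance) revises the belief after each evaluation. The feature names, levels, prior, and noise variance below are hypothetical placeholders; the paper's actual feature set and prior are not reproduced here.

```python
import numpy as np

# Hypothetical feature levels, chosen only for illustration.
FEATURES = {
    "template": ["instruction_first", "examples_first"],
    "n_examples": ["0", "2", "4"],
    "role": ["none", "expert"],
    "tone": ["neutral", "polite"],
}

def encode(prompt_cfg):
    """One-hot encode a dict of feature -> level into a flat vector."""
    parts = []
    for name, levels in FEATURES.items():
        onehot = np.zeros(len(levels))
        onehot[levels.index(prompt_cfg[name])] = 1.0
        parts.append(onehot)
    return np.concatenate(parts)

def bayes_update(mu, Sigma, x, y, noise_var):
    """Posterior of theta ~ N(mu, Sigma) after observing y = x @ theta + noise."""
    Sx = Sigma @ x
    denom = noise_var + x @ Sx
    mu_new = mu + Sx * (y - x @ mu) / denom
    Sigma_new = Sigma - np.outer(Sx, Sx) / denom
    return mu_new, Sigma_new

d = sum(len(levels) for levels in FEATURES.values())
mu, Sigma = np.zeros(d), np.eye(d)         # assumed prior: zero mean, identity covariance
x = encode({"template": "examples_first", "n_examples": "2",
            "role": "expert", "tone": "neutral"})
mu, Sigma = bayes_update(mu, Sigma, x, y=0.7, noise_var=0.01)
```

Because the belief is over shared feature weights, a single evaluation also shifts the predicted score of every other prompt that uses any of the same feature levels. This is the sense in which the model shares information across prompts.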

Key Findings

SOPL using KG achieves the highest average test accuracy across 13 challenging tasks.

Numbers: SOPL-KG mean test score 0.6281 vs EvoPrompt 0.5900 (Table 2).

Practical Use: If you must pick a single automated method for limited-budget prompt search, SOPL-KG is likely to yield better final prompts than EvoPrompt or TRIPLE on similar instruction-induction tasks.

Evidence Ref: Table 2, Section 6

SOPL-KG shows a consistent relative improvement over EvoPrompt and other baselines.

Numbers: 6.47% higher average test score vs EvoPrompt and 11.99% vs TRIPLE (Table 2).

Practical Use: Expect single-digit to low-double-digit relative gains in final task accuracy on comparable tasks when replacing evolutionary or bandit selection with KG.

Evidence Ref: Table 2, Section 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average test score (13 tasks)	0.6281	EvoPrompt 0.5900	+0.0381 (6.47%)	13 challenging instruction-induction tasks	Table 2 reports mean test scores across 20 replications	Table 2, Section 6
SOPL-KG mean test score (N=20)	0.6174	SOPL-KG (N=30) 0.6281	-0.0107	13 tasks, fewer evaluations	Table 3 shows performance at N=20	Table 3, Section 6.1
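
The Delta column can be rechecked from the rounded scores in the table. The snippet below is only an arithmetic check; the variable names are illustrative. From the rounded values the relative gain over EvoPrompt comes out at about 6.46%, so the reported 6.47% was presumably computed from unrounded scores (an assumption, not something the source states).

```python
sopl_kg_n30 = 0.6281   # SOPL-KG mean test score at N=30 (Table 2)
evoprompt = 0.5900     # EvoPrompt mean test score (Table 2)
sopl_kg_n20 = 0.6174   # SOPL-KG mean test score at N=20 (Table 3)

abs_gain = sopl_kg_n30 - evoprompt           # absolute gain over EvoPrompt
rel_gain_pct = 100 * abs_gain / evoprompt    # relative gain in percent
budget_delta = sopl_kg_n20 - sopl_kg_n30     # cost of stopping at 20 evaluations

print(f"{abs_gain:+.4f}")
print(f"{rel_gain_pct:.2f}%")
print(f"{budget_delta:+.4f}")
```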

What To Try In 7 Days

Define 4–6 interpretable prompt features for one task (template, examples, role, paraphrase, tone).

Build a small validation set and run a basic Bayesian update with a greedy or Thompson Sampling (TS) baseline to measure variance across prompts.

Run SOPL-KG (or TS if KG solver unavailable) for N≈20–30 evaluations and compare final test accuracy vs current prompts.
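
The steps above can be sketched as a budgeted loop. This sketch uses Thompson Sampling, the fallback mentioned in the last step, rather than KG, and it replaces the LLM call with a simulated scorer. Everything here is an assumption for illustration: the feature level counts in `LEVELS`, the prior, the noise variance, and the function `evaluate_prompt`, which stands in for scoring a rendered prompt on a validation set.

```python
import itertools
import numpy as np

LEVELS = [2, 3, 2, 2]  # hypothetical: number of levels for each of 4 prompt features

def all_candidates(levels):
    """Enumerate every feature combination as a concatenated one-hot row."""
    rows = []
    for combo in itertools.product(*[range(k) for k in levels]):
        rows.append(np.concatenate([np.eye(k)[c] for k, c in zip(levels, combo)]))
    return np.array(rows)

def run_thompson(X, evaluate_prompt, budget=30, noise_var=0.01, seed=0):
    """Spend `budget` evaluations; return the index of the best predicted candidate."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    mu, Sigma = np.zeros(d), np.eye(d)               # assumed prior belief
    for _ in range(budget):
        theta = rng.multivariate_normal(mu, Sigma)   # draw one plausible weight vector
        i = int(np.argmax(X @ theta))                # evaluate the prompt that draw favors
        x, y = X[i], evaluate_prompt(i)
        Sx = Sigma @ x                               # conjugate Bayesian linear update
        denom = noise_var + x @ Sx
        mu = mu + Sx * (y - x @ mu) / denom
        Sigma = Sigma - np.outer(Sx, Sx) / denom
    return int(np.argmax(X @ mu)), mu

# Simulated stand-in for scoring candidate i with an LLM on a validation set.
X = all_candidates(LEVELS)
_sim = np.random.default_rng(1)
_true_theta = _sim.normal(0.125, 0.05, size=X.shape[1])
calls = []

def evaluate_prompt(i):
    calls.append(i)
    return float(X[i] @ _true_theta + _sim.normal(0.0, 0.1))

best_idx, mu = run_thompson(X, evaluate_prompt, budget=30)
print("evaluations used:", len(calls), "chosen candidate:", best_idx)
```

In a real trial, `evaluate_prompt` would render the feature combination into prompt text, run it on the validation examples, and return the accuracy; the chosen prompt should then be scored once on a held-out test set for the comparison against current prompts.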

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: No

License: Unknown

Data URLs

Instruction Induction dataset (Honovich et al., 2022)

Risks & Boundaries

Limitations

Evaluated only on instruction-induction tasks; other task types may behave differently.

Experiments use GPT-3.5 only, so transfer to other LLMs is untested.

When Not To Use

If the prompt landscape is flat (the LLM is insensitive to prompt changes), greedy search suffices and KG adds overhead.

When you can run thousands of cheap evaluations, simpler population methods may match performance.

Failure Modes

Overfitting validation set: best validation prompt may not generalize if validation is small or unrepresentative.

Poor priors or missing features can mislead Bayesian updates and waste evaluations.

Core Entities

Models

GPT-3.5

Metrics

Accuracy

Datasets

Instruction Induction dataset (Honovich et al., 2022)

Benchmarks

EvoPrompt, TRIPLE, Thompson Sampling, Greedy

