Use a feature-based prompt space plus a Knowledge-Gradient policy to find strong prompts in 30 or fewer costly LLM evaluations

January 7, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is well validated on 13 tough instruction-induction tasks with repeated trials; gains are empirical and task-dependent, and results rely on GPT-3.5 evaluations and specific feature choices.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: No

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 55%

Authors

Shuyang Wang, Somayeh Moazeni, Diego Klabjan

Why It Matters For Business

SOPL finds better human-readable prompts with far fewer costly LLM evaluations by modeling prompt features and choosing experiments adaptively, lowering API costs and time for deploying LLM-based features.

Summary TLDR

The authors present SOPL, a feature-based, Bayesian sequential learning framework for automated prompt engineering that uses the Knowledge-Gradient (KG) policy to pick which prompts to evaluate next. SOPL models prompts as interpretable feature vectors (template, examples, roles, paraphrase, tone), learns correlations across features with Bayesian regression, and computes KG decisions via mixed-integer conic optimization. On 13 challenging instruction-induction tasks evaluated with GPT-3.5, SOPL with KG finds higher-quality prompts within 30 or fewer evaluations and is especially advantageous when model outputs are highly sensitive to prompt choices.
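The Knowledge-Gradient idea can be illustrated on a tiny enumerated candidate set. The sketch below is not the paper's procedure, which computes KG decisions via mixed-integer conic optimization. It assumes a linear-Gaussian belief with known noise variance and estimates each candidate's KG value (the expected gain in the best posterior mean after one more evaluation) by Monte Carlo. The function name `kg_values`, the three-feature encoding (intercept, role, tone), and all numeric priors are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kg_values(X, mu, Sigma, noise_var, n_pairs=2000, seed=0):
    """Monte Carlo estimate of the knowledge-gradient value of each candidate.

    Assumed belief: score(x) = x . theta + noise,
    theta ~ N(mu, Sigma), noise ~ N(0, noise_var).
    """
    rng = random.Random(seed)
    half = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
    zs = half + [-z for z in half]  # antithetic pairs keep the sample mean at zero
    means = [dot(x, mu) for x in X]
    best_now = max(means)
    values = []
    for x in X:
        # After evaluating x, every candidate's posterior mean shifts linearly
        # in one standard normal draw Z: mean_i + slope_i * Z.
        Sx = [dot(row, x) for row in Sigma]
        scale = math.sqrt(noise_var + dot(x, Sx))
        slopes = [dot(xp, Sx) / scale for xp in X]
        exp_best = sum(
            max(m + s * z for m, s in zip(means, slopes)) for z in zs
        ) / len(zs)
        values.append(exp_best - best_now)
    return values

# Hypothetical candidates encoded as [intercept, role_expert, tone_polite].
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
mu = [0.5, 0.0, 0.0]
# Made-up prior: the role effect is uncertain, the tone effect nearly known.
Sigma = [[0.0, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.0001]]
kg = kg_values(X, mu, Sigma, noise_var=0.02)
next_index = max(range(len(X)), key=lambda i: kg[i])
```

In this toy setup the candidate that varies the uncertain role feature receives the largest KG value, while a candidate that touches no uncertain feature has a KG value of zero because evaluating it cannot change which prompt looks best.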

Problem Statement

Manual prompt design is slow and brittle because LLM outputs vary widely with small prompt changes. Existing automated methods either search a fixed candidate set or require many evaluations, while real applications often allow only a small number of expensive evaluations. The paper asks how to efficiently find high-quality, human-readable prompts in a large, constrained feature space when the evaluation budget is limited.

Main Contribution

A feature-based, interpretable prompt representation capturing template, demonstrations, roles, paraphrasing, and tone to expand the search space beyond enumerated candidates.

A Bayesian regression model that shares information across prompts and supports correlated beliefs and priors for feature effects.
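The two contributions above can be sketched together: encode a prompt as a feature vector, then update a Gaussian belief over feature effects after one observed score. This is a minimal illustration under assumptions, not the authors' implementation: the two binary features, the prior values, the noise variance, and the observed score are all made up, and the helper names `encode` and `bayes_update` are hypothetical.

```python
def encode(role_expert: bool, tone_polite: bool) -> list[float]:
    """Hypothetical encoding: [intercept, role_expert, tone_polite]."""
    return [1.0, float(role_expert), float(tone_polite)]

def bayes_update(mu, Sigma, x, y, noise_var):
    """One conjugate update of theta ~ N(mu, Sigma) after observing
    y = x . theta + noise, with noise ~ N(0, noise_var)."""
    d = len(x)
    Sx = [sum(row[j] * x[j] for j in range(d)) for row in Sigma]
    denom = noise_var + sum(xi * si for xi, si in zip(x, Sx))
    resid = y - sum(xi * mi for xi, mi in zip(x, mu))
    new_mu = [m + resid * s / denom for m, s in zip(mu, Sx)]
    new_Sigma = [
        [Sigma[i][j] - Sx[i] * Sx[j] / denom for j in range(d)]
        for i in range(d)
    ]
    return new_mu, new_Sigma

def predict(mu, x):
    return sum(xi * mi for xi, mi in zip(x, mu))

# Made-up prior: baseline accuracy near 0.5, feature effects near 0.
mu0 = [0.5, 0.0, 0.0]
Sigma0 = [[0.04, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.04]]

evaluated = encode(role_expert=True, tone_polite=False)
unseen = encode(role_expert=False, tone_polite=True)

# Pretend the evaluated prompt scored 0.7 on a validation set.
mu1, Sigma1 = bayes_update(mu0, Sigma0, evaluated, y=0.7, noise_var=0.02)

before = predict(mu0, unseen)
after = predict(mu1, unseen)
```

Because both prompts share the intercept feature, the belief about the unseen prompt moves after evaluating a different one, and the updated covariance gains off-diagonal entries. This is the sense in which a regression model shares information across prompts.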

Key Findings

SOPL using KG achieves the highest average test accuracy across 13 challenging tasks.

Numbers: SOPL-KG mean test score 0.6281 vs EvoPrompt 0.5900 (Table 2).

Practical Use: If you must pick a single automated method for limited-budget prompt search, SOPL-KG is likely to yield better final prompts than EvoPrompt or TRIPLE on similar instruction-induction tasks.

Evidence Ref: Table 2, Section 6

SOPL-KG shows a consistent relative improvement over EvoPrompt and other baselines.

Numbers: 6.47% higher average test score vs EvoPrompt and 11.99% vs TRIPLE (Table 2).

Practical Use: Expect single-digit to low-double-digit relative gains in final task accuracy on comparable tasks when replacing evolutionary or bandit selection with KG.

Evidence Ref: Table 2, Section 6

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average test score (13 tasks) | 0.6281 | EvoPrompt 0.5900 | +0.0381 (6.47%) | 13 challenging instruction-induction tasks | Table 2 reports mean test scores across 20 replications | Table 2, Section 6
SOPL-KG mean test score (N=20) | 0.6174 | SOPL-KG (N=30) 0.6281 | -0.0107 | 13 tasks, fewer evaluations | Table 3 shows performance at N=20 | Table 3, Section 6.1

What To Try In 7 Days

Define 4–6 interpretable prompt features for one task (template, examples, role, paraphrase, tone).

Build a small validation set and run a basic Bayesian update with a greedy or Thompson Sampling (TS) baseline to measure variance across prompts.

Run SOPL-KG (or TS if KG solver unavailable) for N≈20–30 evaluations and compare final test accuracy vs current prompts.
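The steps above can be prototyped before any LLM calls are spent. The sketch below runs a Thompson Sampling baseline over a small enumerated prompt set with a fixed evaluation budget. It is an illustration under assumptions, not the paper's code: `simulated_score` is a stand-in for scoring a prompt on a validation set with an LLM, and the feature encoding, `TRUE_THETA`, prior variance, and noise level are all invented.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def bayes_update(mu, Sigma, x, y, noise_var):
    """Conjugate update of theta ~ N(mu, Sigma) after observing score y at x."""
    d = len(x)
    Sx = [dot(row, x) for row in Sigma]
    denom = noise_var + dot(x, Sx)
    gain = (y - dot(x, mu)) / denom
    mu = [m + gain * s for m, s in zip(mu, Sx)]
    Sigma = [[Sigma[i][j] - Sx[i] * Sx[j] / denom for j in range(d)]
             for i in range(d)]
    return mu, Sigma

def cholesky(A):
    """Lower-triangular factor L with L L^T = A (small floor for stability)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def sample_theta(mu, Sigma, rng):
    """Draw one sample from N(mu, Sigma)."""
    L = cholesky(Sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + dot(L[i][: i + 1], z[: i + 1]) for i, m in enumerate(mu)]

def thompson_search(X, evaluate, budget, noise_var, prior_var=0.25, seed=0):
    """Spend `budget` evaluations, each on the candidate that looks best
    under a fresh posterior sample, then return the posterior-mean winner."""
    rng = random.Random(seed)
    d = len(X[0])
    mu = [0.0] * d
    Sigma = [[prior_var if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(budget):
        theta = sample_theta(mu, Sigma, rng)
        x = max(X, key=lambda c: dot(c, theta))
        mu, Sigma = bayes_update(mu, Sigma, x, evaluate(x), noise_var)
    return max(X, key=lambda c: dot(c, mu)), mu

# Hypothetical candidates: [intercept, role_expert, tone_polite].
CANDIDATES = [[1.0, r, t] for r in (0.0, 1.0) for t in (0.0, 1.0)]
TRUE_THETA = [0.5, 0.15, -0.05]  # invented effects, used only by the simulator
_noise = random.Random(1)

def simulated_score(x):
    # Stand-in for an LLM evaluation on a validation set.
    return dot(x, TRUE_THETA) + _noise.gauss(0.0, 0.05)

best, mu = thompson_search(CANDIDATES, simulated_score, budget=25,
                           noise_var=0.05 ** 2)
```

Replacing `simulated_score` with a real scoring function, and repeating the run with several seeds, gives a rough picture of how much final accuracy varies across prompts and replications before committing to a KG solver.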

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: No
License: Unknown

Data URLs

Instruction Induction dataset (Honovich et al., 2022)

Risks & Boundaries

Limitations

Evaluated only on instruction-induction tasks; other task types may behave differently.

Experiments use GPT-3.5 only, so transfer to other LLMs is untested.

When Not To Use

If the prompt landscape is flat (the LLM is insensitive to prompt changes), greedy search suffices and KG adds overhead.

When you can run thousands of cheap evaluations, simpler population methods may match performance.

Failure Modes

Overfitting validation set: best validation prompt may not generalize if validation is small or unrepresentative.

Poor priors or missing features can mislead Bayesian updates and waste evaluations.

Core Entities

Models

GPT-3.5

Metrics

Accuracy

Datasets

Instruction Induction dataset (Honovich et al., 2022)

Benchmarks

EvoPrompt, TRIPLE, Thompson Sampling, Greedy