Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.65
Citation Count
0
Why It Matters For Business
SOPL finds better human-readable prompts with far fewer costly LLM evaluations by modeling prompt features and choosing experiments adaptively, lowering API costs and time for deploying LLM-based features.
Summary TLDR
The authors present SOPL, a feature-based, Bayesian sequential learning framework for automated prompt engineering that uses the Knowledge-Gradient (KG) policy to pick which prompts to evaluate next. SOPL models prompts as interpretable feature vectors (template, examples, roles, paraphrase, tone), learns correlations across features with Bayesian regression, and computes KG decisions via mixed-integer conic optimization. On 13 challenging instruction-induction tasks evaluated with GPT-3.5, SOPL with KG finds higher-quality prompts within 30 or fewer evaluations and is especially advantageous when model outputs are highly sensitive to prompt choices.
Problem Statement
Manual prompt design is slow and brittle because LLM outputs vary widely with small prompt changes. Existing automated methods either search a fixed candidate set or require many evaluations. Real applications often allow only a small number of expensive evaluations. The paper asks: how to efficiently find high-quality, human-readable prompts from a large constrained feature space when evaluation budget is limited?
Main Contribution
A feature-based, interpretable prompt representation capturing template, demonstrations, roles, paraphrasing, and tone to expand the search space beyond enumerated candidates.
A Bayesian regression model that shares information across prompts and supports correlated beliefs and priors for feature effects.
Applying the forward-looking Knowledge-Gradient (KG) policy, computed via mixed-integer conic optimization, to select prompts sequentially under limited evaluation budgets; empirical results show KG outperforms baselines on instruction-induction tasks.
Key Findings
SOPL using KG achieves the highest average test accuracy across 13 challenging tasks.
SOPL-KG shows a consistent relative improvement over EvoPrompt and other baselines.
KG retains advantages with fewer evaluations and supports early stopping.
KG advantage grows when LLM outputs are sensitive to prompt features.
Enriching prompt representation with multiple features improves final prompts over searching only one feature.
Results
Average test score (13 tasks)
SOPL-KG mean test score (N=20)
SOPL-KG mean test score (N=10)
Early stopping performance (τ=10)
Who Should Care
What To Try In 7 Days
Define 4–6 interpretable prompt features for one task (template, examples, role, paraphrase, tone).
Build a small validation set and run a basic Bayesian update + greedy/TS baseline to measure variance across prompts.
Run SOPL-KG (or TS if KG solver unavailable) for N≈20–30 evaluations and compare final test accuracy vs current prompts.
Reproducibility
Data Urls
- Instruction Induction dataset (Honovich et al., 2022)
Data Available
Open Source Status
- no
Risks & Boundaries
Limitations
- Evaluated only on instruction-induction tasks; other task types may behave differently.
- Experiments use GPT-3.5 only, so transfer to other LLMs is untested.
- Requires designing discrete interpretable features; poor feature choices limit performance.
- KG decisions require solving mixed-integer conic problems, which adds computational overhead.
When Not To Use
- If prompt landscape is flat (LLM insensitive), greedy search suffices and KG adds overhead.
- When you can run thousands of cheap evaluations, simpler population methods may match performance.
- If you require white-box gradient methods (you have model weights) for token-level tuning.
Failure Modes
- Overfitting validation set: best validation prompt may not generalize if validation is small or unrepresentative.
- Poor priors or missing features can mislead Bayesian updates and waste evaluations.
- Optimization solver may be slow on very large or highly constrained feature spaces.
Core Entities
Models
- GPT-3.5
Metrics
- Accuracy
Datasets
- Instruction Induction dataset (Honovich et al., 2022)
Benchmarks
- EvoPrompt
- TRIPLE
- Thompson Sampling
- Greedy

