Automatic Prompt Search Papers — Parsed & Scored for Practitioners

Prompt LLMs to propose hyperparameters and training code; they match or beat standard HPO early in search.

0.60

0.70

7

LLMs can find better hyperparameters faster than random search in low-budget settings, speeding model iteration and cutting compute cost when trials are expensive.

Key finding

GPT-4 Turbo beats random search on HPOBench in the 10-evaluation setting.

Numbers: beats random search 81.25% of the time; median error change 13.70%; mean error change 19.83% (Table 1).

PREFER: automatically grow and weight prompts by having an LLM reflect on its errors and refine prompts

0.60

0.65

0.70

6

PREFER automates prompt generation and ensemble weighting, yielding better few-shot accuracy and much lower API/time cost than heavy prompt-search methods.

Key finding

PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.

Numbers: QNLI: 0.793 vs 0.720 (synonym ensemble); Liar: 0.744 vs 0.572 (synonym)
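
The ensemble-weighting idea can be illustrated with a weighted vote over per-prompt predictions. This is only a minimal sketch: the function name `weighted_vote`, the labels, and the weights are hypothetical, and how PREFER actually derives its weights is not described in this summary.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine one label per prompt into a single label using per-prompt weights."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prompt prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)

# Hypothetical example: three prompts label the same NLI item.
predictions = ["entailment", "not_entailment", "not_entailment"]
weights = [0.9, 0.3, 0.4]  # made-up weights, assumed to be given
ensemble_label = weighted_vote(predictions, weights)
```

With these made-up weights the single strong prompt outvotes the two weak ones, which is the behaviour a weighted ensemble is meant to allow and a plain majority vote is not.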

Pick the best prompt per query offline using inverse RL and cheap embeddings

0.70

0.60

0.80

5

You can cut prompt-evaluation costs and improve per-query outputs by training a small offline reward model on past prompt logs and using it to pick prompts instead of repeatedly calling expensive LLM verification.

Key finding

Prompt-OIRL improves correctness when only one demonstration prompt is available.

Numbers: +24.3%
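
Per-query prompt selection with a cheap reward model might look roughly like the sketch below. Everything here is an assumption for illustration: the bilinear scoring rule, the two-dimensional embeddings, and the names `predicted_reward` and `select_prompt` are hypothetical and are not taken from Prompt-OIRL itself.

```python
def predicted_reward(query_emb, prompt_emb, reward_weights):
    """Toy reward model: weighted sum of element-wise query/prompt interactions."""
    return sum(w * q * p for w, q, p in zip(reward_weights, query_emb, prompt_emb))

def select_prompt(query_emb, prompt_embs, reward_weights):
    """Return the name of the prompt with the highest predicted reward for this query."""
    return max(
        prompt_embs,
        key=lambda name: predicted_reward(query_emb, prompt_embs[name], reward_weights),
    )

# Hypothetical embeddings; a real system would use an embedding model and
# weights fitted offline on logged (query, prompt, correctness) records.
prompt_embs = {"step_by_step": [1.0, 0.0], "direct": [0.0, 1.0]}
reward_weights = [1.0, 1.0]

choice_for_math_query = select_prompt([1.0, 0.0], prompt_embs, reward_weights)
choice_for_lookup_query = select_prompt([0.0, 1.0], prompt_embs, reward_weights)
```

Selection only needs embeddings and a small scoring function, so no LLM call is made until the chosen prompt is actually used.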

Train a small-model retriever to pick natural-language demonstrations that boost zero-shot LLMs across tasks and models

0.60

0.70

2

Train a single lightweight retriever once (with a small model) to boost many larger LLMs at inference, cutting repeated fine-tuning costs and improving zero-shot accuracy on many NLU tasks.

Key finding

UPRISE raises zero-shot Reading Comprehension average on GPT‑Neo‑2.7B from 31.6 to 40.1 (absolute gain).

Numbers: 31.6 -> 40.1 (+8.5 pp)

SEE: a quad‑phased, operator-driven system that jointly optimizes instructions and examples to make LLM prompts stronger and cheaper

0.60

0.70

0.80

1

SEE finds stronger prompts with far fewer API calls and tokens, so teams can improve LLM task accuracy while cutting prompt optimization cost and speeding experimentation.

Key finding

On hard BBH tasks, SEE improves final test accuracy over the prior SOTA by double‑digit points.

Numbers: avg +13.94 percentage points on BBH (8 tasks)

Let a CEO→Manager→Worker hierarchy auto-write better prompts and improve zero-shot LLM outputs

0.60

0.50

1

HMAW automates prompt tuning without training and boosts response quality across varied tasks, letting teams improve outputs quickly while avoiding dataset-specific finetuning.

Key finding

HMAW raises the average preference score across five tasks by 30.7 percentage points.

Numbers: Avg pref: 69.2% (HMAW) vs 38.5% (no prompt); +30.7 pts

Pick demonstrations that match an LLM's syntax, and ChatGPT can beat a supervised OpenIE model in the 6-shot setting

0.40

0.60

0.50

1

You can improve black-box LLM few-shot extraction by choosing demos that match the model's own output style; this yields production-quality gains without fine-tuning and reduces annotation cost for Open IE.

Key finding

Syntactic mismatch predicts extraction errors for ChatGPT.

Numbers: R^2 = 0.58 (Figure 1 correlation)

Auto-differentiate entire LLM pipelines so prompts across multi-node and agentic workflows are optimized automatically

0.60

0.70

0

Automates and concentrates prompt tuning across complex LLM pipelines, reducing manual engineering time and often improving accuracy while lowering token costs.

Key finding

On the ObjectCount single-LLM task, LLM-AutoDiff achieved 93.75% test EM vs Text-Grad's 84.5% on the reported split.

Numbers: Test EM: LLM-AutoDiff 93.75% vs Text-Grad 84.5% (Table 2)

Create short, human-readable persona prompts from a few user preference pairs to improve personalized reward judgments

0.60

0.70

0.50

0

SynthesizeMe yields interpretable persona prompts that improve in-context judgment of user preferences without full model finetuning; useful when collecting a few pairwise judgments is feasible but large-scale retraining is not.

Key finding

SynthesizeMe boosts LLM-as-a-judge accuracy on Chatbot Arena.

Numbers: up to +4.4% absolute accuracy (Chatbot Arena)

Find better natural-language prompts by searching in embedding space

0.30

0.60

0.40

0

You can improve the performance of an API-only LLM on a task without model access or costly fine-tuning. That lowers operational friction for domain teams who need better outputs fast.

Key finding

Latent space search found a prompt that raised test accuracy from 75.36% to 78.14% on Financial PhraseBank.

Numbers: 75.36% -> 78.14% (+2.78 pp)

Use a feature-based prompt space plus a Knowledge-Gradient policy to find strong prompts in 30 or fewer costly LLM evaluations

0.60

0.55

0.65

0

SOPL finds better human-readable prompts with far fewer costly LLM evaluations by modeling prompt features and choosing experiments adaptively, lowering API costs and time for deploying LLM-based features.

Key finding

SOPL using KG achieves the highest average test accuracy across 13 challenging tasks.

Numbers: SOPL-KG mean test score 0.6281 vs EvoPrompt 0.5900 (Table 2).

AutoHD: ask an LLM to write Python heuristics, evolve them, and use those heuristics to guide search at inference time

0.65

0.60

0.55

0

AutoHD improves LLM planning accuracy without extra model training and produces interpretable Python heuristics you can inspect and reuse.

Key finding

AutoHD substantially improves planning accuracy on Blocksworld when compared to baselines.

Numbers: AutoHD All accuracies: 42.4% (GPT-4o-mini), 75.1% (GPT-4o), 59.1% (LLaMA 3.1 70B)
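
Using a plain Python heuristic to guide search can be sketched as a greedy best-first search. This is an illustrative assumption, not AutoHD's actual search procedure: the toy number puzzle, the hand-written heuristic, and the function name `best_first_search` are all hypothetical stand-ins for an LLM-written heuristic on a planning task.

```python
import heapq

def best_first_search(start, is_goal, successors, heuristic, max_expansions=1000):
    """Greedy best-first search: always expand the state with the lowest heuristic value."""
    # Each entry is (heuristic value, tie-breaker, state, path to state).
    frontier = [(heuristic(start), 0, start, [start])]
    seen = {start}
    counter = 1
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        expansions += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), counter, nxt, path + [nxt]))
                counter += 1
    return None  # no plan found within the expansion limit

# Toy problem: reach 10 from 1 using "+1" and "*2" moves.
target = 10
path = best_first_search(
    start=1,
    is_goal=lambda n: n == target,
    successors=lambda n: [n + 1, n * 2],
    heuristic=lambda n: abs(target - n),  # stand-in for an LLM-written heuristic
)
```

Because the heuristic is ordinary Python, it can be read, unit-tested, and swapped for another candidate without touching the search loop.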

Learn 5-token language triggers to boost multilingual LLM accuracy by ~3.7–19.9% on Global MMLU

0.60

0.70

0

PolyPrompt offers a low-cost way to raise non-English QA accuracy by training a few small embeddings per language instead of costly model fine-tuning.

Key finding

PolyPrompt improves multilingual multiple-choice accuracy across tested languages.

Numbers: Absolute gains reported: 3.7%–19.9% (across languages, Table 1 / Abstract)

Optimize prompts by minimizing token-level loss — no sampling, no external judges

0.70

0.60

0.80

0

PMPO cuts evaluation cost by using log‑probabilities instead of sampling and external judges, enabling faster prompt tuning for deployed models and improving midsize model outputs without fine‑tuning.

Key finding

PMPO attains the highest average accuracy on BBH in the 1‑shot Qwen2.5‑14B experiments.

Numbers: Average accuracy 80.6% vs EvoPrompt 78.0% and OPRO 77.1%
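
Scoring prompts by token-level loss instead of sampled outputs can be sketched as follows. The sketch assumes you already have per-token log-probabilities of the reference answers under each candidate prompt; the names `mean_token_loss` and `pick_prompt` and the numbers are hypothetical, and PMPO's exact objective may differ.

```python
def mean_token_loss(token_logprobs):
    """Average negative log-probability per reference token (lower is better)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)

def pick_prompt(logprobs_by_prompt):
    """Return the prompt name whose reference tokens have the lowest mean loss."""
    return min(
        logprobs_by_prompt,
        key=lambda name: mean_token_loss(logprobs_by_prompt[name]),
    )

# Made-up log-probabilities of the reference answer tokens under each prompt.
candidate_logprobs = {
    "prompt_a": [-0.1, -0.2, -0.3],
    "prompt_b": [-0.5, -0.5],
}
best_prompt = pick_prompt(candidate_logprobs)
```

Each candidate needs only one scoring pass over the reference tokens, with no generation and no external judge call.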

metaTextGrad: Meta-learn prompts and pipelines for LLM-based optimizers to boost task accuracy

0.60

0.70

0

Meta-learning optimizer prompts and compositions can boost task accuracy and reduce model cost by letting cheaper program models be amplified by smarter optimizer/meta calls.

Key finding

metaTextGrad raises average test accuracy versus best baseline on evaluated benchmarks.

Numbers: Avg test acc 0.53 vs 0.47 (+0.06)

Combine a structure-aware GP with Hyperband to find good prompts with far fewer API calls

0.60

0.70

0

HbBoPs reduces the number of expensive LLM API calls needed to find a good static prompt, cutting cost and time for model-driven features that rely on single-prompt deployments.

Key finding

On average HbBoPs produced the lowest normalized test error across methods.

Numbers: Avg normalized test error 0.150 vs HDBO 0.185 (Section 5.1)
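
Hyperband is built on successive halving, which can be sketched on its own. The code below is a simplified illustration only: it omits the structure-aware GP entirely, and the `evaluate` callable, the budgets, and the toy error values are hypothetical.

```python
def successive_halving(prompts, evaluate, min_budget=2, eta=2):
    """Keep the best 1/eta of the prompts each round while multiplying the budget by eta.

    evaluate(prompt, budget) is a hypothetical callable that returns an error
    estimate computed on `budget` validation instances (lower is better).
    """
    survivors = list(prompts)
    budget = min_budget
    instances_used = 0
    while len(survivors) > 1:
        scored = []
        for prompt in survivors:
            scored.append((evaluate(prompt, budget), prompt))
            instances_used += budget
        scored.sort(key=lambda pair: pair[0])
        keep = max(1, len(survivors) // eta)
        survivors = [prompt for _, prompt in scored[:keep]]
        budget *= eta
    return survivors[0], instances_used

# Toy example with made-up errors that do not depend on the budget.
toy_errors = {"p1": 0.40, "p2": 0.15, "p3": 0.30, "p4": 0.25}
best, used = successive_halving(toy_errors, lambda prompt, budget: toy_errors[prompt])
```

Weak prompts are dropped after a cheap evaluation, so the larger budgets are spent only on the candidates that survive.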

Evaluate and optimize prompts without gold labels using self- and mutual-consistency

0.60

0.70

0

GLaPE lets teams optimize prompts without costly labels, enabling prompt tuning for private models and new tasks while cutting annotation costs.

Key finding

GLaPE-guided prompt optimization matches or closely trails label-based optimization on standard reasoning benchmarks.

Numbers: GSM8K: GLaPE 77.7% vs OPRO 76.6%; MultiArith: 99.3% vs 99.6% (Table 3)
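
The self-consistency half of a label-free score can be sketched as the agreement rate among sampled answers. This covers only self-consistency, not the mutual-consistency part, and the function name and sample answers are hypothetical rather than GLaPE's exact formulation.

```python
from collections import Counter

def self_consistency(answers):
    """Return (majority answer, fraction of sampled answers that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)

# Hypothetical answers sampled from one prompt for one question.
sampled = ["18", "18", "20", "18"]
majority_answer, agreement = self_consistency(sampled)
```

A prompt whose samples agree more often gets a higher score, and no gold label is consulted at any point.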

Use LLMs to automatically invent better tensor-network structure search algorithms

0.40

0.70

0.60

0

Automating algorithm discovery with LLMs can find improved model-compression strategies and reduce expert hours needed to hand-craft search heuristics.

Key finding

The best algorithm found by tnGPS reaches a much lower objective value than the baselines on the reported benchmark.

Numbers: Objective: baseline 0.1558 -> tnGPS 0.1102

Find a model's true knowledge boundary by optimizing prompts that preserve meaning

0.60

0.50

0

Fixed-question testing can hide or undercount model knowledge and lead to poor model choices; optimized, semantics-preserving prompt search reveals a model's true answerable range so teams can pick models that actually cover needed domain facts.

Key finding

PGDC finds more answerable items than standard prompting on common-knowledge benchmarks.

Numbers: LLaMA2 success: PGDC 71.36% vs P-few 66.95% vs zero 34.43%

Automate and iteratively improve text prompts using a dual-LLM generator + corrector to reduce hallucinations

0.50

0.60

0

SPT can raise task accuracy significantly without costly model fine-tuning; it offers a lower-barrier way to boost product QA and reduce hallucinations if you can afford extra API/computation and representative training data.

Key finding

SPT raised GPT‑4 accuracy on GSM8K from 65.8% to 94.1% on evaluated splits.

Numbers: 65.8% -> 94.1% (+28.3 pp); Table 2

Retrieve similar QA examples on the fly so LLMs write correct SPARQL without fine-tuning

0.60

0.70

0

DFSL gives near state-of-the-art KGQA without dataset fine-tuning, cutting training cost and enabling faster deployment across knowledge graphs.

Key finding

DFSL can turn an unfinetuned LLM into a competitive KGQA system.

Numbers: LC-QuAD 2.0 F1: zero-shot 38.40 → DFSL 85.45 (+47.05)
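
Retrieving similar question/query pairs at inference time and placing them in the prompt can be sketched as below. The word-overlap similarity is a stand-in for whatever retriever DFSL uses, and the example pool, the SPARQL strings, and the prompt layout are hypothetical.

```python
def tokens(text):
    """Lowercased word set with question marks removed."""
    return set(text.lower().replace("?", "").split())

def jaccard(a, b):
    """Word-overlap similarity between two questions (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def build_prompt(question, example_pool, k=2):
    """Pick the k most similar stored examples and format a few-shot prompt."""
    ranked = sorted(
        example_pool,
        key=lambda ex: jaccard(question, ex["question"]),
        reverse=True,
    )
    shots = ranked[:k]
    blocks = [f"Question: {ex['question']}\nSPARQL: {ex['sparql']}" for ex in shots]
    blocks.append(f"Question: {question}\nSPARQL:")
    return "\n\n".join(blocks), shots

# Hypothetical example pool; the prefixes and predicates are placeholders.
pool = [
    {"question": "Who directed Inception?", "sparql": "SELECT ?d WHERE { :Inception :director ?d }"},
    {"question": "What is the capital of France?", "sparql": "SELECT ?c WHERE { :France :capital ?c }"},
    {"question": "Who directed Titanic?", "sparql": "SELECT ?d WHERE { :Titanic :director ?d }"},
]
prompt, shots = build_prompt("Who directed Alien?", pool, k=2)
```

The two retrieved examples share the target question's structure, so the model sees the right query pattern without any fine-tuning.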

GRAD-SUM: summarize model feedback to automatically produce generalizable prompts

0.60

0.50

0.60

0

Automates prompt tuning for black-box LLMs, cutting manual time and raising held-out performance by ~14% on tested tasks.

Key finding

GRAD-SUM improves on the initial prompts by an average of 14% across the tested datasets.

Numbers: avg +14% final validation rating

Pick a small, high-impact set of unlabeled examples to label using graph diffusion and boost in‑context learning.

0.60

0.70

0

Label fewer examples and get nearly the same or better in-context performance while cutting selection time and inference cost; this lowers annotation bills and speeds up prompt curation.

Key finding

IDEAL outperforms Vote-k and random selection in most evaluations.

Numbers: Better in 17 out of 18 eval cases across 9 datasets