Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
SEE finds stronger prompts with far fewer API calls and tokens, so teams can improve LLM task accuracy while cutting prompt optimization cost and speeding experimentation.
Summary TLDR
SEE is a prompt‑search system that treats a full prompt (instruction + few‑shot examples) as one optimization target. It runs a four‑phase loop that alternates focused local operators (fast local fixes) with global fusion operators (search across candidates). On 35 public tasks SEE often finds stronger prompts than recent baselines while using far fewer LLM API calls and tokens. The method is model‑agnostic, uses five LLM operators (Lamarckian, EDA, Crossover, Feedback, Semantic), and adds two practical tweaks: performance‑based vectors with Hamming distance and adaptive phase stop rules.
Problem Statement
Current automatic prompt search usually optimizes instruction text and example selection separately. Splitting the prompt this way misses interactions between the instruction and its examples. Joint optimization, however, is a combinatorial search that is expensive and slow to converge. The problem is how to search this high‑dimensional discrete space efficiently and reliably, so that prompts improve while API and token cost stays reasonable.
Main Contribution
Formulate cohesive prompt optimization: jointly search instruction + examples to find prompts that work together.
Design SEE, a quad‑phased metaheuristic that alternates exploration (global operators) and exploitation (local operators) and adaptively picks LLM operators.
Two practical additions: use performance vectors + Hamming distance to measure candidate diversity, and adaptive phase stop rules to limit wasted API calls.
Extensive evaluation on 35 tasks vs 9 baselines showing higher accuracy and lower computational cost.
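To make the alternation of local and global operators concrete, here is a minimal sketch of a phased search loop with greedy top-k selection. It is not the authors' implementation: the operators `local_edit` and `global_fuse`, the keyword-based `toy_score`, the phase schedule, and the `keep` parameter are all hypothetical stand-ins. In SEE the operators are LLM calls and the score is task accuracy.

```python
from typing import Callable, List, Sequence, Tuple

Prompt = str
Operator = Callable[[Sequence[Prompt]], Prompt]  # reads the pool, proposes one prompt

# Toy stand-in for dev-set accuracy (assumption): count of useful phrases present.
KEYWORDS = ("step by step", "example:", "answer:")


def toy_score(prompt: Prompt) -> int:
    return sum(k in prompt.lower() for k in KEYWORDS)


def local_edit(pool: Sequence[Prompt]) -> Prompt:
    """Local (exploitation) stand-in: patch the current best prompt."""
    best = pool[0]
    for k in KEYWORDS:
        if k not in best.lower():
            return f"{best} {k}"
    return best


def global_fuse(pool: Sequence[Prompt]) -> Prompt:
    """Global (exploration) stand-in: merge the two best prompts."""
    return pool[0] if len(pool) < 2 else f"{pool[0]} {pool[1]}"


def run_phases(
    pool: List[Prompt],
    score: Callable[[Prompt], float],
    phases: Sequence[Tuple[Operator, int]],
    keep: int = 4,
) -> List[Prompt]:
    """Run each (operator, max_iters) phase in order, keeping the top `keep` prompts."""
    pool = sorted(pool, key=score, reverse=True)
    for operator, max_iters in phases:
        for _ in range(max_iters):
            child = operator(pool)
            merged = list(dict.fromkeys([*pool, child]))  # dedupe, preserve order
            pool = sorted(merged, key=score, reverse=True)[:keep]
    return pool


seed_pool = [
    "Classify the review.",
    "Think step by step.",
    "Example: great film -> positive",
]
# Hypothetical schedule: local, then global, then local again.
phases = [(local_edit, 2), (global_fuse, 2), (local_edit, 2)]
final_pool = run_phases(seed_pool, toy_score, phases)
```

Swapping the toy operators for LLM-backed ones and `toy_score` for a dev-set evaluation keeps the same control flow; the fixed iteration counts here would be replaced by adaptive stop rules.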
Key Findings
On hard BBH tasks SEE improves final test accuracy vs prior SOTA by double‑digit points.
SEE cuts prompt optimization compute (API calls and tokens) substantially versus evolutionary/metaheuristic baselines.
Performance‑based vectors + Hamming distance help select diverse parents and improve search.
Different operators have distinct roles: Feedback converges fast; EDA/Crossover improve exploration.
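The diversity measure can be illustrated with a short sketch. It assumes each candidate prompt has a binary performance vector (1 if the prompt answers a given dev example correctly, 0 otherwise) and picks the pair of candidates with the largest Hamming distance as parents. The vectors, the prompt names, and the helper `most_diverse_pair` are hypothetical; the paper's exact selection rule may differ.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of dev examples on which two prompts disagree in correctness."""
    if len(a) != len(b):
        raise ValueError("performance vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def most_diverse_pair(vectors: Dict[str, Sequence[int]]) -> Tuple[str, str]:
    """Return the two candidates whose performance vectors differ the most."""
    return max(
        combinations(vectors, 2),
        key=lambda pair: hamming(vectors[pair[0]], vectors[pair[1]]),
    )


# Hypothetical per-example correctness on six dev examples.
perf = {
    "prompt_a": [1, 1, 0, 0, 1, 0],
    "prompt_b": [1, 1, 0, 1, 1, 0],
    "prompt_c": [0, 0, 1, 1, 0, 1],
}
parents = most_diverse_pair(perf)
```

Here `prompt_a` and `prompt_b` fail on nearly the same examples, so fusing them adds little, whereas `prompt_a` and `prompt_c` succeed on complementary examples. Two prompts with similar wording can still behave differently, which a text-embedding similarity would not capture.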
Results
Accuracy
Compute (API calls / tokens)
Per-task gains vs AELP
Who Should Care
What To Try In 7 Days
Run SEE (GPT‑3.5) on one hard task you care about and compare final dev/test accuracy vs your current prompts.
Swap cosine similarity for performance vectors + Hamming distance when combining prompts and measure search speed.
Test operator tolerances: let Feedback run briefly but give EDA/Crossover more iterations; measure API calls saved.
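One way to experiment with operator tolerances is a patience-style stop rule: a phase ends once an operator fails to improve the best score for a set number of consecutive iterations. The sketch below is an assumption about how such a rule could look, not the paper's exact criterion; `run_until_stalled`, the `TOLERANCE` values, and the replayed score sequence are hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical tolerances: Feedback stops quickly, EDA/Crossover get more room.
TOLERANCE = {"feedback": 1, "eda": 3, "crossover": 3}


def run_until_stalled(
    step: Callable[[], float],
    start_score: float,
    tolerance: int,
    max_calls: int = 50,
    min_gain: float = 0.0,
) -> Tuple[float, int]:
    """Call step() until `tolerance` consecutive calls bring no improvement.

    step() stands in for one operator application plus evaluation and returns
    the score of the new candidate. Returns (best score, number of calls made).
    """
    best, stalled, calls = start_score, 0, 0
    while stalled < tolerance and calls < max_calls:
        score = step()
        calls += 1
        if score > best + min_gain:
            best, stalled = score, 0  # improvement resets the stall counter
        else:
            stalled += 1
    return best, calls


def replay(scores):
    """Build a step() that replays a fixed score sequence (for illustration)."""
    it = iter(scores)
    return lambda: next(it)


history = [0.60, 0.62, 0.62, 0.61, 0.70]
short = run_until_stalled(replay(history), 0.55, TOLERANCE["feedback"], max_calls=5)
longer = run_until_stalled(replay(history), 0.55, TOLERANCE["eda"], max_calls=5)
```

On this replayed sequence the low tolerance stops after three calls, while the higher tolerance spends two more calls and reaches the later improvement. Logging the call count per phase shows how many API calls each tolerance setting costs or saves.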
Agent Features
Planning
- LoRA
- adaptive phase stop rules
Tool Use
- LLM operators (Lamarckian, EDA, Crossover, Feedback, Semantic)
- LLM as Examiner and Improver agents for feedback operator
Architectures
- metaheuristic-style iterative search
Optimization Features
Token Efficiency
- reports large token savings vs evolutionary baselines (Fig.6)
System Optimization
- adaptive operator selection and phase stop criteria to avoid wasted iterations
Inference Optimization
- reduces API calls and total token consumption via phased search and greedy selection
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Still requires nontrivial compute: authors report ~12 iterations and ~4,000 API calls in some runs.
- Single‑objective optimization: SEE optimizes task accuracy while limiting search cost; it does not handle multi‑objective tradeoffs such as fairness or interpretability.
- No public code link provided in the paper; reproducing exact prompts/operators needs careful prompt engineering.
When Not To Use
- If you need ultra‑low latency online re‑prompting (SEE needs thousands of calls during search).
- If you require multi‑objective tuning (accuracy plus other constraints) out of the box.
- If you cannot run or pay for repeated LLM API usage during the search phase.
Failure Modes
- Search stalls if the initial pool lacks diversity; SEE relies on good initialization (Lamarckian or human examples).
- Faulty operator prompts or LLM failures (such as API errors) reduce the number of valid evaluations and can bias selection.
- Synthetic few‑shot examples may be incorrect in rare cases; authors found 2/92 inaccuracies but reported little effect on score.
Core Entities
Models
- GPT-3.5-turbo
- GPT-4
- PaLM 2
- Claude 2
- Llama3-70B
- Llama3-8B
- Llama2-7B
- Mistral-7B
Metrics
- Accuracy
Datasets
- BBH (BIG-Bench Hard)
- Ethos
- Liar
- Sarcasm
- Instruction Induction (24 tasks, Honovich et al.)
Benchmarks
- BBH

