SEE: a quad‑phased, operator-driven system that jointly optimizes instructions and examples to make LLM prompts stronger and cheaper

February 17, 2024

7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

1

Authors

Wendi Cui, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar, Jiaxin Zhang

Why It Matters For Business

SEE finds stronger prompts with far fewer API calls and tokens, so teams can improve LLM task accuracy while cutting prompt optimization cost and speeding experimentation.

Summary TLDR

SEE is a prompt‑search system that treats a full prompt (instruction + few‑shot examples) as one optimization target. It runs a four‑phase loop that alternates focused local operators (fast local fixes) with global fusion operators (search across candidates). On 35 public tasks SEE often finds stronger prompts than recent baselines while using far fewer LLM API calls and tokens. The method is model‑agnostic, uses five LLM operators (Lamarckian, EDA, Crossover, Feedback, Semantic), and adds two practical tweaks: performance‑based vectors with Hamming distance and adaptive phase stop rules.
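
The alternating structure described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the toy bit-tuple candidates, the `score` function, the two stand-in operators, and the assumed phase order (global, local, global, local) are all hypothetical. In SEE, each operator is an LLM call and the score comes from running a prompt on held-out examples.

```python
import random
from typing import Callable, List, Tuple

Candidate = Tuple[int, ...]  # toy stand-in for one prompt (instruction + examples)
Operator = Callable[[List[Candidate], random.Random], Candidate]


def score(c: Candidate) -> int:
    """Toy objective: count of 1s. A real run would score a prompt on a dev set."""
    return sum(c)


def global_op(pool: List[Candidate], rng: random.Random) -> Candidate:
    """Toy 'global' operator: mix two different candidates position by position."""
    a, b = rng.sample(pool, 2)
    return tuple(rng.choice(pair) for pair in zip(a, b))


def local_op(pool: List[Candidate], rng: random.Random) -> Candidate:
    """Toy 'local' operator: edit one position of the current best candidate."""
    best = max(pool, key=score)
    i = rng.randrange(len(best))
    return best[:i] + (1,) + best[i + 1:]


def run_phase(pool: List[Candidate], propose: Operator, rng: random.Random,
              steps: int, keep: int) -> List[Candidate]:
    """Add one proposal per step, then greedily keep the top `keep` candidates."""
    for _ in range(steps):
        merged = set(pool) | {propose(pool, rng)}
        pool = sorted(merged, key=lambda c: (score(c), c), reverse=True)[:keep]
    return pool


def four_phase_search(pool: List[Candidate], seed: int = 0,
                      steps: int = 5, keep: int = 4) -> Candidate:
    rng = random.Random(seed)
    # Assumed phase order for illustration only: global, local, global, local.
    for propose in (global_op, local_op, global_op, local_op):
        pool = run_phase(pool, propose, rng, steps, keep)
    return pool[0]  # the pool is sorted best-first after every step


initial = [(0, 0, 1, 0, 0, 0), (1, 0, 0, 0, 0, 0),
           (0, 0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 0)]
best = four_phase_search(initial)
print(best, score(best))
```

Because each step keeps the top candidates, the best score never decreases from one phase to the next; the global phases widen the pool, and the local phases refine its current best member.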

Problem Statement

Current automatic prompt search usually optimizes instruction text or example selection separately. That splits the prompt and misses interactions between instruction and examples. Jointly optimizing both is combinatorial, expensive, and hard to converge. The problem: how to search this high‑dimensional discrete space efficiently and reliably so prompts work better while keeping API/token cost reasonable.

Main Contribution

Formulate cohesive prompt optimization: jointly search instruction + examples to find prompts that work together.

Design SEE, a quad‑phased metaheuristic that alternates exploration (global operators) and exploitation (local operators) and adaptively picks LLM operators.

Two practical additions: use performance vectors + Hamming distance to measure candidate diversity, and adaptive phase stop rules to limit wasted API calls.

Extensive evaluation on 35 tasks vs 9 baselines showing higher accuracy and lower computational cost.
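
To make the diversity measure concrete, here is a minimal sketch. It assumes each prompt's performance vector is a binary list of per-example correctness on a shared dev set (1 = correct, 0 = wrong); that encoding, the prompt names, and the helper `most_diverse_pair` are illustrative assumptions rather than details confirmed by the paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming(u: List[int], v: List[int]) -> int:
    """Number of dev examples on which two prompts disagree in correctness."""
    if len(u) != len(v):
        raise ValueError("performance vectors must cover the same examples")
    return sum(a != b for a, b in zip(u, v))


def most_diverse_pair(perf: Dict[str, List[int]]) -> Tuple[str, str]:
    """Pick the two prompts whose performance vectors differ the most."""
    return max(
        combinations(sorted(perf), 2),
        key=lambda pair: hamming(perf[pair[0]], perf[pair[1]]),
    )


# Hypothetical per-example results for three candidate prompts.
perf = {
    "prompt_a": [1, 1, 0, 0, 1, 0],
    "prompt_b": [1, 1, 0, 0, 1, 1],
    "prompt_c": [0, 0, 1, 1, 0, 1],
}

print(hamming(perf["prompt_a"], perf["prompt_b"]))  # 1
print(most_diverse_pair(perf))  # ('prompt_a', 'prompt_c')
```

The point of measuring distance on behavior rather than on text embeddings is that two prompts with similar wording can fail on different examples, and two prompts with different wording can fail on the same ones. Pairing parents that succeed on complementary examples gives a fusion operator more to combine.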

Key Findings

On hard BBH tasks SEE improves final test accuracy vs prior SOTA by double‑digit points.

Numbers: avg +13.94 percentage points on BBH (8 tasks)

SEE cuts prompt optimization compute (API calls and tokens) substantially versus evolutionary/metaheuristic baselines.

Numbers: costs −58.67% (API/token) vs SOTA on reported comparisons

Performance‑based vectors + Hamming distance help select diverse parents and improve search.

Numbers: Hamming > cosine by +5.2 pp (Disambiguation) and +4.6 pp (Hyperbaton)

Different operators have distinct roles: Feedback converges fast; EDA/Crossover improve exploration.

Numbers: Feedback yields most improvement in step 1; EDA/Crossover show improvement across multiple steps (operator analysis ran

Results

Accuracy

Value: avg +13.94 pp

Baseline: SOTA (averaged over compared methods)

Compute (API calls / tokens)

Value: −58.67%

Baseline: SOTA methods (aggregate comparison)

Per-task gains vs AELP

Value: +15.31 pp (avg)

Baseline: AELP

Who Should Care

What To Try In 7 Days

Run SEE (GPT‑3.5) on one hard task you care about and compare final dev/test accuracy vs your current prompts.

Swap cosine similarity for performance vectors + Hamming distance when combining prompts and measure search speed.

Test operator tolerances: let Feedback run briefly but give EDA/Crossover more iterations; measure API calls saved.
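
One simple way to experiment with operator tolerances is a patience-style stop rule: end a phase once the best score has failed to improve for `tolerance` consecutive steps. The sketch below is an assumption about how such a rule could look, not the paper's exact criterion; the function name, the `min_gain` parameter, and the score traces are made up for illustration.

```python
from typing import Iterable


def steps_until_stop(best_scores: Iterable[float], tolerance: int,
                     min_gain: float = 0.0) -> int:
    """Count the steps a phase runs before `tolerance` consecutive stale steps.

    `best_scores` is the dev-set score observed after each step; a step is
    stale when it does not beat the best score so far by more than `min_gain`.
    """
    best = float("-inf")
    stale = 0
    steps = 0
    for s in best_scores:
        steps += 1
        if s > best + min_gain:
            best = s
            stale = 0
        else:
            stale += 1
            if stale >= tolerance:
                break
    return steps


# Hypothetical traces: a fast-converging operator and a slow, steady one.
feedback_trace = [0.60, 0.70, 0.70, 0.70, 0.70]
eda_trace = [0.60, 0.60, 0.62, 0.62, 0.62, 0.66, 0.66, 0.66, 0.66]

print(steps_until_stop(feedback_trace, tolerance=1))  # 3
print(steps_until_stop(eda_trace, tolerance=3))       # 9
print(steps_until_stop(eda_trace, tolerance=1))       # 2
```

In this toy data, a tolerance of 1 suits the fast-converging trace but cuts the slower one off after two steps, before it reaches 0.66. Each step in a real run costs API calls, so the step counts translate directly into the call budget you would measure.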

Agent Features

Planning

  • adaptive phase stop rules

Tool Use

  • LLM operators (Lamarckian, EDA, Crossover, Feedback, Semantic)
  • LLM as Examiner and Improver agents for feedback operator

Architectures

  • metaheuristic-style iterative search

Optimization Features

Token Efficiency

  • reports large token savings vs evolutionary baselines (Fig. 6)

System Optimization

  • adaptive operator selection and phase stop criteria to avoid wasted iterations

Inference Optimization

  • reduces API calls and total token consumption via phased search and greedy selection

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Still requires nontrivial compute: authors report ~12 iterations and ~4,000 API calls in some runs.
  • Single‑objective optimization: SEE focuses on accuracy and cost, not multi‑objective tradeoffs like fairness or interpretability.
  • No public code link provided in the paper; reproducing exact prompts/operators needs careful prompt engineering.

When Not To Use

  • If you need ultra‑low latency online re‑prompting (SEE needs thousands of calls during search).
  • If you require multi‑objective tuning (accuracy plus other constraints) out of the box.
  • If you cannot run or pay for repeated LLM API usage during the search phase.

Failure Modes

  • Search stalls if the initial pool lacks diversity; SEE relies on good initialization (Lamarckian or human examples).
  • Operator prompts or LLM failures (API errors) reduce effective evaluation and can bias selection.
  • Synthetic few‑shot examples may be incorrect in rare cases; authors found 2/92 inaccuracies but reported little effect on score.
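
The API-failure mode above can be guarded against by tracking how many evaluation examples actually returned a result. The following is a hedged sketch, not something the paper describes: the `None`-for-failure encoding, the function names, and the 0.9 coverage threshold are all assumptions.

```python
from typing import List, Optional, Tuple


def score_with_coverage(outcomes: List[Optional[bool]]) -> Tuple[float, float]:
    """Return (accuracy over answered examples, fraction of examples answered).

    Each outcome is True (correct), False (wrong), or None (the API call failed).
    """
    answered = [o for o in outcomes if o is not None]
    coverage = len(answered) / len(outcomes) if outcomes else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return accuracy, coverage


def comparable(outcomes: List[Optional[bool]], min_coverage: float = 0.9) -> bool:
    """Only rank a candidate if enough of its evaluation calls succeeded."""
    return score_with_coverage(outcomes)[1] >= min_coverage


# Hypothetical run: one of four evaluation calls failed.
run = [True, True, None, False]
accuracy, coverage = score_with_coverage(run)
print(round(accuracy, 3), coverage, comparable(run))
```

Candidates that fall below the coverage threshold can be re-evaluated instead of ranked, so a prompt is not selected or discarded on the basis of a smaller, possibly easier subset of examples.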

Core Entities

Models

  • GPT-3.5-turbo
  • GPT-4
  • PaLM 2
  • Claude 2
  • Llama3-70B
  • Llama3-8B
  • Llama2-7B
  • Mistral-7B

Metrics

  • Accuracy

Datasets

  • BBH (BigBench Hard)
  • Ethos
  • Liar
  • Sarcasm
  • Instruction Induction (24 tasks, Honovich et al.)

Benchmarks

  • BBH