IPOMP: pick a small, diverse evaluation set and refine it from live model feedback to get better and more stable prompts

May 15, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Ximing Dong, Shaowei Wang, Dayi Lin, Ahmed E. Hassan

Links

Abstract / PDF

Why It Matters For Business

IPOMP makes automated prompt tuning more reliable and repeatable at very little extra cost, so you spend fewer trials and less inference budget to reach stable prompt performance.

Summary TLDR

IPOMP is a two-stage, low-cost method for picking a small evaluation set for automated prompt optimization. Stage 1 combines semantic clustering with boundary (most-dissimilar) examples to build a diverse 20-sample evaluation set. Stage 2 tracks real-time model outputs during optimization, finds redundant samples (those whose performance is highly correlated), and replaces a portion of them with dissimilar ones. On BIG-bench and LIAR with GPT-3.5 and GPT-4o-mini, IPOMP raised accuracy over the best baseline by 1.6–3.1%, cut run-to-run standard deviation (SD) by roughly 50% or more, and added less than 1% runtime overhead. Stage 2 also works as a plug-in to improve other selection methods.

Problem Statement

Prompt optimizers need a small evaluation subset to test candidate prompts, but random subsets often misrepresent the full data. Existing coreset methods either rely on prior model performance (costly or unavailable) or semantics alone (miss boundary cases and redundant examples). The result: unreliable evaluation and suboptimal prompts at unnecessary cost.

Main Contribution

IPOMP: a two-stage evaluation-data selection method for prompt optimization that combines semantic diversity with live model feedback.

Stage 1: mix K-means semantic clustering with boundary (least-similar) pairs to form a compact, diverse eval set (default N=20).

Stage 2: iteratively record model logits across candidate prompts, detect highly correlated samples (correlation threshold CT=0.9), and replace a fraction β (default 0.5) with semantically dissimilar examples.

Empirical gains on BIG-bench and LIAR with GPT-3.5 and GPT-4o-mini: modest accuracy gains, large stability gains, and <1% extra compute.

Stage 2 can be applied as a plug-in to improve other coreset methods without costly warm-up runs.
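The Stage 1 idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each example is already embedded as a numeric vector (for instance with Sentence-BERT), uses a tiny hand-rolled K-means, and splits the budget between cluster representatives and boundary pairs with a hypothetical `cluster_share` parameter that does not come from the source. The function names `kmeans` and `select_eval_set` are also illustrative.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iters=20, seed=0):
    """Tiny K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center, then recompute centers.
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def select_eval_set(vectors, n=20, cluster_share=0.5, seed=0):
    """Pick n indices: cluster representatives plus least-similar (boundary) pairs.

    cluster_share is an assumed knob for this sketch, not a value from the source.
    """
    k = max(1, int(n * cluster_share))
    labels, centers = kmeans(vectors, k, seed=seed)
    chosen = []
    # One representative per non-empty cluster: the member closest to its center.
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(vectors[i], centers[c])))
    # Boundary cases: walk pairs from least to most similar and add unseen members.
    pairs = sorted(
        (cosine(vectors[i], vectors[j]), i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )
    for _, i, j in pairs:
        for idx in (i, j):
            if len(chosen) < n and idx not in chosen:
                chosen.append(idx)
        if len(chosen) >= n:
            break
    return chosen


# Toy usage with random 8-dimensional "embeddings" standing in for real ones.
rng = random.Random(1)
embeddings = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(60)]
eval_idx = select_eval_set(embeddings, n=20)
```

The all-pairs similarity scan is quadratic in the number of examples, which is fine for a sketch but would likely need sampling or an approximate nearest-neighbor index on a large training set.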

Key Findings

IPOMP improves prompt accuracy over the best baseline (Anchor-Point) by a small but consistent margin.

Numbers: Accuracy +1.6% to +3.1% (across datasets/models)

IPOMP substantially reduces run-to-run instability (standard deviation).

Numbers: SD reduced by ≥50%

The live-refinement stage (Stage 2) is essential for stability and accuracy.

Numbers: Removing Stage 2 drops accuracy by 2.4% and increases SD by a factor of 2.83

IPOMP adds negligible runtime and monetary overhead compared to prompt optimization itself.

Numbers: Total overhead <1%; Stage 2 ≈2.83 s on average

A set size of about 20 evaluation samples balances cost and effectiveness in these experiments.

Numbers: Performance improves from 5 to 20 samples, then plateaus or drops beyond 20

Results

Accuracy

Value: +1.6% to +3.1%

Baseline: Anchor-Point (best baseline)

Stability (std dev)

Value: ≥50% reduction

Baseline: Best baseline

Overhead (runtime)

Value: <1% extra

Baseline: Prompt optimization runtime

Ablation: remove Stage 2

Value: Accuracy −2.4%; SD ×2.83

Baseline: Full IPOMP

Sample size effect

Value: Best at ≈20 samples

Baseline: Smaller or larger sample sets
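To make the stability figures concrete, here is a small sketch of how an SD reduction and an SD ratio could be computed from repeated runs. The function name `stability_report` and the accuracy lists are made up for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def stability_report(baseline_runs, candidate_runs):
    """Compare mean accuracy and sample standard deviation of two samplers."""
    base_sd, cand_sd = stdev(baseline_runs), stdev(candidate_runs)
    return {
        "accuracy_gain": mean(candidate_runs) - mean(baseline_runs),
        "sd_reduction": 1 - cand_sd / base_sd,  # 0.5 means SD was halved
        "sd_ratio": base_sd / cand_sd,          # baseline SD as a multiple of candidate SD
    }


# Placeholder accuracies from 5 runs each, for illustration only.
baseline = [0.60, 0.70, 0.64, 0.56, 0.70]
candidate = [0.66, 0.68, 0.65, 0.67, 0.69]
report = stability_report(baseline, candidate)
```

An `sd_reduction` of 0.5 or more corresponds to the "≥50% reduction" style of claim, while `sd_ratio` matches the "×2.83" style used in the ablation row.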

Who Should Care

What To Try In 7 Days

Use semantic clustering + boundary selection to pick ~20 eval examples for a new prompt task.

Run your existing prompt optimizer and log per-example logits for each candidate prompt; cluster examples by performance and replace a portion of the highly correlated ones (CT=0.9) with semantically dissimilar examples.

Add Stage 2 as a plug-in to your current evaluation sampler and compare SD across 5 runs to confirm stability improvements.
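The Stage 2 step in the list above might look roughly like the following. It is a hedged sketch under several assumptions: `scores[i]` holds a per-prompt confidence (derived from logits) for evaluation sample `i`, redundancy is detected with plain Pearson correlation against earlier kept samples, β is applied to the count of redundant samples, and `dissimilarity` is a hypothetical callable you supply (for example, one minus cosine similarity of embeddings). None of these names come from the source.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def find_redundant(scores, ct=0.9):
    """Return positions whose score profile correlates above ct with an earlier kept one."""
    kept, redundant = [], []
    for i, row in enumerate(scores):
        if any(pearson(row, scores[j]) > ct for j in kept):
            redundant.append(i)
        else:
            kept.append(i)
    return redundant


def refine(eval_idx, scores, pool_idx, dissimilarity, ct=0.9, beta=0.5):
    """Swap a fraction beta of redundant samples for the pool items that are
    farthest (by their nearest-neighbor dissimilarity) from the surviving set."""
    redundant = find_redundant(scores, ct)
    n_swap = int(len(redundant) * beta)
    to_drop = {eval_idx[i] for i in redundant[:n_swap]}
    survivors = [e for e in eval_idx if e not in to_drop]
    candidates = sorted(
        (p for p in pool_idx if p not in eval_idx),
        key=lambda p: min(dissimilarity(p, e) for e in survivors),
        reverse=True,
    )
    return survivors + candidates[:n_swap]


# Toy usage: 4 eval samples scored on 5 candidate prompts.
scores = [
    [0.10, 0.40, 0.50, 0.80, 0.90],
    [0.20, 0.50, 0.60, 0.90, 1.00],  # tracks sample 0 closely
    [0.90, 0.10, 0.80, 0.20, 0.50],
    [0.15, 0.45, 0.55, 0.85, 0.95],  # tracks sample 0 closely
]
eval_idx = [10, 11, 12, 13]
pool_idx = [0, 5, 14, 30]
# Stand-in dissimilarity on integer ids; a real setup would compare embeddings.
refined = refine(eval_idx, scores, pool_idx, lambda p, e: abs(p - e))
```

Running this refinement after each optimization round, rather than once, is what lets the evaluation set track the prompts currently being explored.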

Optimization Features

System Optimization

  • low extra runtime (<1%)

Training Optimization

  • data-efficient evaluation via coreset selection

Reproducibility

Data Urls

  • BIG-bench (public)
  • LIAR (public)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluations cover two datasets and two LLMs; results may not generalize to very different models or tasks.
  • Method requires access to per-example model logits or confidence during optimization, which some APIs or black-box setups may not expose.
  • Boundary and clustering heuristics rely on Sentence-BERT embeddings and chosen hyperparameters; extreme domain shifts may reduce effectiveness.

When Not To Use

  • You cannot collect per-example logits or confidences during prompt optimization.
  • Tasks where the notion of semantic similarity or the chosen embedding model fails (e.g., heavily structured non-text inputs).
  • Evaluation on the full training set is affordable and you prefer end-to-end measurement.

Failure Modes

  • If model logits are noisy or miscalibrated, correlation-based redundancy detection may misidentify diverse examples as redundant.
  • Over-replacing samples (β too large) can hurt effectiveness if replacements are not representative.
  • Boundary selection may overrepresent outliers if boundary detection is imperfect.

Core Entities

Models

  • GPT-3.5
  • GPT-4o-mini

Metrics

  • Accuracy
  • Standard deviation
  • Correlation

Datasets

  • BIG-bench
  • LIAR