PREFER: automatically grow and weight prompts by having an LLM reflect on its errors and refine prompts

August 23, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

6

Authors

Chenrui Zhang, Lin Liu, Jinpeng Wang, Chuyuan Wang, Xiao Sun, Hongyu Wang, Mingchen Cai

Why It Matters For Business

PREFER automates prompt generation and ensemble weighting, yielding better few-shot accuracy and much lower API/time cost than heavy prompt-search methods.

Summary TLDR

PREFER is an automatic prompt-ensemble method. It iteratively asks a large language model (LLM) to reflect on the examples it got wrong, generates improved prompts from that feedback, and ensembles those prompts with a bilateral bagging step that combines forward and backward confidence. On few-shot classification and NLI tasks, PREFER raises F1 over single prompts and prior prompt-ensemble and iterative methods (for example, 0.793 vs 0.720 on QNLI against a synonym ensemble), while requiring fewer API calls and converging in 2–3 iterations. Code is released.
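The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the AdaBoost-style weight update is an assumption standing in for the paper's boosting-style weighting, and `prefer_style_loop`, `vote`, `toy_predict`, and `toy_refine` are hypothetical names. The toy functions replace real LLM calls so the sketch runs offline.

```python
import math
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, binary label)


def prefer_style_loop(
    examples: List[Example],
    first_prompt: str,
    predict: Callable[[str, str], int],           # hypothetical: (prompt, text) -> label
    refine: Callable[[str, List[Example]], str],  # hypothetical: (prompt, wrong examples) -> new prompt
    steps: int = 3,
) -> List[Tuple[str, float]]:
    """Grow a weighted prompt set and return (prompt, weight) pairs."""
    n = len(examples)
    sample_w = [1.0 / n] * n
    prompt = first_prompt
    ensemble: List[Tuple[str, float]] = []
    for _ in range(steps):
        wrong = [predict(prompt, x) != y for x, y in examples]
        # Weighted error, clipped so the logarithm below stays finite.
        err = sum(w for w, bad in zip(sample_w, wrong) if bad)
        err = min(max(err, 1e-6), 1 - 1e-6)
        alpha = 0.5 * math.log((1 - err) / err)  # assumed AdaBoost-style prompt weight
        ensemble.append((prompt, alpha))
        # Up-weight the examples this prompt got wrong, then renormalize.
        sample_w = [w * math.exp(alpha if bad else -alpha)
                    for w, bad in zip(sample_w, wrong)]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
        hard = [ex for ex, bad in zip(examples, wrong) if bad]
        if not hard:
            break  # nothing left to reflect on
        prompt = refine(prompt, hard)  # stands in for the reflect-and-refine LLM call
    return ensemble


def vote(ensemble: List[Tuple[str, float]],
         predict: Callable[[str, str], int], text: str) -> int:
    """Weighted vote over all prompts in the ensemble."""
    score = sum(a * (1 if predict(p, text) == 1 else -1) for p, a in ensemble)
    return 1 if score > 0 else 0


# Toy stand-ins for the LLM: a "prompt" is a comma-separated list of positive words.
def toy_predict(prompt: str, text: str) -> int:
    return 1 if text in prompt.split(",") else 0


def toy_refine(prompt: str, wrong: List[Example]) -> str:
    missed = [text for text, label in wrong if label == 1]
    return ",".join([prompt] + missed)


examples = [("good", 1), ("bad", 0), ("great", 1), ("awful", 0)]
ensemble = prefer_style_loop(examples, "good", toy_predict, toy_refine)
```

In this toy run the first prompt misses "great", the refine step adds it, and the second prompt receives the larger weight. In practice, `predict` and `refine` would wrap calls to an LLM API.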

Problem Statement

Existing prompt-ensemble methods require a pre-prepared set of prompts or expensive search and do not jointly optimize prompts. This raises manual effort, instability, and extra API cost. PREFER aims to automate prompt creation and joint optimization using LLM-generated feedback so ensembles focus on hard examples.

Main Contribution

Feedback-reflect-refine loop: use LLM-written textual feedback on wrong examples to generate new prompts automatically.

Bilateral prompt bagging: combine forward (is-this-answer) and backward (is-this-answer-excluded) confidence to reduce overconfidence and stabilize outputs.

A full algorithm that iteratively grows a weighted prompt set (boosting-style) and shows better few-shot F1 and lower API cost than baselines.

Empirical ablations showing feedback is the most important component and bilateral bagging beats majority voting.

Public code release to reproduce experiments.
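To make the bilateral bagging idea concrete, here is a small sketch under stated assumptions. The summary only says that forward and backward confidence are combined; the subtraction rule in `bilateral_score` is an illustrative assumption, not the paper's formula. `bagged_predict`, `toy_forward`, and `toy_backward` are hypothetical names, and the confidence tables are made-up toy values rather than model outputs.

```python
from typing import Callable, Dict, List, Tuple

Confidence = Callable[[str, str, str], float]  # (prompt, text, label) -> value in [0, 1]


def bilateral_score(forward: float, backward: float) -> float:
    """Assumed rule: evidence for a label minus evidence that it is excluded."""
    return forward - backward


def bagged_predict(
    text: str,
    labels: List[str],
    prompts: List[Tuple[str, float]],  # (prompt, ensemble weight)
    forward_conf: Confidence,          # hypothetical "is this the answer?" confidence
    backward_conf: Confidence,         # hypothetical "is this answer excluded?" confidence
) -> str:
    """Pick the label with the highest weighted bilateral score."""
    totals: Dict[str, float] = {label: 0.0 for label in labels}
    for prompt, weight in prompts:
        for label in labels:
            f = forward_conf(prompt, text, label)
            b = backward_conf(prompt, text, label)
            totals[label] += weight * bilateral_score(f, b)
    return max(totals, key=totals.get)


# Toy confidences: the forward pass alone is overconfident about "contradiction".
FORWARD = {"entailment": 0.90, "contradiction": 0.95}
BACKWARD = {"entailment": 0.20, "contradiction": 0.70}


def toy_forward(prompt: str, text: str, label: str) -> float:
    return FORWARD[label]


def toy_backward(prompt: str, text: str, label: str) -> float:
    return BACKWARD[label]


prediction = bagged_predict(
    "premise / hypothesis pair",
    ["entailment", "contradiction"],
    [("prompt A", 1.0), ("prompt B", 0.5)],
    toy_forward,
    toy_backward,
)
```

With these toy values, forward confidence alone would favor "contradiction", while the bilateral score picks "entailment". That is the kind of overconfidence correction the bagging step is meant to provide.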

Key Findings

PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.

Numbers: QNLI 0.793 vs 0.720 (synonym ensemble); Liar 0.744 vs 0.572 (synonym ensemble)

Directed feedback is critical: removing the feedback-reflect-refine step causes large accuracy drops on multiple datasets.

Numbers: SNLI 0.647→0.580; Ethos 0.963→0.812 after removing feedback

Bilateral bagging improves stability and reduces API/time cost compared to an iterative search baseline (APO).

Numbers: optimization step time, PREFER vs APO: step 1, 132.4 s vs 579.0 s; step 2, 336.1 s vs 2100.4 s

PREFER converges quickly and maintains stable performance after a small number of iterations.

Numbers: performance peaks by optimization steps 2–3 and stays stable (Figure 3)

Results

F1 (QNLI, few-shot)

Value: 0.793 (PREFER)

Baseline: 0.720 (Synonym Ensemble)

F1 (Liar, few-shot)

Value: 0.744 (PREFER)

Baseline: 0.572 (Synonym Ensemble)

Training time, optimization step 1

Value: 132.4 s (PREFER)

Baseline: 579.0 s (APO)

Training time, optimization step 2

Value: 336.1 s (PREFER)

Baseline: 2100.4 s (APO)
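For a quick read of the reported numbers, the absolute F1 gains and per-step speedups can be derived directly from the values above. The snippet uses only figures quoted in this summary; the variable names are illustrative.

```python
# Figures quoted in this summary: (PREFER, baseline).
reported_f1 = {"QNLI": (0.793, 0.720), "Liar": (0.744, 0.572)}  # baseline: synonym ensemble
reported_seconds = {1: (132.4, 579.0), 2: (336.1, 2100.4)}      # step -> (PREFER, APO)

# Absolute F1 improvement per dataset.
f1_gain = {name: round(ours - base, 3) for name, (ours, base) in reported_f1.items()}

# How many times faster PREFER is than APO at each optimization step.
speedup = {step: round(apo / ours, 2) for step, (ours, apo) in reported_seconds.items()}

print(f1_gain)
print(speedup)
```

This works out to gains of 0.073 F1 on QNLI and 0.172 on Liar, and roughly 4.4x and 6.2x shorter optimization steps than APO.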

Who Should Care

What To Try In 7 Days

Run the PREFER repo on a small few-shot classification task (k≈50) and compare F1 to your current single-prompt baseline.

Measure API calls and wall time for 2–3 optimization steps vs your existing prompt-search pipeline.

Inspect LLM 'reflections' to identify common failure modes and hard examples for targeted data or prompt fixes.
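A small harness along these lines can support the comparison. It is a sketch, not part of the PREFER repo: `CallMeter`, `binary_f1`, and `stub_classify` are hypothetical names, the classifier is a stub standing in for a real LLM call, and binary F1 is an assumption, so check which F1 variant your baseline and the repo report.

```python
import time
from typing import Any, Callable, Sequence


class CallMeter:
    """Wrap a callable and record how often it is called and for how long."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self.fn = fn
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return self.fn(*args, **kwargs)
        finally:
            self.calls += 1
            self.seconds += time.perf_counter() - start


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Stub classifier standing in for a single-prompt LLM call.
def stub_classify(text: str) -> int:
    return 1 if "fake" in text else 0


texts = ["fake claim", "true claim", "misleading claim", "fake quote"]
y_true = [1, 0, 1, 1]

meter = CallMeter(stub_classify)
y_pred = [meter(t) for t in texts]
f1 = binary_f1(y_true, y_pred)
print(f"F1={f1:.3f} calls={meter.calls} seconds={meter.seconds:.4f}")
```

Wrapping both your current pipeline and the PREFER run in the same meter keeps the API-call and wall-time comparison on equal footing.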

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on closed LLM APIs (ChatGPT/GPT-4) and their reflection quality.
  • Evaluated mainly on few-shot NLI and classification; generalization to other domains is untested.
  • Still requires multiple model calls; not as cheap as a single prompt.
  • May inherit LLM biases and flawed self-reflections.

When Not To Use

  • You lack access or budget for iterative LLM API calls.
  • Your task is simple and a single high-quality prompt already suffices.
  • You have low-latency production requirements where any ensemble overhead is unacceptable.

Failure Modes

  • LLM produces misleading reflections causing poor prompt generations.
  • Overfitting prompts to the sampled few-shot training examples.
  • Bagging miscalibrates confidence estimates if the LLM’s confidence signal is unreliable.

Core Entities

Models

  • ChatGPT
  • GPT-4
  • APO
  • PromptBoosting

Metrics

  • F1-score
  • API time (s)

Datasets

  • SNLI
  • MNLI
  • QNLI
  • RTE
  • Ethos
  • Liar
  • ArSarcasm

Context Entities

Models

  • LLM (generic)
  • Beam search / Monte Carlo / Self-consistency baselines

Metrics

  • F1
  • runtime seconds
  • API access frequency

Datasets

  • Few-shot splits (k=50 default)