PREFER: automatically grow and weight prompts by having an LLM reflect on its errors and refine prompts

August 23, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

PREFER is a practical, implementable method: it uses off-the-shelf LLMs, shows measurable few-shot gains and lower API/time cost in experiments, but depends on LLM API access and was tested mainly on NLI/classification.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Chenrui Zhang, Lin Liu, Jinpeng Wang, Chuyuan Wang, Xiao Sun, Hongyu Wang, Mingchen Cai

Why It Matters For Business

PREFER automates prompt generation and ensemble weighting, yielding better few-shot accuracy and much lower API/time cost than heavy prompt-search methods.

Summary TLDR

PREFER is an automatic prompt-ensemble method that iteratively asks a large language model (LLM) to reflect on examples it got wrong, generate improved prompts, and ensemble those prompts with a bilateral bagging step (forward+reverse confidence). On few-shot classification and NLI tasks PREFER raises F1 noticeably over single prompts and prior prompt-ensemble/iterative methods, while requiring fewer API calls and converging in 2–3 iterations. Code is released.
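The reflect-and-refine loop described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `refine_loop`, `predict`, `reflect`, and `refine` are hypothetical names, the toy functions stand in for real LLM calls, and the bagging and weighting steps are left out.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def refine_loop(
    initial_prompt: str,
    examples: List[Example],
    predict: Callable[[str, str], str],
    reflect: Callable[[str, List[Example]], str],
    refine: Callable[[str, str], str],
    max_iters: int = 3,
) -> List[str]:
    """Grow a list of prompts by reflecting on the newest prompt's errors."""
    prompts = [initial_prompt]
    for _ in range(max_iters):
        current = prompts[-1]
        # Collect the examples the current prompt gets wrong.
        wrong = [(x, y) for x, y in examples if predict(current, x) != y]
        if not wrong:
            break  # nothing left to learn from on this sample
        feedback = reflect(current, wrong)         # textual critique of the errors
        prompts.append(refine(current, feedback))  # new prompt built from the critique
    return prompts


# Toy stand-ins for LLM calls (hypothetical, for illustration only).
def toy_predict(prompt: str, text: str) -> str:
    # Handles negation only once the prompt mentions it.
    if "negation" in prompt and "not" in text.split():
        return "neg"
    return "pos"


def toy_reflect(prompt: str, wrong: List[Example]) -> str:
    return "negation" if any("not" in x.split() for x, _ in wrong) else "unknown"


def toy_refine(prompt: str, feedback: str) -> str:
    return f"{prompt} Pay attention to {feedback}."


data = [("good movie", "pos"), ("not good", "neg")]
prompts = refine_loop("Classify the sentiment.", data, toy_predict, toy_reflect, toy_refine)
print(prompts)
```

In this toy run the loop stops after one refinement because the second prompt makes no errors on the sample; with a real LLM, the `max_iters` cap plays the role of the 2–3 iterations mentioned above.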

Problem Statement

Existing prompt-ensemble methods require a pre-prepared set of prompts or expensive search and do not jointly optimize prompts. This raises manual effort, instability, and extra API cost. PREFER aims to automate prompt creation and joint optimization using LLM-generated feedback so ensembles focus on hard examples.

Main Contribution

Feedback-reflect-refine loop: use LLM-written textual feedback on wrong examples to generate new prompts automatically.

Bilateral prompt bagging: combine forward (is-this-answer) and backward (is-this-answer-excluded) confidence to reduce overconfidence and stabilize outputs.
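One way to picture the bilateral step is the sketch below. The combination rule is an assumption: this summary does not say how PREFER merges the two signals, so the code simply subtracts the backward ("is this answer excluded?") confidence from the forward ("is this the answer?") confidence and then takes a weighted vote across prompts. The function names, scores, and weights are all hypothetical.

```python
from typing import Dict, List


def bilateral_scores(forward: Dict[str, float], backward: Dict[str, float]) -> Dict[str, float]:
    """Assumed rule: forward confidence minus exclusion confidence, per label."""
    return {label: forward[label] - backward.get(label, 0.0) for label in forward}


def ensemble_predict(per_prompt_scores: List[Dict[str, float]], weights: List[float]) -> str:
    """Weighted sum of per-prompt scores; return the highest-scoring label."""
    totals: Dict[str, float] = {}
    for scores, weight in zip(per_prompt_scores, weights):
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + weight * score
    return max(totals, key=lambda label: totals[label])


# Made-up confidences for two prompts on one example.
forward_a = {"entailment": 0.9, "not_entailment": 0.6}
backward_a = {"entailment": 0.7, "not_entailment": 0.1}
forward_b = {"entailment": 0.4, "not_entailment": 0.8}
backward_b = {"entailment": 0.5, "not_entailment": 0.2}

scores = [bilateral_scores(forward_a, backward_a), bilateral_scores(forward_b, backward_b)]
prediction = ensemble_predict(scores, weights=[0.4, 0.6])
print(prediction)
```

With these numbers, prompt A's forward pass alone would pick "entailment", but its high exclusion confidence for that label flips the combined score, which is the kind of overconfidence correction the bilateral step is meant to provide.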

Key Findings

PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.

Numbers: QNLI 0.793 vs 0.720 (synonym ensemble); Liar 0.744 vs 0.572 (synonym ensemble)

Practical Use: Use PREFER instead of a single prompt for few-shot NLI/classification to get consistent F1 gains on evaluated datasets.

Evidence Ref: Table 1

Directed feedback is critical: removing the feedback-reflect-refine step causes large accuracy drops on multiple datasets.

Numbers: SNLI 0.647 → 0.580; Ethos 0.963 → 0.812 after removing feedback

Practical Use: Keep the LLM feedback/reflection loop when building prompt ensembles; naive prompt rewriting is much weaker.

Evidence Ref: Table 2 (ablation)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
F1 (QNLI, few-shot) | 0.793 (PREFER) | 0.720 (Synonym Ensemble) | +0.073 absolute | QNLI (few-shot, k=50) | Table 1 shows PREFER 0.793 vs Synonym Ensemble 0.720 | Table 1
F1 (Liar, few-shot) | 0.744 (PREFER) | 0.572 (Synonym Ensemble) | +0.172 absolute (paper reports up to 13.1% relative) | Liar (few-shot, k=50) | Table 1 lists 0.744 vs 0.572 | Table 1
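The deltas in these rows can be rechecked with a few lines of arithmetic. The sketch below uses only the F1 values quoted from Table 1; the helper name `deltas` is hypothetical, and the relative figure is computed against the Synonym Ensemble baseline in each row.

```python
from typing import Dict, Tuple

# (PREFER F1, Synonym Ensemble F1) as quoted from Table 1.
rows: Dict[str, Tuple[float, float]] = {
    "QNLI": (0.793, 0.720),
    "Liar": (0.744, 0.572),
}


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta, delta relative to the baseline)."""
    absolute = value - baseline
    return absolute, absolute / baseline


for name, (value, baseline) in rows.items():
    abs_d, rel_d = deltas(value, baseline)
    print(f"{name}: {abs_d:+.3f} absolute, {rel_d:+.1%} relative")
```

This prints +0.073 absolute (+10.1% relative) for QNLI and +0.172 absolute (+30.1% relative) for Liar. The relative gain against Synonym Ensemble on Liar therefore differs from the "up to 13.1%" figure attributed to the paper, which may refer to a different baseline or comparison; this summary does not say which.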

What To Try In 7 Days

Run the PREFER repo on a small few-shot classification task (k≈50) and compare F1 to your current single-prompt baseline.

Measure API calls and wall time for 2–3 optimization steps vs your existing prompt-search pipeline.

Inspect LLM 'reflections' to identify common failure modes and hard examples for targeted data or prompt fixes.
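For the first two steps above, a small harness along these lines may be enough. It is a sketch under assumptions: `CallMeter`, `binary_f1`, and `fake_llm` are hypothetical names, `fake_llm` stands in for whatever API client you use, and single-class F1 is used here for simplicity (the F1 variant used in the paper is not stated in this summary).

```python
import time
from typing import Any, Callable, List


class CallMeter:
    """Wrap a callable and record how often it is called and for how long."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self.fn = fn
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return self.fn(*args, **kwargs)
        finally:
            # Count the call and its duration even if it raised.
            self.calls += 1
            self.seconds += time.perf_counter() - start


def binary_f1(gold: List[str], pred: List[str], positive: str) -> float:
    """F1 for one class treated as positive; 0.0 when there are no true positives."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return "entailment"


metered = CallMeter(fake_llm)
outputs = [metered(f"Premise {i} ...") for i in range(5)]
print(metered.calls, round(metered.seconds, 4))

gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
print(round(binary_f1(gold, pred, positive="a"), 3))
```

Wrapping both your current pipeline's client and the PREFER run in the same meter gives directly comparable call counts and wall-clock totals.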

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on closed LLM APIs (ChatGPT/GPT-4) and their reflection quality.

Evaluated mainly on few-shot NLI and classification; generalization to other domains is untested.

When Not To Use

You lack access or budget for iterative LLM API calls.

Your task is simple and a single high-quality prompt already suffices.

Failure Modes

LLM produces misleading reflections causing poor prompt generations.

Overfitting prompts to the sampled few-shot training examples.

Core Entities

Models

ChatGPT, GPT-4, APO, PromptBoosting

Metrics

F1-score, API time (s)

Datasets

SNLI, MNLI, QNLI, RTE, Ethos, Liar, ArSarcasm

Context Entities

Models

LLM (generic), Beam search / Monte Carlo / Self-consistency baselines

Metrics

F1, runtime seconds, API access frequency

Datasets

Few-shot splits (k=50 default)