PREFER: automatically grow and weight prompts by having an LLM reflect on its errors and refine prompts

Overview

Decision Snapshot: Needs Validation

PREFER is a practical, implementable method: it uses off-the-shelf LLMs, shows measurable few-shot gains and lower API/time cost in experiments, but depends on LLM API access and was tested mainly on NLI/classification.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Chenrui Zhang, Lin Liu, Jinpeng Wang, Chuyuan Wang, Xiao Sun, Hongyu Wang, Mingchen Cai

Links

Abstract / PDF / Code

Why It Matters For Business

PREFER automates prompt generation and ensemble weighting, yielding better few-shot accuracy and much lower API/time cost than heavy prompt-search methods.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

PREFER is an automatic prompt-ensemble method that iteratively asks a large language model (LLM) to reflect on examples it got wrong, generate improved prompts, and ensemble those prompts with a bilateral bagging step (forward + reverse confidence). On few-shot classification and NLI tasks, PREFER raises F1 over single prompts and prior prompt-ensemble/iterative methods (for example, 0.793 vs 0.720 for a synonym ensemble on QNLI), while requiring fewer API calls and converging in 2–3 iterations. Code is released.
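The reflect-and-refine cycle described above can be sketched as a short loop. This is a minimal illustration, not the released implementation: the `refine_prompts` function, the `REFLECT`/`REFINE` request format, and the `toy_llm` stand-in are all hypothetical names introduced here, and a real run would replace `toy_llm` with calls to an actual LLM API.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]   # (input text, gold label)
LLM = Callable[[str], str]  # hypothetical interface: request text in, reply text out


def refine_prompts(llm: LLM, seed_prompt: str, examples: List[Example],
                   iterations: int = 3) -> List[str]:
    """Grow a pool of prompts by reflecting on misclassified examples."""
    prompts = [seed_prompt]
    for _ in range(iterations):
        current = prompts[-1]
        # 1. Collect the examples the current prompt gets wrong.
        errors = [(x, y) for x, y in examples
                  if llm(f"{current}\nInput: {x}\nLabel:").strip() != y]
        if not errors:
            break  # nothing left to reflect on
        # 2. Ask the model for textual feedback on its mistakes.
        shown = "\n".join(f"- {x} (expected {y})" for x, y in errors)
        feedback = llm(f"REFLECT\nPrompt: {current}\nErrors:\n{shown}")
        # 3. Ask for a refined prompt conditioned on that feedback.
        prompts.append(llm(f"REFINE\nPrompt: {current}\nFeedback: {feedback}"))
    return prompts


def toy_llm(request: str) -> str:
    """Deterministic stand-in for an LLM API, for demonstration only."""
    if request.startswith("REFLECT"):
        return "The prompt ignores negation."
    if request.startswith("REFINE"):
        return "Classify sentiment; pay attention to negation."
    # Classification request: only a prompt mentioning negation handles "not".
    handles_negation = "negation" in request.split("\n")[0]
    text = request.split("Input: ")[1].split("\n")[0]
    if "not" in text and handles_negation:
        return "negative"
    return "positive" if "good" in text else "negative"


examples = [("good movie", "positive"),
            ("not good", "negative"),
            ("bad plot", "negative")]
pool = refine_prompts(toy_llm, "Classify sentiment.", examples)
```

With the toy model, the seed prompt misclassifies "not good", one reflection round produces a negation-aware prompt, and the loop stops once no errors remain. The pool of prompts it returns is what an ensembling step would then combine.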

Problem Statement

Existing prompt-ensemble methods require a pre-prepared set of prompts or expensive search and do not jointly optimize prompts. This raises manual effort, instability, and extra API cost. PREFER aims to automate prompt creation and joint optimization using LLM-generated feedback so ensembles focus on hard examples.

Main Contribution

Feedback-reflect-refine loop: use LLM-written textual feedback on wrong examples to generate new prompts automatically.

Bilateral prompt bagging: combine forward (is-this-answer) and backward (is-this-answer-excluded) confidence to reduce overconfidence and stabilize outputs.
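To make the forward/backward idea concrete, here is a small sketch of combining the two confidences per candidate label. The subtraction rule, the function names, and the example confidence values are assumptions chosen for illustration; the summary above does not specify the exact combination PREFER uses.

```python
from typing import Dict


def bilateral_score(forward: Dict[str, float],
                    backward: Dict[str, float]) -> Dict[str, float]:
    """Combine per-label confidences.

    forward[c]:  confidence that label c IS the answer.
    backward[c]: confidence that label c should be EXCLUDED.
    Subtracting backward from forward is an illustrative assumption.
    """
    return {c: forward[c] - backward.get(c, 0.0) for c in forward}


def bilateral_predict(forward: Dict[str, float],
                      backward: Dict[str, float]) -> str:
    """Return the label with the highest combined score."""
    scores = bilateral_score(forward, backward)
    return max(scores, key=scores.get)


# Made-up confidences: the forward pass alone is overconfident in "entailment",
# but the backward pass also rates "entailment" as likely excluded.
forward = {"entailment": 0.9, "neutral": 0.8, "contradiction": 0.1}
backward = {"entailment": 0.6, "neutral": 0.1, "contradiction": 0.9}

forward_only = max(forward, key=forward.get)
prediction = bilateral_predict(forward, backward)
```

In this toy case the forward-only choice is "entailment", while the combined score prefers "neutral", which is the kind of overconfidence correction the bilateral step is meant to provide.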

Key Findings

PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.

Numbers: QNLI: 0.793 vs 0.720 (synonym ensemble); Liar: 0.744 vs 0.572 (synonym ensemble)

Practical Use: Use PREFER instead of a single prompt for few-shot NLI/classification to get consistent F1 gains on evaluated datasets.

Evidence Ref: Table 1

Directed feedback is critical: removing the feedback-reflect-refine step causes large accuracy drops on multiple datasets.

Numbers: SNLI 0.647→0.580; Ethos 0.963→0.812 after removing feedback

Practical Use: Keep the LLM feedback/reflection loop when building prompt ensembles; naive prompt rewriting is much weaker.

Evidence Ref: Table 2 (ablation)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
F1 (QNLI, few-shot)	0.793 (PREFER)	0.720 (Synonym Ensemble)	+0.073 absolute	QNLI (few-shot, k=50)	Table 1 shows PREFER 0.793 vs Synonym Ensemble 0.720	Table 1
F1 (Liar, few-shot)	0.744 (PREFER)	0.572 (Synonym Ensemble)	+0.172 absolute (paper reports up to 13.1% relative)	Liar (few-shot, k=50)	Table 1 lists 0.744 vs 0.572	Table 1
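The Delta column can be checked with a few lines of arithmetic. The sketch below recomputes the absolute and relative change for the two rows above; the `deltas` helper is a name introduced here. The relative figure is computed only against the Synonym Ensemble baseline in each row, so it need not match relative figures the paper quotes against other baselines.

```python
from typing import Tuple


def deltas(new: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute change, relative change) of new over baseline."""
    absolute = new - baseline
    relative = absolute / baseline
    return absolute, relative


# (PREFER F1, Synonym Ensemble F1) as listed in the table above.
results = {"QNLI": (0.793, 0.720), "Liar": (0.744, 0.572)}

for name, (prefer_f1, base_f1) in results.items():
    absolute, relative = deltas(prefer_f1, base_f1)
    print(f"{name}: +{absolute:.3f} absolute, {relative:.1%} relative")
# QNLI: +0.073 absolute, 10.1% relative
# Liar: +0.172 absolute, 30.1% relative
```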

What To Try In 7 Days

Run the PREFER repo on a small few-shot classification task (k≈50) and compare F1 to your current single-prompt baseline.

Measure API calls and wall time for 2–3 optimization steps vs your existing prompt-search pipeline.

Inspect LLM 'reflections' to identify common failure modes and hard examples for targeted data or prompt fixes.
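For the comparison steps above, a small harness that counts API calls, accumulates wall time, and scores predictions may be enough. This is a sketch under stated assumptions: `CountingLLM` and `macro_f1` are hypothetical helpers, the wrapped client is any callable taking a request string, and macro averaging is an assumption since the summary does not say how F1 is averaged.

```python
import time
from typing import Callable, List


class CountingLLM:
    """Wraps any callable LLM client and records call count and latency."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, request: str) -> str:
        start = time.perf_counter()
        try:
            return self.llm(request)
        finally:
            # Count the call even if the client raises.
            self.calls += 1
            self.seconds += time.perf_counter() - start


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Demonstration with a constant stand-in client; swap in a real API call.
llm = CountingLLM(lambda request: "positive")
gold = ["positive", "negative", "positive", "negative"]
pred = [llm(f"Input {i}") for i in range(len(gold))]
score = macro_f1(gold, pred)
print(llm.calls, round(score, 3))
```

Wrapping both your existing pipeline and the PREFER run in the same counter keeps the call and time comparison like for like.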

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zcrwind/PREFER

Risks & Boundaries

Limitations

Relies on closed LLM APIs (ChatGPT/GPT-4) and their reflection quality.

Evaluated mainly on few-shot NLI and classification; generalization to other domains is untested.

When Not To Use

You lack access or budget for iterative LLM API calls.

Your task is simple and a single high-quality prompt already suffices.

Failure Modes

LLM produces misleading reflections causing poor prompt generations.

Overfitting prompts to the sampled few-shot training examples.

Core Entities

Models

ChatGPT, GPT-4, APO, PromptBoosting

Metrics

F1-score, API time (s)

Datasets

SNLI, MNLI, QNLI, RTE, Ethos, Liar, ArSarcasm

Context Entities

Models

LLM (generic), Beam search / Monte Carlo / Self-consistency baselines

Metrics

F1, runtime seconds, API access frequency

Datasets

Few-shot splits (k=50 default)

