Overview
PREFER is a practical, implementable method. It uses off-the-shelf LLMs and, in the reported experiments, shows measurable few-shot gains at lower API and time cost. It depends on LLM API access and was tested mainly on NLI and classification tasks.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
PREFER automates prompt generation and ensemble weighting, yielding better few-shot accuracy and much lower API/time cost than heavy prompt-search methods.
Summary TLDR
PREFER is an automatic prompt-ensemble method. It iteratively asks a large language model (LLM) to reflect on the examples it got wrong, generates improved prompts from that feedback, and ensembles the prompts with a bilateral bagging step that combines forward and backward confidence. On few-shot classification and NLI tasks, PREFER raises F1 over single prompts and prior prompt-ensemble/iterative methods (for example, 0.793 vs 0.720 for Synonym Ensemble on QNLI), while requiring fewer API calls and converging in 2–3 iterations. Code is released.
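The reflect-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the `llm` callable, the `refine_prompts` helper, and the wording of every prompt template are assumptions made for the example, and the sketch leaves out the ensemble weighting.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]   # (input text, gold label)
LLM = Callable[[str], str]  # hypothetical: takes a prompt string, returns text


def refine_prompts(llm: LLM, seed_prompt: str, train: List[Example],
                   iterations: int = 3) -> List[str]:
    """Collect up to `iterations` prompts, each refined from feedback on errors."""
    prompts = [seed_prompt]
    for _ in range(iterations - 1):
        current = prompts[-1]
        # 1. Find the examples the current prompt gets wrong.
        wrong = [(x, y) for x, y in train
                 if llm(f"{current}\nInput: {x}\nAnswer:").strip() != y]
        if not wrong:
            break  # nothing left to reflect on
        # 2. Ask the LLM for textual feedback on those errors (assumed template).
        cases = "\n".join(f"- input: {x} | expected: {y}" for x, y in wrong)
        feedback = llm("The prompt below failed on these cases.\n"
                       f"Prompt: {current}\nCases:\n{cases}\n"
                       "Explain why it failed.")
        # 3. Ask for a revised prompt that addresses the feedback (assumed template).
        new_prompt = llm("Rewrite the prompt to fix these issues.\n"
                         f"Prompt: {current}\nIssues: {feedback}\n"
                         "New prompt:")
        prompts.append(new_prompt.strip())
    return prompts
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, so it can be exercised with a stub before spending real API calls.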
Problem Statement
Existing prompt-ensemble methods require a pre-prepared set of prompts or expensive search and do not jointly optimize prompts. This raises manual effort, instability, and extra API cost. PREFER aims to automate prompt creation and joint optimization using LLM-generated feedback so ensembles focus on hard examples.
Main Contribution
Feedback-reflect-refine loop: use LLM-written textual feedback on wrong examples to generate new prompts automatically.
Bilateral prompt bagging: combine forward (is-this-answer) and backward (is-this-answer-excluded) confidence to reduce overconfidence and stabilize outputs.
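One way to picture the bilateral step is to score each candidate label twice and combine the two confidences before voting across prompts. The sketch below is an assumption-laden illustration: the `Scorer` callable, the subtraction rule in `bilateral_score`, and the per-prompt `weights` are hypothetical choices for the example, not the combination rule taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical scorer: (prompt, text, label, direction) -> confidence in [0, 1].
# "forward" asks whether the label is the answer;
# "backward" asks whether the label should be excluded.
Scorer = Callable[[str, str, str, str], float]


def bilateral_score(score: Scorer, prompt: str, text: str, label: str) -> float:
    forward = score(prompt, text, label, "forward")
    backward = score(prompt, text, label, "backward")
    # Assumed combination rule: reward support, penalise exclusion.
    return forward - backward


def ensemble_predict(score: Scorer, prompts: List[str], weights: List[float],
                     text: str, labels: List[str]) -> str:
    """Weighted vote over prompts using the combined forward/backward score."""
    totals: Dict[str, float] = {
        label: sum(w * bilateral_score(score, p, text, label)
                   for p, w in zip(prompts, weights))
        for label in labels
    }
    return max(totals, key=totals.get)
```

Under this rule, a label with high forward confidence can still lose if the model is also fairly confident it should be excluded, which is the overconfidence check the backward pass is meant to provide.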
Key Findings
PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.
Directed feedback is critical: removing the feedback-reflect-refine step causes large accuracy drops on multiple datasets.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| F1 (QNLI, few-shot) | 0.793 (PREFER) | 0.720 (Synonym Ensemble) | +0.073 absolute | QNLI (few-shot, k=50) | Table 1 shows PREFER 0.793 vs Synonym Ensemble 0.720 | Table 1 |
| F1 (Liar, few-shot) | 0.744 (PREFER) | 0.572 (Synonym Ensemble) | +0.172 absolute (≈30% relative to this baseline; paper reports up to 13.1% relative) | Liar (few-shot, k=50) | Table 1 lists 0.744 vs 0.572 | Table 1 |
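The Delta column is an absolute F1 difference. The relative gain over each row's baseline follows from the same two numbers, as this small check shows (the `deltas` helper is introduced here only for illustration):

```python
def deltas(value: float, baseline: float) -> tuple:
    """Return (absolute, relative) improvement, rounded to three decimals."""
    absolute = value - baseline
    relative = absolute / baseline
    return round(absolute, 3), round(relative, 3)


print(deltas(0.793, 0.720))  # QNLI: (0.073, 0.101)
print(deltas(0.744, 0.572))  # Liar: (0.172, 0.301)
```

Relative figures depend on which baseline is used as the denominator, so a relative gain quoted against a different baseline will not match these values.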
What To Try In 7 Days
Run the PREFER repo on a small few-shot classification task (k≈50) and compare F1 to your current single-prompt baseline.
Measure API calls and wall time for 2–3 optimization steps vs your existing prompt-search pipeline.
Inspect LLM 'reflections' to identify common failure modes and hard examples for targeted data or prompt fixes.
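For the cost comparison, a thin wrapper around whatever function issues the LLM request can count calls and accumulate wall time. This is a sketch under assumptions: `CountingLLM` is a hypothetical helper, and the wrapped `llm` stands in for your own client function.

```python
import time
from typing import Callable


class CountingLLM:
    """Wraps an LLM callable and records call count and total wall time."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        try:
            return self.llm(prompt)
        finally:
            # Count the call and its duration even if the request raised.
            self.calls += 1
            self.seconds += time.perf_counter() - start
```

Wrap the same client once for each pipeline under comparison, run 2–3 optimization steps, and read `calls` and `seconds` afterwards.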
Risks & Boundaries
Limitations
Relies on closed LLM APIs (ChatGPT/GPT-4) and their reflection quality.
Evaluated mainly on few-shot NLI and classification; generalization to other domains is untested.
When Not To Use
You lack access to, or budget for, iterative LLM API calls.
Your task is simple and a single high-quality prompt already suffices.
Failure Modes
The LLM produces misleading reflections, which lead to poorly generated prompts.
Overfitting prompts to the sampled few-shot training examples.

