Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
6
Why It Matters For Business
PREFER automates prompt generation and ensemble weighting, yielding higher few-shot F1 at lower API and wall-clock cost than search-heavy prompt-optimization methods such as APO.
Summary TLDR
PREFER is an automatic prompt-ensemble method that iteratively asks a large language model (LLM) to reflect on examples it got wrong, generate improved prompts, and ensemble those prompts with a bilateral bagging step that combines forward and backward confidence. On few-shot classification and NLI tasks, PREFER raises F1 over single prompts and over prior prompt-ensemble and iterative methods, while requiring fewer API calls and converging in 2–3 iterations. Code is released.
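The iterative loop described above can be sketched as follows. This is a minimal illustration, not the released implementation: `prefer_loop`, `predict`, `reflect`, and `refine` are hypothetical names, and the three `toy_*` functions are deterministic stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)

def prefer_loop(
    initial_prompt: str,
    examples: List[Example],
    predict: Callable[[str, str], str],            # (prompt, text) -> label
    reflect: Callable[[str, List[Example]], str],  # (prompt, errors) -> feedback
    refine: Callable[[str, str], str],             # (prompt, feedback) -> new prompt
    max_iters: int = 3,
) -> List[str]:
    """Grow a prompt set by reflecting on misclassified examples (hypothetical sketch)."""
    prompts = [initial_prompt]
    current = initial_prompt
    for _ in range(max_iters):
        # Collect the examples the current prompt gets wrong.
        errors = [(x, y) for x, y in examples if predict(current, x) != y]
        if not errors:
            break  # nothing left to reflect on
        feedback = reflect(current, errors)
        current = refine(current, feedback)
        prompts.append(current)
    return prompts

# Toy stand-ins for LLM calls (assumptions, for illustration only).
def toy_predict(prompt: str, text: str) -> str:
    hints = prompt.split("|")[1:]  # hint words are stored after "|" separators
    return "pos" if any(h in text for h in hints) else "neg"

def toy_reflect(prompt: str, errors: List[Example]) -> str:
    return errors[0][0].split()[0]  # "feedback": first word of the first missed example

def toy_refine(prompt: str, feedback: str) -> str:
    return prompt + "|" + feedback

data = [("great movie", "pos"), ("awful plot", "neg"), ("lovely cast", "pos")]
prompts = prefer_loop("Classify sentiment", data, toy_predict, toy_reflect, toy_refine)
print(prompts)
```

In this toy run the loop stops once the refined prompt classifies every example correctly, which mirrors the small number of iterations reported in the summary.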
Problem Statement
Existing prompt-ensemble methods require a pre-prepared set of prompts or an expensive search, and they do not jointly optimize prompts. This increases manual effort, instability, and API cost. PREFER aims to automate prompt creation and joint optimization using LLM-generated feedback so ensembles focus on hard examples.
Main Contribution
- Feedback-reflect-refine loop: use LLM-written textual feedback on wrong examples to generate new prompts automatically.
- Bilateral prompt bagging: combine forward (is-this-answer) and backward (is-this-answer-excluded) confidence to reduce overconfidence and stabilize outputs.
- A full algorithm that iteratively grows a weighted prompt set (boosting-style), achieving higher few-shot F1 at lower API cost than baselines.
- Empirical ablations showing that feedback is the most important component and that bilateral bagging beats majority voting.
- Public code release to reproduce experiments.
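To make the bagging and weighting ideas concrete, here is a hedged sketch. The summary does not state the exact scoring or weighting formulas, so both are assumptions: forward and backward confidence are combined by simple subtraction, and prompt weights use the classic AdaBoost form. All function names are hypothetical.

```python
import math
from typing import Dict, List

def bilateral_score(forward: Dict[str, float], backward: Dict[str, float]) -> Dict[str, float]:
    """Combine two confidence views per label.

    forward[label]:  confidence that the label IS the answer.
    backward[label]: confidence that the label is EXCLUDED.
    Subtraction is an assumed combination rule, not taken from the source.
    """
    return {label: forward[label] - backward[label] for label in forward}

def prompt_weight(error_rate: float, eps: float = 1e-6) -> float:
    """AdaBoost-style weight (assumption): lower error gives a larger weight."""
    e = min(max(error_rate, eps), 1.0 - eps)  # clamp to avoid log(0) and division by zero
    return 0.5 * math.log((1.0 - e) / e)

def ensemble_predict(per_prompt_scores: List[Dict[str, float]], weights: List[float]) -> str:
    """Weighted sum of per-prompt label scores; returns the top-scoring label."""
    totals: Dict[str, float] = {}
    for scores, w in zip(per_prompt_scores, weights):
        for label, s in scores.items():
            totals[label] = totals.get(label, 0.0) + w * s
    return max(totals, key=totals.get)

# Two prompts scoring the same input with made-up confidences.
scores_1 = bilateral_score({"a": 0.9, "b": 0.8}, {"a": 0.1, "b": 0.7})
scores_2 = bilateral_score({"a": 0.5, "b": 0.7}, {"a": 0.3, "b": 0.1})
weights = [prompt_weight(0.2), prompt_weight(0.4)]
label = ensemble_predict([scores_1, scores_2], weights)
print(label)
```

Note how the backward view penalizes label "b" under the first prompt: its forward confidence is high, but so is the confidence that it should be excluded, which is the kind of overconfidence the bilateral step is meant to dampen.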
Key Findings
- PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.
- Directed feedback is critical: removing the feedback-reflect-refine step causes large accuracy drops on multiple datasets.
- Bilateral bagging improves stability and reduces API/time cost compared to an iterative search baseline (APO).
- PREFER converges quickly and maintains stable performance after a small number of iterations.
Results
F1 (QNLI, few-shot)
F1 (Liar, few-shot)
Training time, optimization step 1
Training time, optimization step 2
Who Should Care
What To Try In 7 Days
- Run the PREFER repo on a small few-shot classification task (k≈50) and compare F1 to your current single-prompt baseline.
- Measure API calls and wall time for 2–3 optimization steps vs your existing prompt-search pipeline.
- Inspect LLM 'reflections' to identify common failure modes and hard examples for targeted data or prompt fixes.
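One way to set up the comparison above is a small metering wrapper plus an F1 helper. This is a sketch under stated assumptions: `CallMeter` and `macro_f1` are hypothetical helpers, `fake_llm` stands in for a real API client, and macro-averaging is an assumption since the summary says only "F1".

```python
import time
from typing import Callable, List

class CallMeter:
    """Wrap any callable to count invocations and accumulate wall time."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return self.fn(*args, **kwargs)
        finally:
            # Count the call even if the wrapped function raises.
            self.calls += 1
            self.seconds += time.perf_counter() - start

def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Stand-in for a real LLM API call (assumption, for illustration only).
def fake_llm(prompt: str) -> str:
    return "b" if "?" in prompt else "a"

metered = CallMeter(fake_llm)
gold = ["a", "a", "b", "b"]
pred = [metered(p) for p in ["x", "y?", "z?", "w?"]]
score = macro_f1(gold, pred)
print(metered.calls, round(metered.seconds, 6), score)
```

Wrapping both your existing pipeline's client and the one used for the PREFER run in the same meter gives call counts and timings that are directly comparable.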
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on closed LLM APIs (ChatGPT/GPT-4) and their reflection quality.
- Evaluated mainly on few-shot NLI and classification; generalization to other domains is untested.
- Still requires multiple model calls; not as cheap as a single prompt.
- May inherit LLM biases and flawed self-reflections.
When Not To Use
- You lack access or budget for iterative LLM API calls.
- Your task is simple and a single high-quality prompt already suffices.
- Low-latency production needs where any ensemble overhead is unacceptable.
Failure Modes
- LLM produces misleading reflections causing poor prompt generations.
- Overfitting prompts to the sampled few-shot training examples.
- Bagging miscalibrates confidence estimates if the LLM’s confidence signal is unreliable.
Core Entities
Models
- ChatGPT
- GPT-4
- APO
- PromptBoosting
Metrics
- F1-score
- API time (s)
Datasets
- SNLI
- MNLI
- QNLI
- RTE
- Ethos
- Liar
- ArSarcasm
Context Entities
Models
- LLM (generic)
- Beam search / Monte Carlo / Self-consistency baselines
Metrics
- F1
- runtime seconds
- API access frequency
Datasets
- Few-shot splits (k=50 default)

