Overview
Production Readiness: 0.6
Novelty Score: 0.6
Cost Impact Score: 0.7
Citation Count: 0
Why It Matters For Business
GLaPE lets teams optimize prompts without costly labels, enabling prompt tuning for private models and new tasks while cutting annotation costs.
Summary TLDR
GLaPE is an unsupervised scoring method for prompt evaluation. It replaces labeled-answer accuracy with two signals: self-consistency (how often a prompt yields the same answer across samples) and mutual-consistency (how well answers agree across different prompts). Using GPT-3.5 and eight reasoning benchmarks, GLaPE finds prompts whose task accuracy matches or closely trails label-based optimization, and it generalizes to several open models. The method struggles when the model gives the same wrong answer across prompts.
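As a minimal sketch of the first signal, self-consistency can be computed as the share of sampled answers that match the most frequent answer. The function name `self_consistency` and the sample answers are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, share of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    majority, freq = counts.most_common(1)[0]
    return majority, freq / len(answers)


# Ten hypothetical sampled answers for one prompt on one question.
samples = ["18", "18", "20", "18", "18", "17", "18", "18", "20", "18"]
answer, sc = self_consistency(samples)
print(answer, sc)
```

A high value only says the prompt answers the same way repeatedly; it says nothing about whether that answer is correct.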
Problem Statement
Prompt-optimization methods that use the LLM as an optimizer rely on gold labels to score candidate prompts. Collecting labels is costly or impossible for private or new tasks. The problem: how to evaluate and optimize prompts without ground-truth answers.
Main Contribution
Define GLaPE, a gold-label-agnostic prompt evaluation method that combines self-consistency with mutual-consistency refinement.
Show GLaPE can drive prompt optimization to reach accuracy similar to label-based methods (OPRO) on 8 reasoning datasets.
Diagnose when self-consistency fails and show mutual-consistency reduces overestimation of wrong-but-consistent prompts.
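This summary does not spell out the refinement objective, so the following is one plausible reading rather than the authors' implementation. It assumes that, for a single question, each prompt's score should stay close to its self-consistency while prompts that produce the same answer are pulled toward a shared score, with `alpha` weighting the two terms. The function name `refine_scores`, the learning rate, and the step count are assumptions.

```python
def refine_scores(answers, sc, alpha=0.5, lr=0.05, steps=2000):
    """Minimize, by plain gradient descent, the assumed objective
        alpha * sum_i (f_i - sc_i)^2
        + (1 - alpha) * sum_{i<j, same answer} (f_i - f_j)^2
    answers[i] is prompt i's majority answer, sc[i] its self-consistency.
    """
    n = len(answers)
    f = list(sc)  # start from the raw self-consistency values
    for _ in range(steps):
        # Gradient of the self-consistency term.
        grad = [2 * alpha * (f[i] - sc[i]) for i in range(n)]
        # Gradient of the mutual-consistency term.
        for i in range(n):
            for j in range(n):
                if i != j and answers[i] == answers[j]:
                    grad[i] += 2 * (1 - alpha) * (f[i] - f[j])
        f = [fi - lr * g for fi, g in zip(f, grad)]
    return f


# Hypothetical example: prompts 0 and 1 agree on "18", prompt 2 answers "20".
scores = refine_scores(["18", "18", "20"], [0.9, 0.5, 0.8])
print([round(s, 3) for s in scores])
```

In this toy case the two agreeing prompts move toward each other, while the prompt with a unique answer keeps its self-consistency as its score.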
Key Findings
GLaPE-guided prompt optimization matches or closely trails label-based optimization on standard reasoning benchmarks.
Self-consistency (SC) alone can overestimate wrong prompts because incorrect answers may be internally consistent.
GLaPE correlates better with true accuracy than SC across datasets.
The method generalizes to other LLMs: GLaPE finds competitive prompts for multiple open models.
Removing questions that no prompt solves raises GLaPE's correlation with accuracy substantially.
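The correlation claims above refer to Spearman rank correlation between a prompt score and true accuracy. A dependency-free sketch of that statistic follows; the score and accuracy values are made up for illustration and are not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical scores and accuracies for four prompts.
prompt_scores = [0.9, 0.7, 0.8, 0.4]
prompt_accuracy = [0.70, 0.75, 0.72, 0.50]
rho = spearman(prompt_scores, prompt_accuracy)
print(round(rho, 3))
```

A value near 1 means the score ranks prompts in almost the same order as accuracy would, which is the property that matters when the score is used to pick the best prompt.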
Results
Figures not reproduced here: accuracy; GLaPE vs SC Spearman correlation (selected); model transfer on GSM8K (Gemma2-9B).
What To Try In 7 Days
Run GLaPE instead of label-based scoring when searching prompts for a private LLM.
Use 10 samples per prompt, temperature 0.7, and alpha=0.5 (paper's settings) as a starting point.
Filter out examples that no prompt solves (producing a cleaned dataset) to improve evaluation reliability.
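The steps above could be wired together roughly as follows. `sample_fn` is a hypothetical callable standing in for whatever client calls your model; its signature is an assumption. Note that deciding whether a prompt "solves" a question needs reference answers, so the filter assumes labels exist at least for a small diagnostic subset.

```python
N_SAMPLES = 10     # samples per prompt (starting point quoted above)
TEMPERATURE = 0.7  # sampling temperature (starting point quoted above)
ALPHA = 0.5        # weight between the two consistency terms


def collect_answers(sample_fn, prompt, question,
                    n=N_SAMPLES, temperature=TEMPERATURE):
    """Draw n answers from a user-supplied sampler.

    sample_fn(prompt, question, temperature) -> answer string (hypothetical).
    """
    return [sample_fn(prompt, question, temperature) for _ in range(n)]


def solvable_questions(correct):
    """correct[p][q] is True when prompt p answers question q correctly.

    Returns the indices of questions that at least one prompt solves.
    """
    n_questions = len(correct[0])
    return [q for q in range(n_questions) if any(row[q] for row in correct)]


# Stub sampler and a toy correctness matrix (2 prompts x 3 questions).
fake_sampler = lambda prompt, question, temperature: "42"
drawn = collect_answers(fake_sampler, "Think step by step.", "6 * 7 = ?")
kept = solvable_questions([[True, False, False],
                           [False, False, True]])
print(len(drawn), kept)
```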
Optimization Features
Training Optimization
- prompt optimization without gradients
- unsupervised prompt scoring
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Fails when the model gives the same wrong answer under all prompts; GLaPE cannot detect such systematic errors.
- Requires many model calls (multiple sampled outputs per prompt), which increases inference cost.
- Evaluation compresses a prompt to a single numeric score, losing rich qualitative feedback.
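To gauge the inference cost noted above, a back-of-the-envelope count of model calls can help. The sketch assumes every candidate prompt is scored on every question in every optimization round; the round and prompt counts in the example are hypothetical, not figures from the paper.

```python
def estimate_calls(n_rounds, prompts_per_round, n_questions, n_samples):
    """Scoring calls only; calls made by the optimizer LLM are not counted."""
    return n_rounds * prompts_per_round * n_questions * n_samples


# Hypothetical budget: 5 rounds, 8 candidate prompts per round,
# 100 questions, 10 samples per prompt and question.
total_calls = estimate_calls(n_rounds=5, prompts_per_round=8,
                             n_questions=100, n_samples=10)
print(total_calls)
```

Multiplying the result by the average tokens per call and the provider's per-token price gives a rough cost ceiling before a run is started.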
When Not To Use
- When the LLM consistently produces the same wrong answer across prompts (StrategyQA-style failures).
- When you have cheap, high-quality gold labels available; supervised scoring is simpler and more direct.
- If inference budget cannot support repeated sampling per prompt.
Failure Modes
- Selecting confidently wrong prompts because all prompts reinforce the same incorrect output.
- Low correlation on datasets where correct and incorrect answers have similar self-consistency.
- Overfitting to the sampled subset if the training set or the sample count is too small.
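One way to watch for the first failure mode is to flag questions on which every candidate prompt gives the same majority answer. This diagnostic is an assumption of this summary, not part of GLaPE itself: unanimity does not prove an answer wrong, it only marks questions where consistency signals cannot distinguish between prompts and a manual spot check may be worthwhile.

```python
def unanimous_questions(majority_answers):
    """majority_answers[p][q] is prompt p's majority answer on question q.

    Returns the indices of questions where all prompts agree.
    """
    n_questions = len(majority_answers[0])
    return [
        q for q in range(n_questions)
        if len({row[q] for row in majority_answers}) == 1
    ]


# Toy example: two prompts, three questions.
flagged = unanimous_questions([["yes", "no", "yes"],
                               ["yes", "yes", "yes"]])
print(flagged)
```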
Core Entities
Models
- GPT-3.5-turbo-0613
- Mistral-7B
- Llama3-8B
- Gemma2-9B
Metrics
- Accuracy
- Self-consistency (SC)
- GLaPE score
- Spearman correlation
Datasets
- GSM8K
- AddSub
- AQuA
- MultiArith
- SVAMP
- MATH
- Big-Bench Date
- StrategyQA
Benchmarks
- arithmetic reasoning
- commonsense reasoning
- mathematical reasoning

