Overview
The paper reports competitive results across 8 benchmarks and several models, but performance depends on model capability and dataset cleanliness.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
GLaPE lets teams optimize prompts without costly labels, enabling prompt tuning for private models and new tasks while cutting annotation costs.
Summary TLDR
GLaPE is an unsupervised scoring method for prompt evaluation that replaces labeled-answer accuracy with two signals: self-consistency (how often repeated samples from one prompt yield the same answer) and mutual-consistency (how well answers agree across different prompts). Using GPT-3.5 and eight reasoning benchmarks, GLaPE finds prompts whose task accuracy matches or closely trails label-based optimization, and it generalizes to several open models. The method struggles when the model gives the same wrong answer across prompts.
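As a minimal sketch of the self-consistency signal, assuming the answers for one prompt on one question have already been sampled and normalized to strings (the function name `self_consistency` is illustrative and not taken from the paper):

```python
from collections import Counter

def self_consistency(answers):
    """Return (majority_answer, fraction of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) gives the single most frequent answer and its count.
    answer, freq = Counter(answers).most_common(1)[0]
    return answer, freq / len(answers)

# Hypothetical samples: three of four agree on "8".
samples = ["8", "8", "7", "8"]
print(self_consistency(samples))
```

A score near 1.0 only says the prompt is stable on this question; it does not say the majority answer is correct.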
Problem Statement
Prompt-optimization methods that use the LLM as an optimizer rely on gold labels to score candidate prompts. Collecting labels is costly or impossible for private or new tasks. The problem: how to evaluate and optimize prompts without ground-truth answers.
Main Contribution
Define GLaPE, a gold-label-agnostic prompt evaluation method that combines self-consistency with mutual-consistency refinement.
Show GLaPE can drive prompt optimization to reach accuracy similar to label-based methods (OPRO) on 8 reasoning datasets.
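One way to picture the mutual-consistency refinement is as a small optimization: keep each prompt's score close to its self-consistency while pulling together the scores of prompts that produce the same answer. The sketch below assumes a quadratic loss weighted by `alpha` and solves it with plain gradient descent. The exact loss form, the function name `refine_scores`, and the `lr` and `steps` values are assumptions for illustration, not the paper's confirmed formulation.

```python
def refine_scores(answers, sc, alpha=0.5, lr=0.05, steps=2000):
    """Minimize, over scores f (assumed loss, for illustration):
        alpha * sum_i (f_i - sc_i)^2
      + (1 - alpha) * sum_{i<j, answers equal} (f_i - f_j)^2

    answers[i] is prompt i's majority answer on one question,
    sc[i] is its self-consistency on that question.
    The small fixed lr is only suitable for small sets of prompts.
    """
    f = list(sc)
    n = len(f)
    for _ in range(steps):
        grad = []
        for i in range(n):
            # Pull toward the prompt's own self-consistency.
            g = 2 * alpha * (f[i] - sc[i])
            # Pull toward prompts that gave the same answer.
            for j in range(n):
                if j != i and answers[j] == answers[i]:
                    g += 2 * (1 - alpha) * (f[i] - f[j])
            grad.append(g)
        f = [fi - lr * gi for fi, gi in zip(f, grad)]
    return f

# Hypothetical inputs: prompts 0 and 1 agree on "A", prompt 2 answers "B".
refined = refine_scores(["A", "A", "B"], [1.0, 0.6, 0.9])
print([round(x, 3) for x in refined])
```

In this toy case the two agreeing prompts end up with scores closer to each other than their raw self-consistency values, while the prompt with a unique answer keeps its original score.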
Key Findings
GLaPE-guided prompt optimization matches or closely trails label-based optimization on standard reasoning benchmarks.
Self-consistency (SC) alone can overestimate wrong prompts because incorrect answers may be internally consistent.
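The overestimation problem can be seen in a toy example. The samples and the gold answer below are invented for illustration; in a real label-free setting the gold answer would not be available.

```python
from collections import Counter

def self_consistency(answers):
    """Return (majority_answer, fraction of samples that agree with it)."""
    answer, freq = Counter(answers).most_common(1)[0]
    return answer, freq / len(answers)

# Hypothetical samples for one question whose correct answer is "12".
samples_by_prompt = {
    "prompt_a": ["15"] * 10,              # always wrong, perfectly consistent
    "prompt_b": ["12"] * 7 + ["15"] * 3,  # mostly right, less consistent
}
gold = "12"

scores = {p: self_consistency(a) for p, a in samples_by_prompt.items()}
best_by_sc = max(scores, key=lambda p: scores[p][1])

# Ranking by self-consistency alone picks the consistently wrong prompt.
print(best_by_sc, scores[best_by_sc][0] == gold)
```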
Results
| Metric | Value | Baseline | Delta (percentage points) | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 77.7% | OPRO 76.6% | +1.1 | GSM8K | GLaPE finds prompt with 77.7% vs OPRO 76.6% | Table 3 |
| Accuracy | 99.3% | OPRO 99.6% | -0.3 | MultiArith | GLaPE 99.3% close to OPRO 99.6% | Table 3 |
What To Try In 7 Days
Run GLaPE instead of label-based scoring when searching prompts for a private LLM.
Use 10 samples per prompt, temperature 0.7, and alpha=0.5 (paper's settings) as a starting point.
Filter out examples that no prompt solves (clean dataset) to improve evaluation reliability.
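The starting settings above can be kept in one place, as in this sketch. `GlapeSettings`, `sample_answers`, and the `sample_fn` callback are hypothetical names; `sample_fn` stands in for whatever client calls your model and is assumed to return one normalized answer string per call.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlapeSettings:
    samples_per_prompt: int = 10  # samples drawn per prompt and question
    temperature: float = 0.7      # sampling temperature
    alpha: float = 0.5            # weight used when combining the two signals

def sample_answers(sample_fn, prompt, question, settings):
    """Draw settings.samples_per_prompt answers using the caller-supplied sample_fn.

    sample_fn(prompt, question, temperature) -> str is an assumed interface.
    """
    return [
        sample_fn(prompt, question, settings.temperature)
        for _ in range(settings.samples_per_prompt)
    ]

# Stub model for demonstration only; replace with a real model call.
def fake_model(prompt, question, temperature):
    return "4"

answers = sample_answers(fake_model, "Let's think step by step.", "2 + 2 = ?", GlapeSettings())
print(len(answers))
```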
Risks & Boundaries
Limitations
Fails when the model gives the same wrong answer under all prompts; GLaPE cannot detect such systematic errors.
Requires many model calls (sampling multiple outputs per prompt) which increases inference cost.
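A rough way to budget for this is to count calls for one evaluation round, assuming one call per sample. The prompt and question counts below are made-up inputs, not figures from the paper.

```python
def evaluation_calls(num_prompts, num_questions, samples_per_prompt=10):
    """Model calls needed to score every prompt on every question once."""
    return num_prompts * num_questions * samples_per_prompt

# Hypothetical round: 8 candidate prompts, 100 questions, 10 samples each.
print(evaluation_calls(num_prompts=8, num_questions=100))
```

Under the same assumptions, a scorer that needs one greedy answer per prompt and question would use a tenth of these calls, which is the trade-off to weigh against annotation cost.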
When Not To Use
When the LLM consistently produces the same wrong answer across prompts (StrategyQA-style failures).
When you have cheap, high-quality gold labels available; supervised scoring is simpler and more direct.
Failure Modes
Selecting confidently wrong prompts because all prompts reinforce the same incorrect output.
Low correlation on datasets where correct and incorrect answers have similar self-consistency.
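If a small labeled validation set is available, one way to check for this failure mode is to correlate the label-free score with measured accuracy across candidate prompts. The sketch below computes a Pearson correlation by hand; the choice of Pearson and the toy numbers are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-prompt values: label-free score vs. measured accuracy.
label_free_scores = [0.62, 0.70, 0.81, 0.90]
accuracies = [0.55, 0.66, 0.64, 0.78]
print(round(pearson(label_free_scores, accuracies), 3))
```

A value well below 1 on such a check suggests the label-free score should not be trusted to rank prompts on that dataset.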

