Evaluate and optimize prompts without gold labels using self- and mutual-consistency

February 4, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao

Why It Matters For Business

GLaPE lets teams optimize prompts without costly labels, enabling prompt tuning for private models and new tasks while cutting annotation costs.

Summary TLDR

GLaPE is an unsupervised scoring method for prompt evaluation that replaces labeled-answer accuracy with two signals: self-consistency (how often a prompt yields the same answer) and mutual-consistency (how well answers agree across different prompts). With GPT-3.5 on eight reasoning benchmarks, GLaPE finds prompts whose task accuracy matches or closely trails label-based optimization, and it generalizes to several open models. The method struggles when the model gives the same wrong answer across prompts.
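As a rough illustration of the first signal, self-consistency for one question under one prompt can be read as the share of sampled answers that agree with the majority answer. The sketch below assumes that reading; the function name `self_consistency` and the sample data are hypothetical and do not come from the paper.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, share of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# Ten hypothetical samples for one question under one prompt.
samples = ["18", "18", "20", "18", "18", "18", "17", "18", "18", "18"]
answer, sc = self_consistency(samples)
print(answer, sc)  # 18 0.8
```

A high value here only says the prompt is stable, not that it is right, which is why the second signal is needed.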

Problem Statement

Prompt-optimization methods that use the LLM as an optimizer rely on gold labels to score candidate prompts. Collecting labels is costly or impossible for private or new tasks. The problem: how to evaluate and optimize prompts without ground-truth answers.

Main Contribution

Define GLaPE, a gold-label-agnostic prompt evaluation method that combines self-consistency with mutual-consistency refinement.

Show GLaPE can drive prompt optimization to reach accuracy similar to label-based methods (OPRO) on 8 reasoning datasets.

Diagnose when self-consistency fails and show mutual-consistency reduces overestimation of wrong-but-consistent prompts.
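This summary does not state the scoring formula. One plausible reading, offered here purely as an assumption, is that each prompt gets a score that stays close to its self-consistency while prompts producing the same answer are pulled toward a shared score, with a weight `alpha` balancing the two terms. The function `refine_scores`, its loss, and the example numbers are all hypothetical.

```python
def refine_scores(answers, sc, alpha=0.5, steps=500):
    """Minimize, by plain gradient descent, the ASSUMED loss
        alpha * sum_i (f_i - sc_i)^2
        + (1 - alpha) * sum_{i<j, same answer} (f_i - f_j)^2
    for one question. answers[i] is prompt i's majority answer,
    sc[i] its self-consistency. Not taken from the paper."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    n = len(answers)
    f = list(sc)  # start from the raw self-consistency values
    # Step size is the inverse of an upper bound on the curvature,
    # which keeps the iteration stable.
    lr = 1.0 / (2 * alpha + 2 * (1 - alpha) * n)
    for _ in range(steps):
        grad = []
        for i in range(n):
            pull = sum(f[i] - f[j] for j in range(n)
                       if j != i and answers[j] == answers[i])
            grad.append(2 * alpha * (f[i] - sc[i]) + 2 * (1 - alpha) * pull)
        f = [fi - lr * g for fi, g in zip(f, grad)]
    return f


# Prompts 0 and 1 agree on "20" with SC 0.7 and 0.3; prompt 2 answers "18".
scores = refine_scores(["20", "20", "18"], [0.7, 0.3, 0.9])
print([round(s, 3) for s in scores])  # [0.567, 0.433, 0.9]
```

Under this assumed loss, the prompt that was confidently consistent on "20" is pulled down because another prompt reaches the same answer with low consistency, while the prompt with a unique answer keeps its score.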

Key Findings

GLaPE-guided prompt optimization matches or closely trails label-based optimization on standard reasoning benchmarks.

Numbers: GSM8K, GLaPE 77.7% vs OPRO 76.6%; MultiArith, GLaPE 99.3% vs OPRO 99.6% (Table 3)

Self-consistency (SC) alone can overestimate wrong prompts because incorrect answers may be internally consistent.

Numbers: GSM8K SC, correct = 82.1% vs incorrect = 49.3%, yet some wrong prompts had SC = 70% (Table 1, Fig. 2)

GLaPE correlates better with true accuracy than SC across datasets.

Numbers: Spearman correlation, GLaPE 0.49 vs SC 0.40 on GSM8K; GLaPE 0.88 vs SC 0.29 on MultiArith (Table 2)
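For readers who want to run the same kind of check on their own prompts, Spearman correlation is the Pearson correlation of the ranks of two lists. The sketch below is a dependency-free version with average ranks for ties; the per-prompt scores and accuracies in the example are made up for illustration and are not the paper's data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical label-free scores and true accuracies for four prompts.
label_free_scores = [0.2, 0.5, 0.4, 0.9]
true_accuracy = [0.30, 0.55, 0.60, 0.80]
rho = spearman(label_free_scores, true_accuracy)
```

Computing the true-accuracy side needs a small labeled set, so this is a validation step rather than part of label-free scoring itself.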

Method generalizes to other LLMs: GLaPE finds competitive prompts for multiple open models.

Numbers: GSM8K accuracies, baseline/OPRO/GLaPE: Mistral 33.8/35.9/35.9; Llama3 45.4/48.6/48.9; Gemma2 39.7/42.4/43.2 (Table 5)

Removing questions that no prompt solves raises GLaPE's correlation with accuracy substantially.

Numbers: Spearman correlation on AQuA, original 0.04 -> cleaned 0.40 (Table 10)

Results

Accuracy (GSM8K)

Value: 77.7%

Baseline: OPRO 76.6%

Accuracy (MultiArith)

Value: 99.3%

Baseline: OPRO 99.6%

GLaPE vs SC Spearman correlation (selected, MultiArith)

Value: GLaPE 0.88 vs SC 0.29

Baseline: SC

Model transfer on GSM8K (Gemma2-9B)

Value: GLaPE 43.2%

Baseline: OPRO 42.4%

Who Should Care

What To Try In 7 Days

Run GLaPE instead of label-based scoring when searching prompts for a private LLM.

Use 10 samples per prompt, temperature 0.7, and alpha=0.5 (paper's settings) as a starting point.

Filter out examples that no prompt solves (clean dataset) to improve evaluation reliability.
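The filtering step in the last item can be sketched as follows. It presupposes that per-prompt correctness is already known for the questions being filtered, for example from a small labeled pilot set; how to decide this without any labels is not covered in this summary. The name `solvable_questions` and the layout of the `correct` matrix are assumptions.

```python
def solvable_questions(correct):
    """correct[q][p] is True when prompt p answers question q correctly.
    Return the indices of questions that at least one prompt solves."""
    return [q for q, row in enumerate(correct) if any(row)]


# Three hypothetical questions evaluated under three prompts.
correct = [
    [True, False, True],    # solved by prompts 0 and 2
    [False, False, False],  # solved by no prompt: dropped
    [False, True, False],   # solved by prompt 1
]
kept = solvable_questions(correct)
print(kept)  # [0, 2]
```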

Optimization Features

Training Optimization

  • prompt optimization without gradients
  • unsupervised prompt scoring

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Fails when the model gives the same wrong answer under all prompts; GLaPE cannot detect such systematic errors.
  • Requires many model calls (multiple sampled outputs per prompt), which increases inference cost.
  • Evaluation compresses a prompt to a single numeric score, losing rich qualitative feedback.

When Not To Use

  • When the LLM consistently produces the same wrong answer across prompts (StrategyQA-style failures).
  • When you have cheap, high-quality gold labels available; supervised scoring is simpler and more direct.
  • If inference budget cannot support repeated sampling per prompt.

Failure Modes

  • Selecting confidently wrong prompts because all prompts reinforce the same incorrect output.
  • Low correlation on datasets where correct and incorrect answers have similar self-consistency.
  • Overfitting to the sampled subset if training dataset size or sample count is too small.

Core Entities

Models

  • GPT-3.5-turbo-0613
  • Mistral-7B
  • Llama3-8B
  • Gemma2-9B

Metrics

  • Accuracy
  • Self-consistency (SC)
  • GLaPE score
  • Spearman correlation

Datasets

  • GSM8K
  • AddSub
  • AQuA
  • MultiArith
  • SVAMP
  • MATH
  • Big-Bench Date
  • StrategyQA

Benchmarks

  • arithmetic reasoning
  • commonsense reasoning
  • mathematical reasoning