Evaluate and optimize prompts without gold labels using self- and mutual-consistency

February 4, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

Paper demonstrates competitive results across 8 benchmarks and several models, but performance depends on model capability and dataset cleanliness.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao

Links

Abstract / PDF / Code

Why It Matters For Business

GLaPE lets teams optimize prompts without costly labels, enabling prompt tuning for private models and new tasks while cutting annotation costs.

Who Should Care

Summary TLDR

GLaPE is an unsupervised scoring method for prompt evaluation that replaces labeled-answer accuracy with two signals: self-consistency (how often a prompt yields the same answer across repeated samples) and mutual-consistency (how well answers agree across different prompts). With GPT-3.5 on eight reasoning benchmarks, GLaPE finds prompts whose task accuracy matches or closely trails label-based optimization, and it generalizes to several open models. The method struggles when the model gives the same wrong answer across prompts.
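The self-consistency signal described above can be illustrated with a minimal sketch. It assumes answers have already been extracted from the model outputs as comparable strings; the function name `self_consistency` and the sample data are hypothetical and not taken from the paper's code.

```python
from collections import Counter


def self_consistency(sampled_answers):
    """Return (majority_answer, share of samples that agree with it)."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(sampled_answers).most_common(1)[0]
    return answer, count / len(sampled_answers)


# Ten hypothetical samples for one question under one prompt.
samples = ["18", "18", "20", "18", "18", "18", "17", "18", "18", "18"]
answer, sc = self_consistency(samples)
print(answer, sc)  # 18 0.8
```

A high value only says the prompt is repeatable on this question, not that the majority answer is correct.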

Problem Statement

Prompt-optimization methods that use the LLM as an optimizer rely on gold labels to score candidate prompts. Collecting labels is costly or impossible for private or new tasks. The problem: how to evaluate and optimize prompts without ground-truth answers.

Main Contribution

Define GLaPE, a gold-label-agnostic prompt evaluation method that combines self-consistency with mutual-consistency refinement.

Show GLaPE can drive prompt optimization to reach accuracy similar to label-based methods (OPRO) on 8 reasoning datasets.
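This summary does not state the scoring formula, so the following is a sketch under an assumed formalization that may differ in detail from the paper. It assumes each prompt's score on a question starts at its self-consistency and is then refined by minimizing a quadratic objective: alpha times the squared distance to self-consistency, plus (1 - alpha) times the squared score gap between every pair of prompts that produced the same answer. The function `refine_scores` and the parameters `lr` and `steps` are hypothetical; only `alpha=0.5` comes from the settings reported here.

```python
def refine_scores(answers, sc, alpha=0.5, lr=0.05, steps=2000):
    """Refine per-prompt scores for ONE question by plain gradient descent.

    Assumed objective (not confirmed by this summary):
        alpha * sum_i (f_i - sc_i)^2
        + (1 - alpha) * sum_{i<j, answers equal} (f_i - f_j)^2
    """
    n = len(answers)
    f = list(sc)  # start from the self-consistency values
    for _ in range(steps):
        # Gradient of the self-consistency term.
        grad = [2 * alpha * (f[i] - sc[i]) for i in range(n)]
        # Gradient of the mutual-consistency term: pull together the
        # scores of prompts that gave the same answer.
        for i in range(n):
            for j in range(n):
                if i != j and answers[i] == answers[j]:
                    grad[i] += 2 * (1 - alpha) * (f[i] - f[j])
        f = [fi - lr * g for fi, g in zip(f, grad)]
    return f


# Three hypothetical prompts answering the same question.
answers = ["42", "42", "17"]
sc = [0.9, 0.5, 0.7]
scores = refine_scores(answers, sc)
print([round(s, 2) for s in scores])  # [0.77, 0.63, 0.7]
```

In this toy case the two prompts that agree on "42" are pulled toward each other, while the prompt with a unique answer keeps its self-consistency as its score.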

Key Findings

GLaPE-guided prompt optimization matches or closely trails label-based optimization on standard reasoning benchmarks.

Numbers: GSM8K: GLaPE 77.7% vs OPRO 76.6%; MultiArith: 99.3% vs 99.6% (Table 3)

Practical Use: You can run prompt search without labels and still get near-supervised prompt quality on common reasoning tasks.

Evidence Ref: Table 3

Self-consistency (SC) alone can overestimate wrong prompts because incorrect answers may be internally consistent.

Numbers: GSM8K SC: correct = 82.1% vs incorrect = 49.3%, yet some wrong prompts had SC = 70% (Table 1, Fig. 2)

Practical Use: Relying only on how repeatable an answer is risks selecting confidently wrong prompts; add cross-prompt checks.

Evidence Ref: Table 1, Figure 2
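One simple form of cross-prompt check is sketched below. It flags prompts whose answer to a question is highly self-consistent yet shared by no other prompt. The helper `flag_isolated_confident` and its threshold are hypothetical illustrations, not the paper's procedure, and an isolated answer is not necessarily wrong; the flag only marks cases worth a second look.

```python
from collections import Counter


def flag_isolated_confident(majority_answers, sc, sc_threshold=0.7):
    """Indices of prompts with SC >= threshold whose answer no other prompt gives."""
    counts = Counter(majority_answers)
    return [
        i
        for i, (ans, s) in enumerate(zip(majority_answers, sc))
        if s >= sc_threshold and counts[ans] == 1
    ]


# Hypothetical majority answers and SC values of four prompts on one question.
answers = ["42", "42", "17", "42"]
sc = [0.9, 0.6, 0.7, 0.8]
print(flag_isolated_confident(answers, sc))  # [2]
```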

Results

Metric | Value | Baseline | Delta (pp) | Split / Dataset | Evidence | Evidence Ref
Accuracy | 77.7% | OPRO 76.6% | +1.1 | GSM8K | GLaPE finds prompt with 77.7% vs OPRO 76.6% | Table 3
Accuracy | 99.3% | OPRO 99.6% | -0.3 | MultiArith | GLaPE 99.3% close to OPRO 99.6% | Table 3

What To Try In 7 Days

Run GLaPE instead of label-based scoring when searching prompts for a private LLM.

Use 10 samples per prompt, temperature 0.7, and alpha=0.5 (paper's settings) as a starting point.

Filter out examples that no prompt solves (clean dataset) to improve evaluation reliability.
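The starting settings listed above can be kept in one small, validated configuration object. This is a sketch only: the class name `GlapeConfig` and its field names are hypothetical, while the default values are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlapeConfig:
    """Starting hyperparameters; names are illustrative, values from the list above."""

    samples_per_prompt: int = 10  # outputs sampled per (prompt, question) pair
    temperature: float = 0.7      # sampling temperature
    alpha: float = 0.5            # weight used when combining the two signals

    def __post_init__(self):
        if self.samples_per_prompt < 1:
            raise ValueError("samples_per_prompt must be at least 1")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")


config = GlapeConfig()
print(config)
```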

Optimization Features

Training Optimization
prompt optimization without gradients, unsupervised prompt scoring

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Fails when the model gives the same wrong answer under all prompts; GLaPE cannot detect such systematic errors.

Requires many model calls, because several outputs are sampled per prompt for every question, which increases inference cost.
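A rough way to size this cost before a run is to multiply the three factors involved. The sketch assumes one model call per sampled output; the function name `estimate_calls` and the example counts of prompts and questions are hypothetical.

```python
def estimate_calls(n_prompts, n_questions, samples_per_prompt):
    """Model calls needed to score every candidate prompt on every question."""
    return n_prompts * n_questions * samples_per_prompt


# Hypothetical run: 20 candidate prompts, 100 unlabeled questions, 10 samples each.
print(estimate_calls(20, 100, 10))  # 20000
```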

When Not To Use

When the LLM consistently produces the same wrong answer across prompts (StrategyQA-style failures).

When you have cheap, high-quality gold labels available; supervised scoring is simpler and more direct.

Failure Modes

Selecting confidently wrong prompts because all prompts reinforce the same incorrect output.

Low correlation on datasets where correct and incorrect answers have similar self-consistency.

Core Entities

Models

GPT-3.5-turbo-0613, Mistral-7B, Llama3-8B, Gemma2-9B

Metrics

Accuracy, Self-consistency (SC), GLaPE score, Spearman correlation

Datasets

GSM8K, AddSub, AQuA, MultiArith, SVAMP, MATH, Big-Bench Date, StrategyQA

Benchmarks

arithmetic reasoning, commonsense reasoning, mathematical reasoning