Evaluate and optimize prompts without gold labels using self- and mutual-consistency

Overview

Decision Snapshot: Ready For Pilot

The paper reports competitive results across 8 benchmarks and several models, but performance depends on model capability and dataset cleanliness.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xuanchang Zhang, Zhuosheng Zhang, Hai Zhao

Links

Abstract / PDF / Code

Why It Matters For Business

GLaPE lets teams optimize prompts without costly labels, enabling prompt tuning for private models and new tasks while cutting annotation costs.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

GLaPE is an unsupervised scoring method for prompt evaluation that replaces labeled-answer accuracy with two signals: self-consistency (how often a prompt yields the same answer across samples) and mutual-consistency (how well answers agree across different prompts). With GPT-3.5 on eight reasoning benchmarks, GLaPE finds prompts whose task accuracy matches or closely trails label-based optimization, and it generalizes to several open models. The method struggles when the model gives the same wrong answer across prompts.
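The self-consistency signal can be sketched as the share of sampled answers that agree with the most frequent answer. This is a minimal illustration, not the paper's code; the function name `self_consistency` and the sample answers are made up for the example.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, share of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)


# Ten hypothetical samples from one prompt on one question.
samples = ["18", "18", "20", "18", "18", "18", "17", "18", "18", "18"]
answer, sc = self_consistency(samples)
print(answer, sc)  # 18 0.8
```

No gold label is needed: the score only measures how repeatable the answer is, which is also why it can be high for a wrong answer.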

Problem Statement

Prompt-optimization methods that use the LLM as an optimizer rely on gold labels to score candidate prompts. Collecting labels is costly or impossible for private or new tasks. The problem: how to evaluate and optimize prompts without ground-truth answers.

Main Contribution

Define GLaPE, a gold label-agnostic prompt evaluation method that combines self-consistency with mutual-consistency refinement.

Show GLaPE can drive prompt optimization to reach accuracy similar to label-based methods (OPRO) on 8 reasoning datasets.

Key Findings

GLaPE-guided prompt optimization matches or closely trails label-based optimization on standard reasoning benchmarks.

Numbers: GSM8K: GLaPE 77.7% vs OPRO 76.6%; MultiArith: GLaPE 99.3% vs OPRO 99.6% (Table 3)

Practical Use: You can run prompt search without labels and still get near-supervised prompt quality on common reasoning tasks.

Evidence Ref: Table 3

Self-consistency (SC) alone can overestimate wrong prompts because incorrect answers may be internally consistent.

Numbers: GSM8K SC: correct=82.1% vs incorrect=49.3%, yet some wrong prompts had SC=70% (Table 1, Figure 2)

Practical Use: Relying only on how repeatable an answer is risks selecting confidently wrong prompts; add cross-prompt checks.

Evidence Ref: Table 1, Figure 2
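One possible cross-prompt check is to pull the scores of prompts that give the same answer toward each other. The sketch below assumes a quadratic objective weighted by `alpha`; this summary does not spell out the exact formula, so the objective, the function name `refine_scores`, and the example numbers are illustrative assumptions rather than the paper's procedure.

```python
from collections import defaultdict


def refine_scores(answers, sc, alpha=0.5):
    """Minimise, in closed form, the ASSUMED objective
        alpha * sum_i (f_i - sc_i)^2
        + (1 - alpha) * sum_{i<j, same answer} (f_i - f_j)^2
    where answers[i] is prompt i's majority answer and sc[i] its
    self-consistency on one question."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    groups = defaultdict(list)
    for i, a in enumerate(answers):
        groups[a].append(i)  # prompts sharing an answer form one group
    refined = [0.0] * len(answers)
    for idx in groups.values():
        n = len(idx)
        mean_sc = sum(sc[i] for i in idx) / n
        pull = (1 - alpha) * n  # strength of the pull toward the group mean
        for i in idx:
            refined[i] = (alpha * sc[i] + pull * mean_sc) / (alpha + pull)
    return refined


# Hypothetical case: three prompts answer "12", one answers "15".
answers = ["12", "12", "12", "15"]
sc = [0.7, 0.3, 0.2, 0.9]
print([round(f, 3) for f in refine_scores(answers, sc)])
```

In this made-up case the prompt with SC=0.7 is pulled down toward its group mean of 0.4 because the other prompts that reach "12" do so inconsistently, while the lone "15" prompt keeps its score.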

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	77.7%	OPRO 76.6%	+1.1 pp	GSM8K	GLaPE finds a prompt with 77.7% vs OPRO 76.6%	Table 3
Accuracy	99.3%	OPRO 99.6%	-0.3 pp	MultiArith	GLaPE 99.3% close to OPRO 99.6%	Table 3

What To Try In 7 Days

Run GLaPE instead of label-based scoring when searching prompts for a private LLM.

Use 10 samples per prompt, temperature 0.7, and alpha=0.5 (paper's settings) as a starting point.

Filter out examples that no prompt solves (clean dataset) to improve evaluation reliability.
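The starting settings above could be wired into a scoring loop along these lines. This is a sketch only: `sample_answer(prompt, question, temperature)` is a hypothetical callable standing in for your own model call, and `fake_model` is a toy stub, not a real API.

```python
import random
from collections import Counter

N_SAMPLES = 10     # samples per prompt (setting quoted above)
TEMPERATURE = 0.7  # sampling temperature (setting quoted above)
ALPHA = 0.5        # quoted above; not used in this self-consistency-only sketch


def score_prompt(prompt, questions, sample_answer,
                 n_samples=N_SAMPLES, temperature=TEMPERATURE):
    """Average self-consistency of `prompt` over unlabeled questions."""
    total = 0.0
    for q in questions:
        answers = [sample_answer(prompt, q, temperature) for _ in range(n_samples)]
        top_count = Counter(answers).most_common(1)[0][1]
        total += top_count / n_samples
    return total / len(questions)


_rng = random.Random(0)


def fake_model(prompt, question, temperature):
    # Toy stand-in for an LLM call: mostly answers "4", sometimes "5".
    return "4" if _rng.random() < 0.8 else "5"


print(round(score_prompt("Let's think step by step.", ["2+2?", "1+3?"], fake_model), 2))
```

Swapping `fake_model` for a real model wrapper keeps the rest of the loop unchanged.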

Optimization Features

Training Optimization

prompt optimization without gradients, unsupervised prompt scoring

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/thunderous77/GLaPE

Risks & Boundaries

Limitations

Fails when the model gives the same wrong answer under all prompts; GLaPE cannot detect such systematic errors.

Requires many model calls (sampling multiple outputs per prompt), which increases inference cost.
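A rough way to size the inference cost is to multiply candidate prompts by questions by samples. The helper name `estimate_calls` and the workload numbers below are hypothetical.

```python
def estimate_calls(n_prompts, n_questions, n_samples=10):
    """Model calls needed to score every prompt on every question."""
    return n_prompts * n_questions * n_samples


# Hypothetical search: 20 candidate prompts, 100 unlabeled questions.
print(estimate_calls(n_prompts=20, n_questions=100))  # 20000
```

Assuming a label-based scorer needs one call per prompt and question, sampling 10 outputs multiplies the call count by 10.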

When Not To Use

When the LLM consistently produces the same wrong answer across prompts (StrategyQA-style failures).

When you have cheap, high-quality gold labels available; supervised scoring is simpler and more direct.

Failure Modes

Selecting confidently wrong prompts because all prompts reinforce the same incorrect output.

Low correlation on datasets where correct and incorrect answers have similar self-consistency.
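If a small labeled subset is available, one way to check for the low-correlation failure mode is to compute the Spearman correlation between unsupervised scores and measured accuracy per prompt. The sketch below is a plain-Python rank correlation with average ranks for ties; the labeled subset and the example numbers are assumptions for illustration.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical per-prompt scores vs accuracy on a small labeled subset.
scores = [0.62, 0.71, 0.55, 0.80]
accuracy = [0.70, 0.74, 0.66, 0.77]
print(spearman(scores, accuracy))
```

A value near 1 suggests the unsupervised score ranks prompts like accuracy does; a value near 0 is a warning sign for that dataset.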

Core Entities

Models

GPT-3.5-turbo-0613, Mistral-7B, Llama3-8B, Gemma2-9B

Metrics

Accuracy, Self-consistency (SC), GLaPE score, Spearman correlation

Datasets

GSM8K, AddSub, AQuA, MultiArith, SVAMP, MATH, Big-Bench Date, StrategyQA

Benchmarks

arithmetic reasoning, commonsense reasoning, mathematical reasoning

