Use model log-probabilities (KL / cross-entropy) to rank prompts and cut human evaluation cost by prioritizing decisive examples

Overview

Decision Snapshot: Needs Validation

The method is simple, uses only model generation probabilities, and shows consistent efficiency gains; it is strongest when comparing similar model variants.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 75%

Novelty: 45%

Authors

Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Links

Abstract / PDF

Why It Matters For Business

Prioritize prompts by model output dissimilarity to cut human labeling cost and time while preserving reliable model rankings, especially when comparing similar model variants.

Who Should Care

Product Manager, ML Engineer, Engineering Lead

Summary TLDR

Human pairwise evaluation of LLMs is costly and often produces ties, where annotators cannot pick a winner. The authors rank prompts by the dissimilarity of two models' token probability distributions, measured with KL divergence and cross-entropy. Prioritizing the top-ranked prompts reduces human 'tie' outcomes (up to ~54% reduction in the top 20% for some model pairs) and yields stable Elo rankings from far fewer annotations. The method works best when comparing closely related models (same family) and uses only model generation probabilities, so it is cheap to run before human labeling.

Problem Statement

Human pairwise preference annotation is expensive and often yields ties that give little signal. Can we pre-select the most informative prompts—using only model outputs—to reduce the number of human labels needed to decide which model is better?

Main Contribution

A simple offline ranking method that orders prompts by completion dissimilarity using KL divergence and cross-entropy computed from model token probabilities.

Empirical evidence that ranking reduces human tie outcomes in early annotations, up to ~54% reduction in top-20% for intra-family model comparisons.
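The ranking idea can be illustrated with a minimal sketch. It assumes each model exposes per-position next-token probability distributions over a shared vocabulary, aligned position by position; the function names (`kl_divergence`, `cross_entropy`, `rank_prompts`) and the averaging over positions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Mean per-position KL(p || q) for arrays of shape (positions, vocab)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def cross_entropy(p, q, eps=1e-12):
    """Mean per-position cross-entropy H(p, q) for arrays of shape (positions, vocab)."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.mean(-np.sum(p * np.log(q), axis=-1)))

def rank_prompts(scores):
    """Return prompt ids sorted from most to least dissimilar.

    `scores` maps a (hypothetical) prompt id to its dissimilarity score.
    """
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: one token position, two-token vocabulary.
model_a = [[0.5, 0.5]]
model_b = [[0.9, 0.1]]
scores = {
    "prompt_1": kl_divergence(model_a, model_b),  # models disagree
    "prompt_2": kl_divergence(model_a, model_a),  # models agree exactly
}
ranking = rank_prompts(scores)
```

In this toy case `prompt_1` ranks first because the two distributions differ, while `prompt_2` scores zero. How the distributions are obtained and aligned for real completions depends on the models and tokenizers involved and is left out of the sketch.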

Key Findings

Ranking prompts by KL divergence or cross-entropy reduces human 'tie' outcomes when annotating model pairs.

Numbers: Up to 54.64% tie reduction (flan-t5 family, top 20%)

Practical Use: Compute KL or cross-entropy on model outputs and label only the top-ranked 20–30% to get far fewer ties and faster decisions.

Evidence Ref: Section 4.1; Table 2; reported 54.64% for KL at 20%

Effect is strongest when comparing models from the same family (closely related models).

Numbers: KL 54.64% and CE 51.15% tie decreases on flan-t5 pair (top 20%)

Practical Use: Prioritize prompts especially when evaluating model variants or new checkpoints of the same base model.

Evidence Ref: Section 4.1; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Tie reduction (KL / Cross-Entropy)	Up to 54.64% reduction in ties (KL) for flan-t5 pair at top 20%	Random prompt ordering	KL 54.64% vs random; CE 51.15% vs random (flan top 20%)	Flan-T5 intra-family, top 20% of prompts	Table 2; Section 4.1	Table 2
Tie reduction (dolly family)	KL reduced ties 22.36% and CE 6.83% at top 20%	Random ordering	KL 22.36% decrease; CE 6.83% decrease (dolly top 20%)	Dolly-v2 intra-family, top 20% of prompts	Section 4.1	Section 4.1
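To make the "Delta" column concrete, here is a small sketch of how a percentage tie decrease could be computed. It assumes the reported figures are relative decreases in tie count against random ordering over the same number of annotated prompts; that reading is an assumption, and the counts below are hypothetical, not taken from the paper.

```python
def relative_tie_reduction(baseline_ties, ranked_ties):
    """Percentage decrease in ties versus the random-ordering baseline."""
    if baseline_ties <= 0:
        raise ValueError("baseline_ties must be positive")
    return 100.0 * (baseline_ties - ranked_ties) / baseline_ties

# Hypothetical counts: 50 ties under random ordering, 25 after prioritization.
reduction = relative_tie_reduction(50, 25)
```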

What To Try In 7 Days

Compute token log-probabilities for two candidate models over your prompt pool and rank prompts by KL divergence or cross-entropy.

Label only the top 20–30% ranked prompts and compare resulting Elo rankings to a small full-sample baseline.

Tune your tie aggregation threshold and annotator instructions, and measure annotator agreement to set thresholds like 0.2 (same family) or 0.1 (different families).
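The selection and Elo steps above can be sketched as follows. This is a minimal illustration using the standard Elo update with a tie scored as 0.5; the K-factor of 32, the starting rating of 1000, and the helper names `top_fraction` and `elo_update` are assumptions for the example, not settings reported in the paper.

```python
def top_fraction(ranked_ids, fraction=0.2):
    """Keep the first `fraction` of an already-ranked list (at least one item)."""
    k = max(1, int(len(ranked_ids) * fraction))
    return ranked_ids[:k]

def elo_update(r_a, r_b, outcome, k=32.0):
    """Standard Elo update.

    outcome is 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie.
    Returns the new ratings (r_a, r_b).
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (outcome - expected_a)
    return r_a + delta, r_b - delta

# Label only the top 20% of ranked prompts, then fold annotations into Elo.
ranked = [f"prompt_{i}" for i in range(10)]   # hypothetical ranked prompt ids
to_label = top_fraction(ranked, 0.2)

rating_a, rating_b = 1000.0, 1000.0
human_outcomes = [1.0, 0.5]                   # hypothetical labels: A wins, then a tie
for outcome in human_outcomes:
    rating_a, rating_b = elo_update(rating_a, rating_b, outcome)
```

A tie between equally rated models leaves both ratings unchanged, which is why tie-heavy annotation batches contribute little signal to the ranking.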

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Prioritization favors prompts with large differences and may under-represent common or boundary cases.

Gains are smaller for inter-family comparisons where tie rates are already low.

When Not To Use

When you need a representative assessment of overall model behavior rather than quick pairwise wins.

When models are too different and tie rates are already low—prioritization offers limited benefit.

Failure Modes

Overfitting evaluation to rare or adversarial prompts that exaggerate differences.

Misleading rankings if models' probability scales differ and normalization is not applied for inter-family comparisons.
