Use model log-probabilities (KL / cross-entropy) to rank prompts and cut human evaluation cost by prioritizing decisive examples

October 22, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

Method is simple, uses only model generation probabilities, and shows consistent efficiency gains; strongest when comparing similar model variants.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 75%

Novelty: 45%

Authors

Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Why It Matters For Business

Prioritize prompts by model output dissimilarity to cut human labeling cost and time while preserving reliable model rankings, especially when comparing similar model variants.

Summary TLDR

Human pairwise evaluation of LLMs is costly and often produces many ties where annotators cannot pick a winner. The authors rank prompts by the dissimilarity of two models' token probability distributions, measured with KL divergence and cross-entropy. Prioritizing the top-ranked prompts reduces human 'tie' outcomes (up to ~54% reduction in the top 20% for some model pairs) and yields stable Elo rankings from far fewer annotations. The method works best when comparing closely related models (same family) and uses only model generation probabilities, so it is cheap to run before human labeling.

Problem Statement

Human pairwise preference annotation is expensive and often yields ties that give little signal. Can we pre-select the most informative prompts—using only model outputs—to reduce the number of human labels needed to decide which model is better?

Main Contribution

A simple offline ranking method that orders prompts by completion dissimilarity using KL divergence and cross-entropy computed from model token probabilities.

Empirical evidence that ranking reduces human tie outcomes in early annotations, up to ~54% reduction in top-20% for intra-family model comparisons.
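The two dissimilarity scores named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each model yields a probability distribution over the same set of outcomes at each aligned position, which this summary does not confirm. The names `EPS` and `prompt_dissimilarity`, and the choice to average over positions, are hypothetical.

```python
import math

EPS = 1e-12  # assumed floor to avoid log(0); not taken from the source


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return -sum(pi * math.log(max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)


def prompt_dissimilarity(dists_a, dists_b, metric=kl_divergence):
    """Average a metric over aligned positions (hypothetical aggregation)."""
    pairs = list(zip(dists_a, dists_b))  # truncates to the shorter completion
    if not pairs:
        raise ValueError("need at least one aligned position")
    return sum(metric(p, q) for p, q in pairs) / len(pairs)
```

Identical distributions score zero under KL, and a larger score marks a prompt on which the two models disagree more. Note that KL is asymmetric, so the order of the two models matters.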

Key Findings

Ranking prompts by KL divergence or cross-entropy reduces human 'tie' outcomes when annotating model pairs.

Numbers: Up to 54.64% tie reduction (flan-t5 family, top 20%)

Practical Use: Compute KL or cross-entropy on model outputs and label only the top-ranked 20–30% to get far fewer ties and faster decisions.

Evidence Ref: Section 4.1; Table 2; reported 54.64% for KL at 20%

Effect is strongest when comparing models from the same family (closely related models).

Numbers: KL 54.64% & CE 51.15% tie decreases on flan-t5 pair (top 20%)

Practical Use: Prioritize prompts especially when evaluating model variants or new checkpoints of the same base model.

Evidence Ref: Section 4.1; Table 2
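To track the same quantity on your own annotations, a tie-reduction figure can be computed as sketched below. The summary does not say whether the reported percentages are relative or absolute decreases; this sketch assumes a relative decrease against the random-ordering baseline. The function names and the "tie" label string are hypothetical.

```python
def tie_rate(labels):
    """Fraction of annotations labeled 'tie'."""
    if not labels:
        raise ValueError("need at least one annotation")
    return sum(1 for label in labels if label == "tie") / len(labels)


def relative_tie_reduction(baseline_labels, prioritized_labels):
    """Relative drop in tie rate versus the baseline (assumed definition)."""
    base = tie_rate(baseline_labels)
    if base == 0:
        raise ValueError("baseline has no ties, so the reduction is undefined")
    return (base - tie_rate(prioritized_labels)) / base


# Toy labels for illustration only; these are not data from the paper.
random_order = ["tie", "tie", "A", "B"]
top_ranked = ["tie", "A", "B", "A"]
reduction = relative_tie_reduction(random_order, top_ranked)
```

With these toy labels the tie rate falls from 0.5 to 0.25, so `reduction` is 0.5, a 50% relative decrease.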

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Tie reduction (KL / Cross-Entropy) | Up to 54.64% reduction in ties (KL) for flan-t5 pair at top 20% | Random prompt ordering | KL 54.64% vs random; CE 51.15% vs random (flan top 20%) | Flan-T5 intra-family, top 20% of prompts | Table 2; Section 4.1 | Table 2
Tie reduction (dolly family) | KL reduced ties 22.36% and CE 6.83% at top 20% | Random ordering | KL 22.36% decrease; CE 6.83% decrease (dolly top 20%) | Dolly-v2 intra-family, top 20% of prompts | Section 4.1 | Section 4.1

What To Try In 7 Days

Compute token log-probabilities for two candidate models over your prompt pool and rank prompts by KL divergence or cross-entropy.

Label only the top 20–30% ranked prompts and compare resulting Elo rankings to a small full-sample baseline.

Tune your tie aggregation threshold and annotator instructions, and measure annotator agreement to set thresholds like 0.2 (same family) or 0.1 (different families).
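The first two steps above can be sketched as follows, assuming you already have one dissimilarity score per prompt. This is an illustrative outline rather than the paper's pipeline: the helper names, the standard Elo update with ties scored as 0.5, and the K-factor of 32 are assumptions, not values taken from the source.

```python
def select_top_fraction(scores, fraction=0.2):
    """Return the prompt ids with the highest dissimilarity scores."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(fraction * len(ranked)))  # always keep at least one
    return ranked[:keep]


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, outcome, k=32.0):
    """Update both ratings; outcome is 1.0 (A wins), 0.0 (B wins), 0.5 (tie)."""
    delta = k * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_elo(outcomes, start=1000.0, k=32.0):
    """Replay a sequence of pairwise outcomes and return the final ratings."""
    rating_a, rating_b = start, start
    for outcome in outcomes:
        rating_a, rating_b = elo_update(rating_a, rating_b, outcome, k)
    return rating_a, rating_b
```

A possible check is to call `run_elo` once on labels from the prioritized subset and once on labels from a small random sample, then compare whether both agree on which model ranks higher. Elo depends on the order in which outcomes are replayed, so shuffling and averaging over several runs may give a steadier comparison.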

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Prioritization favors prompts with large differences and may under-represent common or boundary cases.

Gains are smaller for inter-family comparisons where tie rates are already low.

When Not To Use

When you need a representative assessment of overall model behavior rather than quick pairwise wins.

When models are too different and tie rates are already low—prioritization offers limited benefit.

Failure Modes

Overfitting evaluation to rare or adversarial prompts that exaggerate differences.

Misleading rankings if models' probability scales differ and normalization is not applied for inter-family comparisons.

Core Entities

Models

flan-t5-xxl, flan-t5-xl, dolly-v2-12b, dolly-v2-7b, mpt-7b-instruct, falcon-7b-instruct

Metrics

KL Divergence, Cross-Entropy, Elo

Datasets

Soda, P3, CommonsenseQA, CommonGen, AdversarialQA

Context Entities

Models

T5, Pythia, MPT, Falcon