Overview
Production Readiness
0.75
Novelty Score
0.45
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
Prioritize prompts by model output dissimilarity to cut human labeling cost and time while preserving reliable model rankings, especially when comparing similar model variants.
Summary TLDR
Human pairwise evaluation of LLMs is costly and often produces many ties where annotators cannot pick a winner. The authors rank prompts by the dissimilarity of two models' token probability distributions, measured with KL divergence and cross-entropy. Prioritizing the top-ranked prompts reduces human 'tie' outcomes (up to a ~54% reduction in the top 20% for some model pairs) and yields stable Elo rankings from far fewer annotations. The method works best when comparing closely related models (same family), and it uses only model generation probabilities, so it is cheap to run before human labeling.
Problem Statement
Human pairwise preference annotation is expensive and often yields ties that give little signal. Can we pre-select the most informative prompts—using only model outputs—to reduce the number of human labels needed to decide which model is better?
Main Contribution
A simple offline ranking method that orders prompts by completion dissimilarity using KL divergence and cross-entropy computed from model token probabilities.
Empirical evidence that ranking reduces human tie outcomes in early annotations, with up to a ~54% reduction in the top 20% of ranked prompts for intra-family model comparisons.
Demonstration that prioritized subsets (top ~20–30%) produce Elo scores consistent with full-data 'gold standard' rankings, enabling fewer human labels.
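The ranking idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each model has already produced a per-position probability distribution over a shared vocabulary for each prompt, and that a prompt's score is the mean per-position divergence. The function names (`kl_divergence`, `cross_entropy`, `prompt_score`, `rank_prompts`), the `eps` smoothing term, and the averaging rule are all hypothetical choices.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q) if pi > 0)

def prompt_score(dists_a, dists_b, metric=kl_divergence):
    """Mean per-position metric over the positions both models share.

    Assumption: positions are aligned by index; the source does not
    specify how completions of different lengths are handled.
    """
    n = min(len(dists_a), len(dists_b))
    if n == 0:
        return 0.0
    return sum(metric(dists_a[t], dists_b[t]) for t in range(n)) / n

def rank_prompts(pool, metric=kl_divergence):
    """pool maps prompt id -> (dists_a, dists_b).

    Returns prompt ids ordered from most to least dissimilar.
    """
    scores = {pid: prompt_score(a, b, metric) for pid, (a, b) in pool.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy pool with a two-token vocabulary and one position per prompt.
pool = {
    "same": ([[0.5, 0.5]], [[0.5, 0.5]]),
    "diff": ([[0.9, 0.1]], [[0.1, 0.9]]),
}
ranked = rank_prompts(pool)
```

In the toy pool, the prompt where the two models disagree ranks first under either metric, which is the behavior the prioritization relies on.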
Key Findings
Ranking prompts by KL divergence or cross-entropy reduces human 'tie' outcomes when annotating model pairs.
The effect is strongest when comparing models from the same family (closely related models).
Prioritized subsets preserve model rankings (Elo) compared to full annotation.
Inter-family comparisons give smaller gains because tie rates are already lower.
Human annotator agreement and tie thresholds affect outcomes and aggregation.
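For readers unfamiliar with Elo, the sketch below shows the standard rating update applied to pairwise annotations, with a tie scored as 0.5. The starting rating of 1000, the K-factor of 32, and the placeholder model names are assumptions for illustration; the source does not state which Elo settings were used.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update ratings in place.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta

def elo_from_annotations(annotations, initial=1000.0, k=32.0):
    """annotations: iterable of (model_a, model_b, outcome) tuples."""
    ratings = {}
    for a, b, outcome in annotations:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        update_elo(ratings, a, b, outcome, k)
    return ratings

# Made-up outcomes for two placeholder models: one win for A, then a tie.
ratings = elo_from_annotations([
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 0.5),
])
```

A tie between equally rated models leaves both ratings unchanged, which is why tie-heavy annotations contribute little ranking signal.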
Results
Tie reduction (KL / Cross-Entropy)
Tie reduction (dolly family)
Inter-family tie change
Elo score stability
Who Should Care
What To Try In 7 Days
Compute token log-probabilities for two candidate models over your prompt pool and rank prompts by KL divergence or cross-entropy.
Label only the top 20–30% ranked prompts and compare resulting Elo rankings to a small full-sample baseline.
Measure annotator agreement and use it to tune annotator instructions and the tie aggregation threshold, for example 0.2 for same-family pairs or 0.1 for different-family pairs.
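The selection and comparison steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `ranked_ids` is a list of prompt ids already sorted from most to least dissimilar, labels are the strings 'a', 'b', or 'tie', and "consistent rankings" is read as "same model ordering by rating". The helper names are hypothetical.

```python
def top_fraction(ranked_ids, fraction=0.2):
    """Keep the first `fraction` of an already-ranked list (at least one item)."""
    k = max(1, round(len(ranked_ids) * fraction))
    return ranked_ids[:k]

def tie_rate(labels):
    """labels maps prompt id -> 'a', 'b', or 'tie'."""
    if not labels:
        return 0.0
    return sum(1 for v in labels.values() if v == "tie") / len(labels)

def same_ordering(ratings_subset, ratings_full):
    """True when both rating dicts order the models identically."""
    def order(ratings):
        return sorted(ratings, key=ratings.get, reverse=True)
    return order(ratings_subset) == order(ratings_full)

# Toy example: ten ranked prompts, label only the top 20%.
ranked_ids = [f"p{i}" for i in range(10)]
selected = top_fraction(ranked_ids, fraction=0.2)

# Made-up human labels for the selected prompts.
labels = {"p0": "a", "p1": "tie"}
subset_tie_rate = tie_rate(labels)
```

Comparing `tie_rate` on the prioritized subset against a small random sample gives a quick check on whether the ranking is helping for a given model pair.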
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Prioritization favors prompts with large differences and may under-represent common or boundary cases.
- Gains are smaller for inter-family comparisons where tie rates are already low.
- Requires access to model token probabilities and comparable decoding settings across models.
- Human annotator variability (≈70% agreement) and chosen aggregation thresholds affect outcomes.
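To make the last limitation concrete, the sketch below computes raw pairwise annotator agreement and aggregates votes with a tie threshold. The aggregation rule is an assumption: a prompt is called a tie when the gap between the two models' vote shares is at or below the threshold. The source does not define how its thresholds are applied, so treat this as one plausible reading rather than the authors' procedure.

```python
from itertools import combinations

def pairwise_agreement(votes_per_prompt):
    """Fraction of annotator pairs giving the same label, pooled over prompts."""
    agree = 0
    total = 0
    for votes in votes_per_prompt:
        for x, y in combinations(votes, 2):
            agree += int(x == y)
            total += 1
    return agree / total if total else 0.0

def aggregate(votes, tie_threshold=0.2):
    """Collapse one prompt's votes ('a', 'b', 'tie') into a single label.

    Assumed rule (not from the source): tie when
    |share('a') - share('b')| <= tie_threshold.
    """
    if not votes:
        return "tie"
    margin = (votes.count("a") - votes.count("b")) / len(votes)
    if abs(margin) <= tie_threshold:
        return "tie"
    return "a" if margin > 0 else "b"

# Made-up votes from three annotators on two prompts.
votes_per_prompt = [["a", "a", "b"], ["a", "b", "tie"]]
agreement = pairwise_agreement(votes_per_prompt)
final_labels = [aggregate(v) for v in votes_per_prompt]
```

Raising the threshold converts more split votes into ties, so the threshold and the observed agreement level should be reported together with any tie-rate figure.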
When Not To Use
- When you need a representative assessment of overall model behavior rather than quick pairwise wins.
- When models are too different and tie rates are already low—prioritization offers limited benefit.
- If you cannot compute or trust token log-probabilities for one or both models (e.g., black-box APIs without probabilities).
Failure Modes
- Overfitting evaluation to rare or adversarial prompts that exaggerate differences.
- Misleading rankings if models' probability scales differ and normalization is not applied for inter-family comparisons.
- Annotation bias if top-ranked prompts systematically target a narrow skill or genre.
Core Entities
Models
- flan-t5-xxl
- flan-t5-xl
- dolly-v2-12b
- dolly-v2-7b
- mpt-7b-instruct
- falcon-7b-instruct
Metrics
- KL Divergence
- Cross-Entropy
- Elo
Datasets
- Soda
- P3
- CommonsenseQA
- CommonGen
- AdversarialQA
Context Entities
Models
- T5
- Pythia
- MPT
- Falcon

