Overview
The method is simple, needs only model generation probabilities, and consistently reduces the number of human annotations required; it is strongest when comparing similar model variants.
Citations: 4
Evidence Strength: 0.70
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 75%
Novelty: 45%
Why It Matters For Business
Prioritize prompts by model output dissimilarity to cut human labeling cost and time while preserving reliable model rankings, especially when comparing similar model variants.
Who Should Care
Summary TLDR
Human pairwise evaluation of LLMs is costly and often produces many ties where annotators cannot pick a winner. The authors rank prompts by the dissimilarity of two models' token probability distributions (KL divergence and cross-entropy). Prioritizing the top-ranked prompts reduces human 'tie' outcomes (up to ~54% reduction in the top 20% of ranked prompts for some model pairs) and yields stable Elo rankings from far fewer annotations. The method works best when comparing closely related models (same family) and uses only model generation probabilities, so it is cheap to run before human labeling.
Problem Statement
Human pairwise preference annotation is expensive and often yields ties that give little signal. Can we pre-select the most informative prompts—using only model outputs—to reduce the number of human labels needed to decide which model is better?
Main Contribution
A simple offline ranking method that orders prompts by completion dissimilarity using KL divergence and cross-entropy computed from model token probabilities.
Empirical evidence that ranking reduces human tie outcomes in early annotations, with up to ~54% fewer ties in the top 20% of ranked prompts for intra-family model comparisons.
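The ranking idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes both models expose a full next-token probability distribution over a shared vocabulary at each generated position, that positions are aligned by index and truncated to the shorter completion, and that per-position scores are averaged. The names `prompt_score`, `rank_prompts`, and the `toy` data are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Dist = Sequence[float]  # one probability distribution over a shared vocabulary


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """H(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def prompt_score(dists_a: List[Dist], dists_b: List[Dist],
                 metric: Callable[[Dist, Dist], float] = kl_divergence) -> float:
    """Average the metric over aligned token positions.

    Assumption: positions are aligned by index and truncated to the
    shorter completion; the paper may align or aggregate differently.
    """
    n = min(len(dists_a), len(dists_b))
    if n == 0:
        return 0.0
    return sum(metric(dists_a[i], dists_b[i]) for i in range(n)) / n


def rank_prompts(per_prompt: Dict[str, Tuple[List[Dist], List[Dist]]],
                 metric: Callable[[Dist, Dist], float] = kl_divergence) -> List[str]:
    """Return prompt ids sorted from most to least dissimilar."""
    scores = {pid: prompt_score(a, b, metric) for pid, (a, b) in per_prompt.items()}
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


# Toy data (made up): one token position, two-word vocabulary.
toy = {
    "p1": ([[0.5, 0.5]], [[0.5, 0.5]]),  # models agree
    "p2": ([[0.9, 0.1]], [[0.1, 0.9]]),  # models disagree
}
ranking = rank_prompts(toy)  # "p2" is ranked ahead of "p1"
```

Passing `metric=cross_entropy` switches the ordering criterion without changing anything else. In practice the per-position distributions would come from whatever inference stack is in use; how they are extracted is outside this sketch.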
Key Findings
Ranking prompts by KL divergence or cross-entropy reduces human 'tie' outcomes when annotating model pairs.
Effect is strongest when comparing models from the same family (closely related models).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Tie reduction (KL / Cross-Entropy) | Up to 54.64% reduction in ties (KL) for flan-t5 pair at top 20% | Random prompt ordering | KL 54.64% vs random; CE 51.15% vs random (flan top 20%) | Flan-T5 intra-family, top 20% of prompts | Table 2; Section 4.1 | Table 2 |
| Tie reduction (dolly family) | KL reduced ties 22.36% and CE 6.83% at top 20% | Random ordering | KL 22.36% decrease; CE 6.83% decrease (dolly top 20%) | Dolly-v2 intra-family, top 20% of prompts | Section 4.1 | Section 4.1 |
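
To reproduce this kind of figure on your own annotations, one plausible reading of "reduction in ties" is the relative drop in tie rate between a randomly ordered sample and the top-ranked sample. The sketch below assumes that reading, which the summary does not confirm; the label values and both sample lists are invented for illustration.

```python
from typing import List


def tie_rate(labels: List[str]) -> float:
    """Fraction of annotations labeled 'tie'."""
    if not labels:
        raise ValueError("no labels given")
    return labels.count("tie") / len(labels)


def relative_tie_reduction(random_labels: List[str], ranked_labels: List[str]) -> float:
    """Relative drop in tie rate of the ranked sample versus the random baseline.

    Assumption: 'reduction' means (baseline - ranked) / baseline.
    """
    baseline = tie_rate(random_labels)
    if baseline == 0:
        raise ValueError("baseline has no ties, so a reduction is undefined")
    return (baseline - tie_rate(ranked_labels)) / baseline


# Made-up annotations: 4 of 10 ties at random, 2 of 10 ties after ranking.
random_sample = ["tie"] * 4 + ["model_a"] * 3 + ["model_b"] * 3
ranked_sample = ["tie"] * 2 + ["model_a"] * 5 + ["model_b"] * 3
reduction = relative_tie_reduction(random_sample, ranked_sample)  # 0.5, i.e. 50% fewer ties
```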
What To Try In 7 Days
Compute token log-probabilities for two candidate models over your prompt pool and rank prompts by KL divergence or cross-entropy.
Label only the top 20–30% ranked prompts and compare resulting Elo rankings to a small full-sample baseline.
Tune your tie aggregation threshold and annotator instructions, and measure annotator agreement to set thresholds like 0.2 (same family) or 0.1 (different families).
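The selection and comparison steps above could be prototyped as follows. This is a sketch under stated assumptions: it uses the standard Elo update with a tie scored as 0.5 for each side, a K-factor of 32, and a starting rating of 1000, none of which are specified in this summary. The function names and sample outcomes are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def select_top_fraction(ranked_prompt_ids: List[str], fraction: float = 0.2) -> List[str]:
    """Keep the first `fraction` of an already ranked prompt list (at least one)."""
    k = max(1, round(len(ranked_prompt_ids) * fraction))
    return ranked_prompt_ids[:k]


def elo_ratings(outcomes: Iterable[Tuple[str, str, float]],
                k: float = 32.0, start: float = 1000.0) -> Dict[str, float]:
    """Standard Elo over (model_a, model_b, score_a) triples.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    K-factor and starting rating are assumed values.
    """
    ratings: Dict[str, float] = {}
    for a, b, score_a in outcomes:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * (expected_a - score_a)  # zero-sum update
    return ratings


# Made-up example: label only the top 20% of ten ranked prompts.
to_label = select_top_fraction([f"prompt_{i}" for i in range(10)], fraction=0.2)
ratings = elo_ratings([("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)])
```

Running `elo_ratings` once on outcomes from the top-ranked subset and once on a small randomly sampled baseline gives two rating tables whose model order can be compared directly. Because Elo depends on the order of updates, shuffling the outcomes and averaging over several runs may give a steadier comparison.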
Risks & Boundaries
Limitations
Prioritization favors prompts with large differences and may under-represent common or boundary cases.
Gains are smaller for inter-family comparisons where tie rates are already low.
When Not To Use
When you need a representative assessment of overall model behavior rather than a quick decision on which of two models is better.
When models are too different and tie rates are already low—prioritization offers limited benefit.
Failure Modes
Overfitting evaluation to rare or adversarial prompts that exaggerate differences.
Misleading rankings if models' probability scales differ and normalization is not applied for inter-family comparisons.

