Use model log-probabilities (KL / cross-entropy) to rank prompts and cut human evaluation cost by prioritizing decisive examples

October 22, 2023 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.45

Cost Impact Score

0.7

Citation Count

4

Authors

Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Why It Matters For Business

Prioritize prompts by model output dissimilarity to cut human labeling cost and time while preserving reliable model rankings, especially when comparing similar model variants.

Summary TLDR

Human pairwise evaluation of LLMs is costly and often produces many ties where annotators cannot pick a winner. The authors rank prompts by the dissimilarity of two models' token probability distributions, measured with KL divergence and cross-entropy. Prioritizing the top-ranked prompts reduces human 'tie' outcomes (up to a ~54% reduction within the top 20% of ranked prompts for some model pairs) and yields stable Elo rankings from far fewer annotations. The method works best when comparing closely related models (same family). Because it uses only model generation probabilities, it is cheap to run before human labeling.
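
For reference, the two dissimilarity scores can be written as below. This is a sketch under assumptions: the symbols (models A and B, per-position distributions p_A^{(t)} and p_B^{(t)} over a shared vocabulary V, completion length T) and the averaging over token positions are illustrative choices, since this summary does not spell out the exact aggregation.

```latex
\[
D_{\mathrm{KL}}(p_A \,\|\, p_B)
  = \frac{1}{T} \sum_{t=1}^{T} \sum_{v \in V}
    p_A^{(t)}(v) \, \log \frac{p_A^{(t)}(v)}{p_B^{(t)}(v)}
\]
\[
H(p_A, p_B)
  = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v \in V}
    p_A^{(t)}(v) \, \log p_B^{(t)}(v)
\]
```

Under these definitions the two scores differ only by the average entropy of model A's own distributions, H(p_A, p_B) = H(p_A) + D_KL(p_A || p_B), so they tend to produce related but not identical rankings.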

Problem Statement

Human pairwise preference annotation is expensive and often yields ties that give little signal. Can we pre-select the most informative prompts—using only model outputs—to reduce the number of human labels needed to decide which model is better?

Main Contribution

A simple offline ranking method that orders prompts by completion dissimilarity using KL divergence and cross-entropy computed from model token probabilities.

Empirical evidence that ranking reduces human tie outcomes in early annotations, with up to a ~54% reduction within the top 20% of ranked prompts for intra-family model comparisons.

Demonstration that prioritized subsets (top ~20–30%) produce Elo scores consistent with full-data 'gold standard' rankings, enabling fewer human labels.
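
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each prompt already has per-position token distributions from both models, aligned over a shared vocabulary, and the names `kl_divergence`, `cross_entropy`, `rank_prompts`, and `EPS` are hypothetical.

```python
import math

EPS = 1e-12  # assumption: small constant to guard against log(0)


def kl_divergence(p, q):
    """Mean per-position KL(p || q) over aligned token distributions."""
    total = 0.0
    for p_t, q_t in zip(p, q):
        total += sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p_t, q_t))
    return total / len(p)


def cross_entropy(p, q):
    """Mean per-position cross-entropy H(p, q) over aligned token distributions."""
    total = 0.0
    for p_t, q_t in zip(p, q):
        total += -sum(pi * math.log(qi + EPS) for pi, qi in zip(p_t, q_t))
    return total / len(p)


def rank_prompts(dists, score_fn=kl_divergence):
    """dists maps prompt id -> (model A distributions, model B distributions).

    Returns prompt ids sorted from most to least dissimilar.
    """
    scores = {pid: score_fn(p, q) for pid, (p, q) in dists.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: one token position, two-word vocabulary.
toy = {
    "similar": ([[0.9, 0.1]], [[0.9, 0.1]]),
    "different": ([[0.9, 0.1]], [[0.1, 0.9]]),
}
ranking = rank_prompts(toy)
```

In this toy case the prompt on which the two models disagree is ranked first, so it would be sent to annotators before the prompt on which they agree.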

Key Findings

Ranking prompts by KL divergence or cross-entropy reduces human 'tie' outcomes when annotating model pairs.

Numbers: Up to 54.64% tie reduction (flan-t5 family, top 20%)

Effect is strongest when comparing models from the same family (closely related models).

Numbers: KL 54.64% and CE 51.15% tie decreases on flan-t5 pair (top 20%)

Prioritized subsets preserve model rankings (Elo) compared to full annotation.

Numbers: Top 20–30% prioritized prompts recover gold-standard Elo ordering in experiments

Inter-family comparisons give smaller gains because tie rates are already lower.

Numbers: Cross-entropy reduced ties by 9.85% for flan-t5-xxl vs dolly-v2-12b at top 20%

Human annotator agreement and tie thresholds affect outcomes and aggregation.

Numbers: Annotator agreement ≈70%; tie threshold set to 0.2 (same family) or 0.1 (different families)
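
To make the role of a tie threshold concrete, here is one possible aggregation rule. It is an assumption, not the rule from the paper: this summary does not define how the 0.2 and 0.1 thresholds are applied, so the sketch treats a prompt as a tie when the vote-share margin between the two models is at or below the threshold. The function name `aggregate_votes` is hypothetical.

```python
from collections import Counter


def aggregate_votes(votes, tie_threshold=0.2):
    """Collapse several annotator votes ('A', 'B', or 'tie') into one label.

    Assumed rule: a prompt is a tie when |share(A) - share(B)| <= tie_threshold.
    """
    counts = Counter(votes)
    margin = (counts["A"] - counts["B"]) / len(votes)
    if abs(margin) <= tie_threshold:
        return "tie"
    return "A" if margin > 0 else "B"


clear_win = aggregate_votes(["A", "A", "B"])       # margin = 1/3
split_vote = aggregate_votes(["A", "B", "tie"])    # margin = 0
```

Under this rule a stricter (smaller) threshold labels fewer prompts as ties, which is one reason the threshold choice affects the reported tie rates.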

Results

Tie reduction (KL / Cross-Entropy)

Value: Up to 54.64% reduction in ties (KL) for flan-t5 pair at top 20%

Baseline: Random prompt ordering

Tie reduction (dolly family)

Value: KL reduced ties by 22.36% and CE by 6.83% at top 20%

Baseline: Random ordering

Inter-family tie change

Value: Cross-entropy reduced ties by 9.85% (flan-t5-xxl vs dolly-v2-12b at top 20%)

Baseline: Random ordering

Elo score stability

Value: Top 20–30% prioritized prompts recover full-data Elo ordering

Baseline: Elo from 100% human annotations (gold standard)
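
A tie-reduction figure of this kind can be computed as a relative drop in tie rate against the random-ordering baseline. The sketch below assumes that definition (the summary does not state the exact formula), and the helper names `tie_rate` and `tie_reduction_pct` are hypothetical.

```python
def tie_rate(outcomes):
    """Fraction of outcomes labeled 'tie'."""
    return sum(o == "tie" for o in outcomes) / len(outcomes)


def tie_reduction_pct(ranked_outcomes, baseline_outcomes, top_frac=0.2):
    """Relative tie reduction (%) in the first top_frac of each ordering.

    ranked_outcomes: human labels in dissimilarity-ranked order.
    baseline_outcomes: human labels in random order.
    """
    k = max(1, int(len(ranked_outcomes) * top_frac))
    ranked = tie_rate(ranked_outcomes[:k])
    baseline = tie_rate(baseline_outcomes[:k])
    if baseline == 0:
        return 0.0  # no ties in the baseline, so nothing to reduce
    return 100.0 * (baseline - ranked) / baseline


# Toy labels: ties are pushed toward the end of the ranked ordering.
ranked = ["A", "B", "A", "tie", "B", "tie", "tie", "A", "tie", "tie"]
random_order = ["tie", "A", "tie", "B", "A", "tie", "B", "tie", "A", "tie"]
reduction = tie_reduction_pct(ranked, random_order, top_frac=0.5)
```

Here the first half of the ranked ordering has one tie in five labels against two in five for the random ordering, a relative reduction of about 50%.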

What To Try In 7 Days

Compute token log-probabilities for two candidate models over your prompt pool and rank prompts by KL divergence or cross-entropy.

Label only the top 20–30% ranked prompts and compare resulting Elo rankings to a small full-sample baseline.

Tune your tie-aggregation threshold and annotator instructions, and measure annotator agreement before fixing a threshold; the reported settings are 0.2 (same family) and 0.1 (different families).
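
The second step above, comparing Elo orderings from a prioritized subset against the full set, can be sketched as follows. This uses the standard Elo update with illustrative settings (K = 32, initial rating 1000) that are assumptions rather than values from the paper; `elo_ratings`, `ordering`, `model-x`, and `model-y` are hypothetical names.

```python
def elo_ratings(matches, k=32.0, initial=1000.0):
    """matches: iterable of (model_a, model_b, outcome), outcome in {'A', 'B', 'tie'}."""
    ratings = {}
    score_for_a = {"A": 1.0, "B": 0.0, "tie": 0.5}
    for a, b, outcome in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        s_a = score_for_a[outcome]
        # Zero-sum update: whatever model A gains, model B loses.
        ratings[a] = ra + k * (s_a - expected_a)
        ratings[b] = rb - k * (s_a - expected_a)
    return ratings


def ordering(ratings):
    """Model names from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Toy human labels, already sorted from most to least dissimilar prompt.
outcomes = ["A", "A", "A", "B", "tie", "tie", "A", "tie", "tie", "tie"]
ranked_matches = [("model-x", "model-y", o) for o in outcomes]

top = ranked_matches[: max(1, int(0.3 * len(ranked_matches)))]
same_order = ordering(elo_ratings(top)) == ordering(elo_ratings(ranked_matches))
```

Elo depends on match order, so when running this on real labels it may be worth averaging over several shuffles before comparing the subset ordering with the full-data ordering.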

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Prioritization favors prompts with large differences and may under-represent common or boundary cases.
  • Gains are smaller for inter-family comparisons where tie rates are already low.
  • Requires access to model token probabilities and comparable decoding settings across models.
  • Human annotator variability (≈70% agreement) and chosen aggregation thresholds affect outcomes.

When Not To Use

  • When you need a representative assessment of overall model behavior rather than quick pairwise wins.
  • When models are too different and tie rates are already low—prioritization offers limited benefit.
  • If you cannot compute or trust token log-probabilities for one or both models (e.g., black-box APIs without probabilities).

Failure Modes

  • Overfitting evaluation to rare or adversarial prompts that exaggerate differences.
  • Misleading rankings if models' probability scales differ and normalization is not applied for inter-family comparisons.
  • Annotation bias if top-ranked prompts systematically target a narrow skill or genre.
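
One way to watch for the narrow-skill bias listed above is to compare the category mix of the prioritized subset with that of the full prompt pool. The sketch below is an illustrative check, not part of the paper's method; `category_shares`, `coverage_gaps`, and the `min_ratio` cutoff are hypothetical, and the dataset names are used only as toy category labels.

```python
from collections import Counter


def category_shares(categories):
    """Map each category to its fraction of the given list."""
    counts = Counter(categories)
    total = len(categories)
    return {c: n / total for c, n in counts.items()}


def coverage_gaps(pool_categories, selected_categories, min_ratio=0.5):
    """Categories whose share in the selection is below min_ratio times their pool share."""
    pool = category_shares(pool_categories)
    selected = category_shares(selected_categories)
    return sorted(c for c, share in pool.items() if selected.get(c, 0.0) < min_ratio * share)


# Toy example: the pool is balanced, but the top-ranked prompts are not.
pool = ["Soda"] * 5 + ["CommonGen"] * 5
top_ranked = ["Soda", "Soda", "Soda", "Soda"]
gaps = coverage_gaps(pool, top_ranked)
```

If a category shows up in `gaps`, adding a few randomly sampled prompts from it to the labeling batch is a simple way to keep the evaluation from drifting toward a single genre.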

Core Entities

Models

  • flan-t5-xxl
  • flan-t5-xl
  • dolly-v2-12b
  • dolly-v2-7b
  • mpt-7b-instruct
  • falcon-7b-instruct

Metrics

  • KL Divergence
  • Cross-Entropy
  • Elo

Datasets

  • Soda
  • P3
  • CommonsenseQA
  • CommonGen
  • AdversarialQA

Context Entities

Models

  • T5
  • Pythia
  • MPT
  • Falcon