Small prompt or format changes can reorder LLM leaderboards by many ranks

Overview

Decision Snapshot: Needs Validation

Solid empirical evidence across 11 models and two benchmarks shows a real risk in relying on a single MCQ leaderboard. The results are reproducible with the provided code, but the proposed mitigation is not a complete fix.

Citations: 2

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 50%

Authors

Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Why It Matters For Business

If you pick models from a single MCQ leaderboard snapshot, you risk choosing a weaker or ill-suited model; small evaluation details can change ranks and, with them, cost and product outcomes.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

Leaderboards built from multiple‑choice benchmarks are brittle. Small, innocuous changes, such as reordering choices, swapping the letter symbols, or changing the scoring method, can move models up or down many ranks on MMLU and ARC-C. The paper catalogs three classes of small perturbations (choice order/IDs, prompt/scoring, and in‑context examples), measures their effects across 11 models, and recommends hybrid scoring and cautious interpretation of MCQ leaderboards. Code is available.

Problem Statement

Practitioners use MCQ leaderboards to pick expensive LLMs. But small, implementation‑level choices in prompts and scoring can massively change leaderboard order, risking wrong model selection and wasted cost.

Main Contribution

Systematic study showing MCQ leaderboard rankings are highly sensitive to small perturbations.

Isolation of three perturbation classes: answer choice format/order, prompt/scoring, and in‑context example content.

Key Findings

Minor perturbations can shift model ranks by many positions on MMLU.

Numbers: Up to 8 rank positions change; for example, Yi‑6b moved from rank 3 to rank 9 (Table 1)

Practical Use: Do not trust single leaderboard snapshots for model selection; validate with multiple prompts and formats before buying or deploying a model.

Evidence Ref: Figure 1; Table 1
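The rank-shift check described in this finding can be sketched in a few lines. This is a minimal illustration, not the paper's code: the model names and accuracies below are hypothetical placeholders, and ties are broken alphabetically as an arbitrary assumption.

```python
def ranks_from_scores(scores):
    """Map model name -> 1-based rank, highest score first (ties broken by name)."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_displacements(baseline_scores, perturbed_scores):
    """Signed rank change per model: positive means the model dropped in rank."""
    base = ranks_from_scores(baseline_scores)
    pert = ranks_from_scores(perturbed_scores)
    return {m: pert[m] - base[m] for m in base}


# Hypothetical accuracies, for illustration only (not results from the paper).
baseline = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60, "model_d": 0.55}
perturbed = {"model_a": 0.66, "model_b": 0.52, "model_c": 0.61, "model_d": 0.58}

shifts = rank_displacements(baseline, perturbed)
max_shift = max(abs(d) for d in shifts.values())
print(shifts, max_shift)
```

Reporting the largest absolute shift next to the leaderboard makes it obvious when a ranking depends on one evaluation setup.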

Leaderboards often disagree under small changes (Kendall kτ falls below stability threshold).

Numbers: kτ dropped to 0.564 under choice shuffles (kτ ≤ 0.75 indicates major disagreement)

Practical Use: Report ranking sensitivity (e.g., Kendall's τ) alongside scores to show stability before comparing models.

Evidence Ref: Section 5.1; Table A.5; Table 1
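Kendall's τ between two leaderboards can be computed directly from pairwise comparisons. The sketch below implements the tie-free form (tau-a) in plain Python; the rankings are hypothetical, and the 0.75 cutoff is the disagreement threshold quoted above.

```python
from itertools import combinations

STABILITY_THRESHOLD = 0.75  # tau at or below this is treated as major disagreement


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings (dicts of model -> rank, no ties assumed)."""
    models = sorted(rank_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings before and after a perturbation (illustrative only).
original = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
shuffled = {"model_a": 1, "model_b": 4, "model_c": 2, "model_d": 3}

tau = kendall_tau(original, shuffled)
unstable = tau <= STABILITY_THRESHOLD
print(round(tau, 3), unstable)
```

If rankings contain ties, a tie-aware variant such as tau-b is the safer choice; `scipy.stats.kendalltau` provides one.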

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Max rank displacement	Up to 8 positions (MMLU)	—	—	MMLU	Figure 1; Abstract	Figure 1; Table 1
Ranking agreement (Kendall kτ)	kτ = 0.564 after random choice shuffles	kτ = 1.0 (original)	−0.436	MMLU subset	Table 1; Section 5.1	Table 1

What To Try In 7 Days

Re-evaluate candidate models using hybrid scoring and report kτ to show ranking stability

Run 3 quick perturbations (shuffle choices, swap option symbols, and cloze vs symbol) and compare ranks

Sanitize few‑shot/context examples and rerun tests to detect leakage before deployment
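Two of the perturbations listed above, shuffling the choices and swapping the option symbols, can be sketched as follows. The item schema (`question`, `choices`, `answer` as an index) and the alternative symbol set are assumptions made for illustration; they are not taken from the paper's harness.

```python
import random


def shuffle_choices(item, rng):
    """Return a copy of an MCQ item with choices reordered and the gold index updated."""
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    return {
        "question": item["question"],
        "choices": [item["choices"][i] for i in order],
        # New position of the gold answer after reordering.
        "answer": order.index(item["answer"]),
    }


def format_prompt(item, symbols=("A", "B", "C", "D")):
    """Render the item as a prompt, labelling each choice with the given symbols."""
    lines = [item["question"]]
    lines += [f"{s}. {c}" for s, c in zip(symbols, item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical item, for illustration only.
item = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "answer": 1,
}

shuffled = shuffle_choices(item, random.Random(0))
default_prompt = format_prompt(shuffled)
symbol_prompt = format_prompt(item, symbols=("$", "&", "#", "@"))
print(default_prompt)
print(symbol_prompt)
```

Scoring every candidate model on each variant and comparing the resulting rankings gives a quick read on how stable the leaderboard is.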

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/National-Centerfor-AI-Saudi-Arabia/lm-evaluation-harness

https://arxiv.org/pdf/2402.01781v2

Data URLs

MMLU (Hendrycks et al., 2020)

ARC-Challenge (Clark et al., 2018)

Risks & Boundaries

Limitations

Cannot quantify root causes of token/position bias because pretraining data for models is not available

Proposed mitigation (hybrid scoring) reduces bias but is not a full solution

When Not To Use

When evaluating non‑MCQ tasks like freeform generation or long‑form reasoning

When you require a definitive, deployment‑grade ranking without further validation

Failure Modes

Leaderboard rank swaps due to answer ID tokens or choice ordering

High apparent accuracy driven by leaked answers in few‑shot context
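A simple guard against the second failure mode is to check whether any test question also appears among the few-shot examples. The sketch below is one possible check, not the paper's procedure: the example schema is hypothetical, and it only catches exact repeats after light normalization, not paraphrases.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace for exact-match comparison."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def find_leaks(few_shot_examples, test_questions):
    """Return test questions whose normalized text matches a few-shot example question."""
    seen = {normalize(ex["question"]) for ex in few_shot_examples}
    return [q for q in test_questions if normalize(q) in seen]


# Hypothetical data, for illustration only.
few_shot = [{"question": "What is the capital of France?", "answer": "Paris"}]
test_questions = [
    "what is the capital of  France",
    "What is the capital of Spain?",
]

leaks = find_leaks(few_shot, test_questions)
print(leaks)
```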

Core Entities

Models

phi-2, Yi-6b, Yi-34b, Mistral-7b, Mistral-7b-Instruct, Llama-2-7b, Llama-2-7b-chat, Llama-2-13b, Llama-2-13b-chat, Llama-2-70b, Llama-2-70b-chat

Metrics

Accuracy, Kendall's τ (kτ), RStd (recall standard deviation)
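As a rough illustration of the last metric, the sketch below assumes RStd is the standard deviation of per-option recall, so that a model biased toward one answer symbol scores higher. Using the population standard deviation is an assumption here, and the labels and predictions are hypothetical.

```python
from statistics import pstdev


def recall_std(gold, predicted, labels=("A", "B", "C", "D")):
    """Standard deviation of per-option recall (assumed definition of RStd)."""
    recalls = []
    for label in labels:
        idx = [i for i, g in enumerate(gold) if g == label]
        if not idx:
            continue  # skip options that never appear as the gold answer
        hits = sum(1 for i in idx if predicted[i] == label)
        recalls.append(hits / len(idx))
    return pstdev(recalls)


# Hypothetical predictions from a model that over-selects option "A".
gold = ["A", "A", "B", "B", "C", "C", "D", "D"]
pred = ["A", "A", "A", "B", "A", "C", "A", "D"]

rstd = recall_std(gold, pred)
print(round(rstd, 4))
```

A value near zero means the model recalls every option position about equally well; larger values point to a preference for particular symbols or positions.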

Datasets

MMLU (Massive Multitask Language Understanding), ARC-Challenge (ARC-C)

Benchmarks

MMLU, ARC-C
