Overview
Solid empirical evidence across 11 models and two benchmarks shows real risk in relying on single MCQ leaderboards; results are reproducible via provided code but do not offer a complete fix.
Citations: 2
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
If you pick models from a single MCQ leaderboard snapshot, you risk choosing a weaker or ill-suited model: small evaluation details can change the ranking, and with it your cost and product outcomes.
Who Should Care
Summary TLDR
Leaderboards built from multiple‑choice benchmarks are brittle. Small, seemingly innocuous changes, such as reordering the choices, swapping the option symbols, or changing the scoring method, can shift model ranks on MMLU and ARC, by up to 8 positions on MMLU. The paper catalogs three classes of small perturbations (choice order/IDs, prompt/scoring, and in‑context examples), measures their effects across 11 models, and recommends hybrid scoring and cautious interpretation of MCQ leaderboards. Code is available.
Problem Statement
Practitioners use MCQ leaderboards to pick expensive LLMs. But small, implementation‑level choices in prompts and scoring can substantially reorder a leaderboard, which risks selecting the wrong model and wasting money.
Main Contribution
Systematic study showing MCQ leaderboard rankings are highly sensitive to small perturbations.
Isolation of three perturbation classes: answer choice format/order, prompt/scoring, and in‑context example content.
Key Findings
Minor perturbations can shift model ranks by up to 8 positions on MMLU.
Leaderboards often disagree under small changes: Kendall's τ (kτ) falls below the stability threshold, for example to 0.564 after random choice shuffles on an MMLU subset.
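The rank-shift finding above can be reproduced in miniature. The sketch below is not the paper's code: the model names and accuracies are made up, and `max_rank_displacement` is a hypothetical helper that reports the largest number of positions any model moves between two leaderboards.

```python
def ranks(scores):
    """Map model name -> rank (1 = best) from a dict of accuracies.

    Ties are broken alphabetically by model name (an assumption of this sketch).
    """
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def max_rank_displacement(baseline, perturbed):
    """Largest absolute rank change of any model between two leaderboards."""
    r0, r1 = ranks(baseline), ranks(perturbed)
    return max(abs(r0[m] - r1[m]) for m in r0)


# Illustrative accuracies only; these are not results from the paper.
baseline = {"model_a": 0.71, "model_b": 0.68, "model_c": 0.64, "model_d": 0.60}
perturbed = {"model_a": 0.66, "model_b": 0.69, "model_c": 0.61, "model_d": 0.67}

print(max_rank_displacement(baseline, perturbed))
```

With these toy numbers, `model_a` drops from first to third and `model_d` rises from fourth to second, so the largest displacement is 2.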
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Max rank displacement | Up to 8 positions (MMLU) | — | — | MMLU | Figure 1; Abstract | Figure 1; Table 1 |
| Ranking agreement (Kendall's τ, kτ) | kτ = 0.564 after random choice shuffles | kτ = 1.0 (original) | −0.436 | MMLU subset | Table 1; Section 5.1 | Table 1 |
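To make the ranking-agreement metric concrete, here is a minimal pure-Python sketch of Kendall's τ for two rankings without ties (the τ-a variant). Whether the paper uses exactly this variant is an assumption, and the rankings below are invented for illustration. A value of 1.0 means identical order, 0.0 means no agreement beyond chance, and −1.0 means fully reversed order.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings given as {model: rank} dicts.

    Assumes both dicts cover the same models and contain no tied ranks.
    """
    models = list(rank_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical leaderboards before and after a perturbation.
original = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
shuffled = {"model_a": 3, "model_b": 1, "model_c": 4, "model_d": 2}

print(kendall_tau(original, original))
print(kendall_tau(original, shuffled))
```

In the toy example, three of the six model pairs keep their relative order and three flip, so the second call returns 0.0.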
What To Try In 7 Days
Re-evaluate candidate models using hybrid scoring and report kτ to show ranking stability
Run 3 quick perturbations (shuffle the choices, swap the option symbols, and switch between cloze and symbol scoring) and compare the resulting ranks
Sanitize few‑shot/context examples and rerun tests to detect leakage before deployment
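Two of the quick perturbations above, shuffling the choices and swapping the option symbols, can be sketched as follows. This is an illustrative sketch rather than the paper's harness: `shuffle_choices`, `format_prompt`, the prompt layout, and the replacement symbols are all assumptions made for the example.

```python
import random


def shuffle_choices(choices, answer_idx, rng):
    """Shuffle the answer choices and return them with the new gold index."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    # The gold answer moves to wherever its original index landed.
    return new_choices, order.index(answer_idx)


def format_prompt(question, choices, symbols=("A", "B", "C", "D")):
    """Render one MCQ item; the layout here is a hypothetical template."""
    lines = [question]
    lines += [f"{sym}. {choice}" for sym, choice in zip(symbols, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


question = "Which planet is closest to the Sun?"
choices = ["Venus", "Mercury", "Earth", "Mars"]
gold = 1  # index of "Mercury"

rng = random.Random(0)  # fixed seed so the perturbation is repeatable
new_choices, new_gold = shuffle_choices(choices, gold, rng)

# Same item, shuffled order, with rarely used symbols instead of A/B/C/D.
print(format_prompt(question, new_choices, symbols=("$", "&", "#", "@")))
print("gold symbol index:", new_gold)
```

Running every candidate model on the original and the perturbed prompts, then comparing the two leaderboards, gives a quick read on how stable the ranking is.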
Risks & Boundaries
Limitations
Cannot quantify the root causes of token/position bias because the models' pretraining data is not available
Proposed mitigation (hybrid scoring) reduces bias but is not a full solution
When Not To Use
When evaluating non‑MCQ tasks like freeform generation or long‑form reasoning
When you require a definitive, deployment‑grade ranking without further validation
Failure Modes
Leaderboard rank swaps due to answer ID tokens or choice ordering
High apparent accuracy driven by leaked answers in few‑shot context
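The leaked-answer failure mode can be screened for with a simple exact-match check between few-shot examples and test items. The sketch below is a rough first pass under stated assumptions: items are dicts with a `question` field (a hypothetical schema), and only case- and whitespace-insensitive duplicates are caught, so paraphrased leaks would need fuzzier matching.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(text.lower().split())


def find_leaks(few_shot_examples, test_items):
    """Return test items whose question also appears among the few-shot examples."""
    shot_questions = {normalize(ex["question"]) for ex in few_shot_examples}
    return [item for item in test_items
            if normalize(item["question"]) in shot_questions]


# Toy data; the field names are assumptions for this sketch.
few_shot = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]
test_set = [
    {"question": "what is  2 + 2?", "answer": "4"},
    {"question": "Which gas do plants absorb?", "answer": "Carbon dioxide"},
]

leaks = find_leaks(few_shot, test_set)
print(len(leaks), "leaked item(s)")
```

Any item this check flags should be removed from the few-shot context, or from the test set, before accuracy is reported.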

