Small prompt or format changes can reorder LLM leaderboards by many ranks

February 1, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

2

Authors

Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Links

Abstract / PDF

Why It Matters For Business

If you pick models from a single MCQ leaderboard snapshot, you risk choosing a weaker or poorly suited model: small evaluation details can change rank, and with it cost and product outcomes.

Summary TLDR

Leaderboards built from multiple‑choice benchmarks are brittle. Small, innocuous changes—reordering choices, swapping the letter symbols, or changing scoring—can move models up or down many ranks on MMLU and ARC. The paper catalogs three classes of tiny perturbations (choice order/IDs, prompt/scoring, and in‑context examples), measures their effects across 11 models, and recommends hybrid scoring and cautious interpretation of MCQ leaderboards. Code is available.

Problem Statement

Practitioners use MCQ leaderboards to pick expensive LLMs. But small, implementation‑level choices in prompts and scoring can massively change leaderboard order, risking wrong model selection and wasted cost.

Main Contribution

Systematic study showing MCQ leaderboard rankings are highly sensitive to small perturbations.

Isolation of three perturbation classes: answer choice format/order, prompt/scoring, and in‑context example content.

Empirical evidence across 11 models on MMLU and ARC showing selection bias toward particular symbols and positions, and sensitivity to scoring style.

Practical recommendation to prefer hybrid scoring (reduces selection bias) and to treat MCQ leaderboards with caution.

Public release of evaluation code and configs to reproduce tests.

Key Findings

Minor perturbations can shift model ranks by many positions on MMLU.

Numbers: Up to 8 rank positions of change; for example, Yi-6b moved from rank 3 to rank 9 (Table 1)
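A rank shift like this can be measured by ranking the same models under the baseline and perturbed setups and taking the largest change. The sketch below is illustrative only: the model names and accuracies are made up, and `ranks` and `max_rank_shift` are hypothetical helpers, not functions from the paper's code.

```python
def ranks(scores):
    """Map model name -> 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def max_rank_shift(baseline_scores, perturbed_scores):
    """Return (model, signed shift) for the largest absolute rank change."""
    base = ranks(baseline_scores)
    pert = ranks(perturbed_scores)
    shifts = {model: pert[model] - base[model] for model in base}
    worst = max(shifts, key=lambda model: abs(shifts[model]))
    return worst, shifts[worst]


# Made-up accuracies, for illustration only.
baseline = {"model_a": 0.70, "model_b": 0.65, "model_c": 0.60, "model_d": 0.55}
perturbed = {"model_a": 0.52, "model_b": 0.64, "model_c": 0.61, "model_d": 0.56}

worst_model, worst_shift = max_rank_shift(baseline, perturbed)
print(worst_model, worst_shift)  # positive shift = moved down the board
```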

Leaderboards often disagree under small changes (Kendall's τ, written kτ here, falls below the stability threshold).

Numbers: kτ dropped to 0.564 under choice shuffles (kτ ≤ 0.75 indicates major disagreement)
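For readers who want to compute this agreement score themselves, here is a minimal sketch of Kendall's τ between two rankings of the same models. It assumes the tie-free variant (tau-a), which may differ from the exact variant used in the paper; the rankings are made up, and the 0.75 cutoff is the one quoted above.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Same sign means both leaderboards order the pair the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up rankings: two adjacent pairs of models swap places.
original = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
shuffled = {"m1": 2, "m2": 1, "m3": 4, "m4": 3}

tau = kendall_tau(original, shuffled)
major_disagreement = tau <= 0.75
```

With only four models, two adjacent swaps already push τ well below 0.75, so the threshold is more informative on longer leaderboards.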

Choice ID tokens and choice positions cause selection bias in all tested models.

Numbers: Replacing A/B/C/D with rare symbols reduced accuracy and increased RStd (e.g., ∆Acc up to −11.3, ∆RStd +35, Fig.4/TableA
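The symbol-swap perturbation amounts to rendering the same question with different choice IDs. A minimal sketch follows; the prompt template and the replacement symbols are assumptions chosen for illustration, not the exact ones used in the paper.

```python
def format_question(question, choices, symbols=("A", "B", "C", "D")):
    """Render an MCQ prompt, labelling each choice with the given symbols."""
    lines = [question]
    lines += [f"{symbol}. {choice}" for symbol, choice in zip(symbols, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


question = "Which city is the capital of France?"
choices = ["Lyon", "Paris", "Nice", "Lille"]

baseline_prompt = format_question(question, choices)
# Illustrative "rare" symbols; only the IDs change, not the content or order.
perturbed_prompt = format_question(question, choices, symbols=("$", "&", "#", "@"))
```

Because content and order are untouched, any accuracy difference between the two prompts can be attributed to the ID tokens alone.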

Scoring method strongly affects bias and accuracy.

Numbers: Symbol scoring gave highest accuracy but highest RStd; cloze lowered bias but also accuracy; hybrid reduced bias while maintaining
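This summary does not define the three scoring styles. The sketch below assumes the common definitions: symbol scoring shows the lettered choices and scores each letter; cloze scoring hides the choices and scores each answer text; hybrid scoring shows the lettered choices but scores the answer text. `loglik` is a hypothetical callable standing in for a real model call, `toy_loglik` is a fake model with a built-in preference for "A", and dividing by character length is an assumed normalization.

```python
def with_choices(question, choices, symbols):
    """Prompt that lists the lettered choices before asking for an answer."""
    body = "\n".join(f"{s}. {c}" for s, c in zip(symbols, choices))
    return f"{question}\n{body}\nAnswer:"


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def symbol_score(loglik, question, choices, symbols="ABCD"):
    ctx = with_choices(question, choices, symbols)
    return argmax([loglik(ctx, f" {s}") for s in symbols[: len(choices)]])


def cloze_score(loglik, question, choices):
    ctx = f"{question}\nAnswer:"  # choices are not shown to the model
    return argmax([loglik(ctx, f" {c}") / len(c) for c in choices])


def hybrid_score(loglik, question, choices, symbols="ABCD"):
    ctx = with_choices(question, choices, symbols)  # choices shown...
    return argmax([loglik(ctx, f" {c}") / len(c) for c in choices])  # ...text scored


def toy_loglik(context, continuation):
    """Fake model (assumption): loves the token 'A', knows 'Paris' is right."""
    text = continuation.strip()
    if len(text) == 1:
        return -0.1 if text == "A" else -2.0
    return (-0.5 if text == "Paris" else -1.0) * len(text)


question = "Which city is the capital of France?"
choices = ["Lyon", "Paris", "Nice", "Lille"]

picked_by_symbol = symbol_score(toy_loglik, question, choices)
picked_by_cloze = cloze_score(toy_loglik, question, choices)
picked_by_hybrid = hybrid_score(toy_loglik, question, choices)
```

In this toy setup the symbol scorer follows the letter preference and picks the first choice, while the cloze and hybrid scorers pick "Paris". That shows how a bias on ID tokens can be bypassed, but it says nothing about real models.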

Models copy answers shown in context and are misled by incorrect context.

Numbers: Providing correct one-shot/5-shot examples raised accuracy to near 90–99% for some models; providing incorrect examples cut

Small prompt text edits and some few‑shot variations have little effect on rankings.

Numbers: kτ > 0.9 for prompt instruction edits and some few-shot tweaks

Results

Max rank displacement

Value: Up to 8 positions (MMLU)

Ranking agreement (Kendall kτ)

Value: kτ = 0.564 after random choice shuffles

Baseline: kτ = 1.0 (original)

Accuracy

Value: ∆Acc down to −11.3, ∆RStd up to +35

Baseline: Symbols A/B/C/D

Effect of scoring method

Value: Symbol scoring: highest accuracy but highest bias; Cloze: lowest bias but lower accuracy; Hybrid: reduced bias vs symbol

Baseline: Symbol scoring

In-context cheating (one-shot / five-shot with correct example)

Value: Model accuracies often rose to ~90–99% with correct examples (e.g., Mistral-7B 97–99%)

Baseline: zero/few-shot baseline

Prompt instruction edits

Value: kτ > 0.9 (minimal ranking change)

Baseline: original prompt

Who Should Care

What To Try In 7 Days

Re-evaluate candidate models using hybrid scoring and report kτ to show ranking stability

Run 3 quick perturbations (shuffle choices, swap option symbols, and cloze vs symbol) and compare ranks

Sanitize few‑shot/context examples and rerun tests to detect leakage before deployment
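The choice-shuffle perturbation from the list above can be sketched as follows. This is a minimal illustration, not the paper's code: `shuffle_choices` is a hypothetical helper, and the fixed seed is only there to make reruns comparable.

```python
import random


def shuffle_choices(choices, answer_idx, rng):
    """Shuffle answer choices and return the new index of the gold answer."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    new_choices = [choices[i] for i in order]
    new_answer_idx = order.index(answer_idx)  # where the gold answer landed
    return new_choices, new_answer_idx


rng = random.Random(0)  # seeded so the perturbed benchmark is reproducible
choices = ["Lyon", "Paris", "Nice", "Lille"]
answer_idx = 1  # "Paris"

new_choices, new_answer_idx = shuffle_choices(choices, answer_idx, rng)
```

Score every candidate model on both the original and the shuffled items, then compare the two resulting rankings. Tracking the gold index through the shuffle matters: forgetting to remap it silently corrupts the accuracy numbers.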

Reproducibility

Data URLs

  • MMLU (Hendrycks et al., 2020)
  • ARC-Challenge (Clark et al., 2018)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Cannot quantify root causes of token/position bias because pretraining data for models is not available
  • Proposed mitigation (hybrid scoring) reduces bias but is not a full solution
  • Experiments focus mainly on MMLU and ARC-C; other tasks may behave differently

When Not To Use

  • When evaluating non‑MCQ tasks like freeform generation or long‑form reasoning
  • When you require a definitive, deployment‑grade ranking without further validation

Failure Modes

  • Leaderboard rank swaps due to answer ID tokens or choice ordering
  • High apparent accuracy driven by leaked answers in few‑shot context
  • Scoring scheme choice creates misleading tradeoffs between bias and raw accuracy
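The leaked-answer failure mode can be screened for with a simple overlap check between few-shot examples and test questions. The sketch below assumes that exact match after light normalization is enough to flag a leak; paraphrased leaks would need fuzzier matching. `normalize` and `leaked_questions` are hypothetical helper names.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace for loose matching."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def leaked_questions(few_shot_questions, test_questions):
    """Return test questions that also appear among the few-shot examples."""
    seen = {normalize(q) for q in few_shot_questions}
    return [q for q in test_questions if normalize(q) in seen]


few_shot = ["What is the boiling point of water at sea level?"]
test_set = [
    "what is the boiling point of water at sea level",
    "Which gas do plants absorb during photosynthesis?",
]

leaks = leaked_questions(few_shot, test_set)
```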

Core Entities

Models

  • phi-2
  • Yi-6b
  • Yi-34b
  • Mistral-7b
  • Mistral-7b-Instruct
  • Llama-2-7b
  • Llama-2-7b-chat
  • Llama-2-13b
  • Llama-2-13b-chat
  • Llama-2-70b
  • Llama-2-70b-chat

Metrics

  • Accuracy
  • Kendall's τ (kτ)
  • RStd (recall standard deviation)
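RStd can be computed from per-choice recall. A minimal sketch, assuming RStd is the population standard deviation of recall (in percentage points) across the answer labels; the paper may use a different estimator or scale, and the data below is made up.

```python
from statistics import pstdev


def recall_std(gold, pred, labels=("A", "B", "C", "D")):
    """Standard deviation of per-label recall, in percentage points."""
    recalls = []
    for label in labels:
        idx = [i for i, g in enumerate(gold) if g == label]
        if not idx:
            continue  # label never appears as the gold answer
        hits = sum(pred[i] == label for i in idx)
        recalls.append(100.0 * hits / len(idx))
    return pstdev(recalls)


# Made-up data: balanced gold labels and a model that always answers "A".
gold = ["A", "A", "B", "B", "C", "C", "D", "D"]
always_a = ["A"] * len(gold)

biased_rstd = recall_std(gold, always_a)
unbiased_rstd = recall_std(gold, gold)
```

A model with no preference among labels has equal recall on every label and an RStd near zero; the always-"A" model has recall of 100 on one label and 0 on the others, which gives a large RStd.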

Datasets

  • MMLU (Massive Multitask Language Understanding)
  • ARC-Challenge (ARC-C)

Benchmarks

  • MMLU
  • ARC-C