Overview
The method is simple and inference-only, so it is easy to adopt; experiments cover many open models and three standard MCQ benchmarks, but gains vary by model and domain shift.
Citations: 22
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.
Who Should Care
Summary TLDR
Modern LLMs show a strong selection bias in multiple-choice questions: they prefer some option IDs (e.g., 'A' or 'C') regardless of content. As a result, simple changes such as moving the correct answer between A/B/C/D cause large accuracy swings. The authors trace the main cause to token-level prior mass on option ID tokens and propose PriDe, a label-free, inference-time debiasing method that estimates the model's ID prior on a small sample (e.g., 2–5%) by permuting options and then uses that prior to correct predictions on the remaining samples. Evaluated on 20 LLMs across MMLU, ARC, and CSQA, PriDe reduces the imbalance in per-ID recall and often raises accuracy with little extra compute.
Problem Statement
Multiple-choice evaluations assume models pick answers based on content. In practice, many LLMs systematically prefer certain option IDs (selection bias). This makes MCQ scores unstable: moving the gold answer to a favored ID can raise accuracy by about 15 points for some models (e.g., llama-30B 53.1→68.2 on MMLU), and moving it to a disfavored ID can lower accuracy by several points (e.g., gpt-3.5-turbo drops 67.2→60.9 when the correct answer is moved to D).
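The answer-moving check described above can be sketched as follows. This is a minimal illustration, not the authors' code: `move_gold`, `accuracy_when_gold_at`, and the `predict` callable are hypothetical names, and keeping the other options in their original relative order is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, List[str], int]  # (question, options, index of gold option)


def move_gold(options: Sequence[str], gold: int, target: int) -> Tuple[List[str], int]:
    """Return a copy of `options` with the gold answer placed at index `target`.

    The remaining options keep their relative order (an assumption of this sketch).
    """
    rest = list(options[:gold]) + list(options[gold + 1:])
    moved = rest[:target] + [options[gold]] + rest[target:]
    return moved, target


def accuracy_when_gold_at(
    items: Sequence[Item],
    predict: Callable[[str, List[str]], int],  # hypothetical model wrapper
    target: int,
) -> float:
    """Accuracy when every gold answer is moved to the same position."""
    correct = 0
    for question, options, gold in items:
        moved, new_gold = move_gold(options, gold, target)
        if predict(question, moved) == new_gold:
            correct += 1
    return correct / len(items)
```

Comparing `accuracy_when_gold_at(items, predict, t)` across `t = 0..3` against accuracy on the default ordering gives the size of the swing for a given model.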
Main Contribution
Demonstrate widespread selection bias in LLMs across 20 models and three MCQ benchmarks (MMLU, ARC, CSQA).
Pinpoint token bias on option ID tokens (e.g., 'A','B','C','D') as a primary intrinsic cause of selection bias; position bias is present but irregular.
Key Findings
Simple answer-moving changes cause large accuracy swings.
Selection bias is mainly driven by token-level priors on ID tokens, not just ordering.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | gpt-3.5-turbo MMLU 67.2 → 60.9 (−6.3); llama-30B 53.1 → 68.2 (+15.2) | default ordering | example drops/boosts up to ~15 pp | MMLU (0-shot) | Table 1 in paper | Table 1 |
| Recall imbalance (RStd) reduction by removing IDs | Example: gpt-3.5-turbo RStd 5.5 → 1.0 (removing IDs) | default prompt with A/B/C/D | −4.5 in this example; drops of several points | MMLU / ARC (0-shot) | Table 2, Table 3 | Table 2 |
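
A minimal sketch of how the RStd metric in the table could be computed. It reads RStd as the standard deviation of per-ID recall, expressed in percentage points; using the population standard deviation is an assumption, and the function names are illustrative.

```python
from collections import Counter
from statistics import pstdev
from typing import Dict, Sequence


def recall_per_id(gold: Sequence[str], pred: Sequence[str], ids: str = "ABCD") -> Dict[str, float]:
    """Recall for each option ID: the fraction of questions whose gold ID is X
    that the model also answered with X. IDs that never occur as gold are skipped."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {i: hits[i] / totals[i] for i in ids if totals[i] > 0}


def rstd(gold: Sequence[str], pred: Sequence[str], ids: str = "ABCD") -> float:
    """Standard deviation of per-ID recalls, in percentage points.

    Population std (pstdev) is an assumption of this sketch.
    """
    recalls = list(recall_per_id(gold, pred, ids).values())
    return pstdev(recalls) * 100
```

A model with no selection bias has similar recall on every ID, so its RStd is close to zero; a model that favors 'A' has high recall on 'A' and low recall elsewhere, which inflates RStd.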
What To Try In 7 Days
Run an 'answer-moving' test: move gold answers across A/B/C/D and record accuracy swings.
Measure recall balance (RStd) across option IDs to detect selection bias.
Implement PriDe: permute options on ~2–5% of live/test samples to estimate ID priors, then debias remaining predictions at inference.
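The PriDe step in the list above can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes the observed probability over ID slots factors into an ID prior times a content term, uses cyclic permutations, and averages log-probabilities to recover the prior. All function names, and the `observed_fn` callable that returns ID probabilities for a given permutation, are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def normalize(xs: Sequence[float]) -> List[float]:
    total = sum(xs)
    return [x / total for x in xs]


def cyclic_permutations(n: int) -> List[List[int]]:
    """perm[i] is the index of the original option shown in ID slot i."""
    return [[(i + shift) % n for i in range(n)] for shift in range(n)]


def estimate_sample_prior(observed_fn: Callable[[List[int]], Sequence[float]], n: int) -> List[float]:
    """Estimate the ID prior for one question by averaging log-probabilities
    of each ID slot over all cyclic permutations, then renormalizing."""
    perms = cyclic_permutations(n)
    mean_log = [0.0] * n
    for perm in perms:
        for i, p in enumerate(observed_fn(perm)):
            mean_log[i] += math.log(p) / len(perms)
    return normalize([math.exp(v) for v in mean_log])


def average_priors(priors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-question priors from the small estimation subset."""
    n = len(priors[0])
    return [sum(p[i] for p in priors) / len(priors) for i in range(n)]


def debias(observed: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Divide observed ID probabilities by the prior and renormalize."""
    return normalize([o / p for o, p in zip(observed, prior)])


def toy_model(prior: Sequence[float], content: Sequence[float]):
    """Synthetic stand-in for an LLM whose output is prior x content, for checking the sketch."""
    def observed_fn(perm: List[int]) -> List[float]:
        return normalize([prior[i] * content[perm[i]] for i in range(len(perm))])
    return observed_fn
```

In use, `estimate_sample_prior` would run only on the small estimation subset (costing one forward pass per permutation), and `debias` would then be applied to the single default-order prediction for every remaining question.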
Risks & Boundaries
Limitations
PriDe assumes the debiased content distribution is invariant to option order; this may not hold for all prompts or models.
Transfer of estimated priors can degrade under large domain shifts; re-estimation may be needed.
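One way to write the invariance assumption behind these limitations is shown below. The notation is chosen here for illustration and is not taken from the summary: d_i is the i-th option ID, o_j the j-th option content, I a permutation of the options, x^I the permuted prompt, and Z a normalizing constant.

```latex
% Assumed factorization: observed ID probability = ID prior x order-invariant content term
P_{\mathrm{observed}}(d_i \mid q, x^{I})
  \;=\; \frac{1}{Z}\; P_{\mathrm{prior}}(d_i)\; P_{\mathrm{debiased}}(o_{I(i)} \mid q)

% Under the default ordering (I is the identity), debiasing divides out the prior
P_{\mathrm{debiased}}(o_i \mid q)
  \;\propto\; \frac{P_{\mathrm{observed}}(d_i \mid q, x)}{P_{\mathrm{prior}}(d_i)}
```

If the content term itself changes with option order, or the prior estimated on one domain does not match another, the division no longer isolates the content preference, which is where the limitations above apply.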
When Not To Use
When options refer to each other (e.g., 'A and B') or include 'none of the above', since permutations break semantics.
When you cannot permute options or change prompts in production (regulatory or UX constraints).
Failure Modes
Misestimated priors with too few estimation samples cause under- or over-correction.
Permutation during estimation can reduce prompt naturalness and temporarily lower performance on the estimation subset.

