Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
22
Why It Matters For Business
MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.
Summary TLDR
Modern LLMs show a strong selection bias in multiple-choice questions: they prefer some option IDs (e.g., 'A' or 'C') regardless of content. As a result, simple changes such as moving the correct answer between A/B/C/D cause large accuracy swings. The authors trace the main cause to token-level prior mass on option ID tokens and propose PriDe, a label-free, inference-time debiasing method that estimates the model's ID prior on a small sample (e.g., 2–5%) by permuting options and then uses that prior to correct later predictions. Evaluated on 20 LLMs across MMLU, ARC and CSQA, PriDe reduces imbalance in recalls and often raises accuracy with little extra compute.
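The estimate-then-correct idea can be illustrated with a minimal sketch. It assumes (this is not spelled out in the summary above) that the observed probability of an option ID factors into an ID prior times a content-dependent term, that the prior is read off by averaging log-probabilities over cyclic permutations of the options, and that correction divides by the averaged prior. The names `observed_fn`, `estimate_sample_prior`, `average_priors` and `debias` are hypothetical helpers for illustration, not the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cyclic_permutations(n):
    """Return the n cyclic shifts of the indices 0..n-1."""
    return [[(i + s) % n for i in range(n)] for s in range(n)]


def estimate_sample_prior(observed_fn, options):
    """Estimate the ID prior for one question.

    observed_fn(ordered_options) is a hypothetical callable returning the
    model's probabilities over option IDs (A, B, C, ...) for that ordering.
    Assumption: observed prob is proportional to prior(ID) * content(option).
    """
    n = len(options)
    perms = cyclic_permutations(n)
    log_sums = [0.0] * n
    for perm in perms:
        probs = observed_fn([options[j] for j in perm])
        for i, p in enumerate(probs):
            log_sums[i] += math.log(p)
    # Content terms average out across shifts; what remains tracks the prior.
    return softmax([s / len(perms) for s in log_sums])


def average_priors(priors):
    """Average per-question priors into one global prior."""
    n = len(priors[0])
    return [sum(p[i] for p in priors) / len(priors) for i in range(n)]


def debias(observed_probs, global_prior):
    """Divide out the prior and renormalize (assumed correction rule)."""
    scores = [p / q for p, q in zip(observed_probs, global_prior)]
    total = sum(scores)
    return [s / total for s in scores]
```

In use, `estimate_sample_prior` would run only on the small estimation subset (each question costs one call per cyclic shift), while `debias` is applied to the single default-order prediction for every remaining question.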
Problem Statement
Multiple-choice evaluations assume models pick answers based on content. In practice, many LLMs systematically prefer certain option IDs (selection bias). This makes MCQ scores unstable: moving the gold answer to a favored ID can raise accuracy by tens of points for some models, and moving it to a disfavored ID can lower accuracy by several points (e.g., gpt-3.5-turbo drops 67.2→60.9 when the correct answer is moved to D).
Main Contribution
Demonstrate widespread selection bias in LLMs across 20 models and three MCQ benchmarks (MMLU, ARC, CSQA).
Pinpoint token bias on option ID tokens (e.g., 'A','B','C','D') as a primary intrinsic cause of selection bias; position bias is present but irregular.
Propose PriDe: a label-free, inference-time debiasing method that estimates ID priors from a small test subset and corrects predictions with negligible extra cost.
Key Findings
Simple answer-moving changes cause large accuracy swings.
Selection bias is mainly driven by token-level priors on ID tokens, not just ordering.
PriDe can reduce recall imbalance and often improve accuracy with little compute.
Simple prompt hacks (explicit debias instruction or CoT) do not reliably fix selection bias.
Estimated priors transfer reasonably well across domains but can degrade if the domain gap is large.
Results
Accuracy
Recall imbalance (RStd) reduction by removing IDs
PriDe effectiveness (avg across models)
Who Should Care
What To Try In 7 Days
Run an 'answer-moving' test: move gold answers across A/B/C/D and record accuracy swings.
Measure recall balance (RStd) across option IDs to detect selection bias.
Implement PriDe: permute options on ~2–5% of live/test samples to estimate ID priors, then debias remaining predictions at inference.
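The answer-moving test in the first step can be sketched as follows. This is a minimal illustration that assumes the gold answer is placed at a target position while the other options keep their relative order; `predict_fn` is a hypothetical wrapper around whatever model is being evaluated, and the dataset layout is likewise an assumption.

```python
def move_gold(options, gold_index, target_index):
    """Return a copy of options with the gold answer at target_index.

    The remaining options keep their relative order (an assumption of
    this sketch).
    """
    rest = [o for i, o in enumerate(options) if i != gold_index]
    rest.insert(target_index, options[gold_index])
    return rest


def answer_moving_accuracy(predict_fn, dataset, num_options=4):
    """Accuracy when the gold answer is forced into each position in turn.

    dataset: list of (question, options, gold_index) tuples.
    predict_fn(question, options) -> predicted option index
    (hypothetical model wrapper).
    Returns one accuracy per target position, e.g. index 0 for 'A'.
    """
    accuracies = []
    for target in range(num_options):
        correct = 0
        for question, options, gold_index in dataset:
            moved = move_gold(options, gold_index, target)
            if predict_fn(question, moved) == target:
                correct += 1
        accuracies.append(correct / len(dataset))
    return accuracies
```

A large spread between the highest and lowest entries of the returned list indicates that the score depends on where the gold answer sits rather than on its content.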
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- PriDe assumes the debiased content distribution is invariant to option order; this may not hold for all prompts or models.
- Transfer of estimated priors can degrade under large domain shifts; re-estimation may be needed.
- For closed APIs that do not return token probabilities, PriDe must approximate priors by sampling outputs.
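For the closed-API case, one way to approximate the distribution over option IDs is to sample the model several times and count the answers. The sketch below assumes a hypothetical `sample_fn` that returns one option ID per call; the additive smoothing is an assumption added here so that no ID gets probability zero (which would break any later step that takes logarithms), not something stated in the source.

```python
from collections import Counter


def sampled_id_distribution(sample_fn, prompt, ids=("A", "B", "C", "D"),
                            num_samples=20, smoothing=1.0):
    """Approximate P(ID | prompt) from repeated sampling.

    sample_fn(prompt) -> a single option ID string (hypothetical API wrapper).
    Outputs that are not in `ids` are ignored.
    smoothing is an additive pseudo-count per ID (assumption of this sketch).
    """
    counts = Counter(sample_fn(prompt) for _ in range(num_samples))
    total = sum(counts[i] for i in ids) + smoothing * len(ids)
    return [(counts[i] + smoothing) / total for i in ids]
```

The number of samples trades API cost against estimate noise; with few samples the smoothing term dominates and the estimate drifts toward uniform.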
When Not To Use
- When options refer to each other (e.g., 'A and B') or include 'none of the above', since permutations break semantics.
- When you cannot permute options or change prompts in production (regulatory or UX constraints).
- If you require the absolute highest accuracy and are unwilling to change the evaluation protocol: PriDe targets robustness and balanced treatment of option positions.
Failure Modes
- Misestimated priors with too few estimation samples cause under- or over-correction.
- Permutation during estimation can reduce prompt naturalness and temporarily lower performance on the estimation subset.
- Position bias not captured by token prior may remain and vary irregularly across models/tasks.
Core Entities
Models
- gpt-3.5-turbo-0613
- llama-30B
- llama-2-70B
- llama-7B
- llama-13B
- llama-65B
- llama-2-13B
- vicuna-v1.3-33B
- vicuna-v1.3-13B
- falcon-40B
- falcon-inst-40B
Metrics
- RStd (std dev of recalls)
- Accuracy
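RStd can be computed directly from gold labels and predictions. The sketch below treats recall for an option ID as the fraction of questions with that gold ID that the model answered with the same ID, and uses the population standard deviation; whether the original work uses the population or sample form is not stated here, so that choice is an assumption.

```python
import statistics


def recalls_per_id(gold, pred, num_options):
    """Recall for each option ID (0 = 'A', 1 = 'B', ...).

    gold, pred: equal-length lists of option indices.
    Raises ValueError if some ID never appears as the gold answer,
    since its recall would be undefined.
    """
    recalls = []
    for i in range(num_options):
        total = sum(1 for g in gold if g == i)
        if total == 0:
            raise ValueError(f"no gold examples for option index {i}")
        hits = sum(1 for g, p in zip(gold, pred) if g == i and p == i)
        recalls.append(hits / total)
    return recalls


def rstd(gold, pred, num_options):
    """Standard deviation of per-ID recalls (population form, by assumption)."""
    return statistics.pstdev(recalls_per_id(gold, pred, num_options))
```

An RStd near zero means the model recovers the gold answer about equally often whichever ID it carries; a large value points to selection bias.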
Datasets
- MMLU
- ARC
- CommonsenseQA
Benchmarks
- MMLU
- ARC-Challenge
- CSQA (CommonsenseQA)
Context Entities
Models
- vicuna-v1.5
- falcon-7B
- llama-2-chat variants

