LLMs favor certain option IDs, making multiple-choice evaluation brittle

September 7, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

22

Authors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

Links

Abstract / PDF

Why It Matters For Business

MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.

Summary TLDR

Modern LLMs show a strong selection bias in multiple-choice questions: they prefer some option IDs (e.g., 'A' or 'C') regardless of content. As a result, simple changes such as moving the correct answer between A/B/C/D cause large accuracy swings. The authors trace the main cause to token-level prior mass on option ID tokens and propose PriDe, a label-free, inference-time debiasing method. PriDe estimates the model's ID prior by permuting options on a small sample (e.g., 2–5%) and then uses that prior to correct the remaining predictions. Evaluated on 20 LLMs across MMLU, ARC and CSQA, PriDe reduces recall imbalance across option IDs and often raises accuracy with little extra compute.

Problem Statement

Multiple-choice evaluations assume models pick answers based on content. In practice many LLMs systematically prefer certain option IDs (selection bias). This makes MCQ scores unstable: moving the golden answer to a favored ID can raise accuracy by double digits for some models (e.g., llama-30B rises 53.1→68.2 when it is moved to A), and moving it to a disfavored ID can drop accuracy by several points (e.g., gpt-3.5-turbo drops 67.2→60.9 when it is moved to D).
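
The answer-moving setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the prompt template, the `(question, options, gold_idx)` dataset layout, and the `predict_id` callable (which should return the option ID a model picks for a prompt) are all assumptions made here.

```python
from typing import Callable, List, Sequence, Tuple

OPTION_IDS = ["A", "B", "C", "D"]

def move_gold(options: List[str], gold_idx: int, target_idx: int) -> Tuple[List[str], int]:
    """Return a copy of `options` with the gold answer placed at `target_idx`.

    The other options keep their relative order.
    """
    rest = [o for i, o in enumerate(options) if i != gold_idx]
    moved = rest[:target_idx] + [options[gold_idx]] + rest[target_idx:]
    return moved, target_idx

def format_prompt(question: str, options: List[str]) -> str:
    """Assumed prompt template: question, lettered options, then 'Answer:'."""
    lines = [f"Question: {question}"]
    lines += [f"{oid}. {opt}" for oid, opt in zip(OPTION_IDS, options)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy_with_gold_at(
    dataset: Sequence[Tuple[str, List[str], int]],
    predict_id: Callable[[str], str],  # hypothetical: prompt -> "A"/"B"/"C"/"D"
    target_idx: int,
) -> float:
    """Accuracy when every gold answer is moved to the same position."""
    correct = 0
    for question, options, gold_idx in dataset:
        moved, new_gold = move_gold(options, gold_idx, target_idx)
        pred = predict_id(format_prompt(question, moved))
        correct += int(pred == OPTION_IDS[new_gold])
    return correct / len(dataset)
```

Running `accuracy_with_gold_at` once per target position and comparing the four numbers against the default-order accuracy gives the kind of swing reported for gpt-3.5-turbo and llama-30B.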

Main Contribution

Demonstrate widespread selection bias in LLMs across 20 models and three MCQ benchmarks (MMLU, ARC, CSQA).

Pinpoint token bias on option ID tokens (e.g., 'A','B','C','D') as a primary intrinsic cause of selection bias; position bias is present but irregular.

Propose PriDe: a label-free, inference-time debiasing method that estimates ID priors from a small test subset and corrects predictions with negligible extra cost.
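
One way to write down the idea behind PriDe is to treat the observed distribution over option IDs as a product of an ID prior and a content-based distribution. The notation below is assumed for illustration and may differ from the paper's: $d_i$ is the $i$-th option ID, $o_j$ the $j$-th option content, $q$ the question, $x$ the default-ordered options, $x^I$ the options under permutation $I$, and $f_I(i)$ the index of the content shown under $d_i$ after permutation $I$.

```latex
% Assumed decomposition: observed = prior over IDs x debiased content distribution
\[
P_{\text{observed}}(d_i \mid q, x^I) \;\propto\;
P_{\text{prior}}(d_i \mid q)\, P_{\text{debiased}}(o_{f_I(i)} \mid q, x)
\]

% Prior estimate on one estimation sample: average log-probabilities over permutations
\[
\widetilde{P}_{\text{prior}}(d_i \mid q) =
\operatorname{softmax}_i\!\left( \frac{1}{|\mathcal{I}|}
\sum_{I \in \mathcal{I}} \log P_{\text{observed}}(d_i \mid q, x^I) \right)
\]

% Correction on a new sample, using the prior averaged over the estimation subset
\[
\widetilde{P}_{\text{debiased}}(o_i \mid q, x) \;\propto\;
\frac{P_{\text{observed}}(d_i \mid q, x)}{\widetilde{P}_{\text{prior}}(d_i)}
\]
```

If $\mathcal{I}$ is the set of cyclic shifts, every content appears under every ID exactly once, so the content term contributes the same constant to each $d_i$ in the sum and cancels in the softmax, leaving only the prior.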

Key Findings

Simple answer-moving changes cause large accuracy swings.

Numbers: gpt-3.5-turbo MMLU: 67.2 → 60.9 (−6.3) when the golden answer is moved to D; llama-30B: 53.1 → 68.2 (+15.2) when moved to A

Selection bias is mainly driven by token-level priors on ID tokens, not just ordering.

Numbers: Removing option IDs markedly reduces the standard deviation of recalls (RStd) across models (e.g., an RStd drop from 5.5 → 1.0 reported for gpt-3.5-turbo)

PriDe can reduce recall imbalance and often improve accuracy with little compute.

Numbers: PriDe (using a small K, e.g., 5% of samples) reduces average RStd by roughly 6–9 points and can raise accuracy by ~1–4 pp on …

Simple prompt hacks (explicit debias instruction or CoT) do not reliably fix selection bias.

Numbers: A debias instruction and Chain-of-Thought produce little RStd reduction compared to PriDe (Table 2 shows modest changes)

Estimated priors transfer across domains reasonably well but can degrade if domain gap is large.

Numbers: Cross-domain debiasing shows reasonable transfer, but larger domain gaps (e.g., STEM→ARC) may lower accuracy (Figure 5)
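
The recall-imbalance metric (RStd) used in the findings above can be computed from gold and predicted option IDs. The sketch below assumes recalls are expressed in percentage points and uses the population standard deviation; the summary does not state which variant is used, so treat both as assumptions.

```python
from statistics import pstdev
from typing import Dict, Sequence

def recall_per_id(
    gold: Sequence[str],
    pred: Sequence[str],
    ids: Sequence[str] = ("A", "B", "C", "D"),
) -> Dict[str, float]:
    """Recall (in %) for each option ID: of the samples whose gold ID is X,
    the share the model also answered with X."""
    recalls = {}
    for oid in ids:
        n_gold = sum(1 for g in gold if g == oid)
        n_hit = sum(1 for g, p in zip(gold, pred) if g == oid and p == oid)
        recalls[oid] = 100.0 * n_hit / n_gold if n_gold else 0.0
    return recalls

def rstd(
    gold: Sequence[str],
    pred: Sequence[str],
    ids: Sequence[str] = ("A", "B", "C", "D"),
) -> float:
    """Standard deviation of per-ID recalls (population std, an assumption)."""
    return pstdev(list(recall_per_id(gold, pred, ids).values()))
```

A model that answers correctly regardless of position has equal recalls and an RStd near zero; a model that over-selects one ID has high recall on that ID, lower recall elsewhere, and a large RStd.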

Results

Accuracy

Value: gpt-3.5-turbo MMLU 67.2 → 60.9 (−6.3); llama-30B 53.1 → 68.2 (+15.2)

Baseline: default ordering

Recall imbalance (RStd) reduction by removing IDs

Value: Example: gpt-3.5-turbo RStd 5.5 → 1.0 (removing IDs)

Baseline: default prompt with A/B/C/D

PriDe effectiveness (avg across models)

Value: PriDe using a small prior sample (e.g., 5%) reduces avg RStd by ≈ 6–9 points and often raises accuracy by ≈ 1–4 pp

Baseline: default predictions

Who Should Care

What To Try In 7 Days

Run an 'answer-moving' test: move gold answers across A/B/C/D and record accuracy swings.

Measure recall balance (RStd) across option IDs to detect selection bias.

Implement PriDe: permute options on ~2–5% of live/test samples to estimate ID priors, then debias remaining predictions at inference.
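
The PriDe step above can be prototyped as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `id_probs_fn` is a hypothetical callable returning the model's probabilities over option IDs for a given question and option order, the estimation permutations are assumed to be cyclic shifts, and the global prior is assumed to be a plain average of per-sample priors.

```python
import math
from typing import Callable, List, Sequence, Tuple

EPS = 1e-12  # guards against log(0)

def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_on_sample(
    id_probs_fn: Callable[[str, List[str]], Sequence[float]],  # hypothetical model hook
    question: str,
    options: List[str],
) -> Tuple[List[float], List[float]]:
    """Run all cyclic shifts of one estimation sample.

    Returns (prior over IDs, debiased distribution over the original options).
    """
    n = len(options)
    id_logp = [0.0] * n       # summed log-probs per ID position
    content_logp = [0.0] * n  # summed log-probs per original option
    for shift in range(n):
        perm = [(i + shift) % n for i in range(n)]  # perm[i] = content under ID i
        probs = id_probs_fn(question, [options[j] for j in perm])
        for i, j in enumerate(perm):
            lp = math.log(probs[i] + EPS)
            id_logp[i] += lp
            content_logp[j] += lp
    prior = softmax([v / n for v in id_logp])
    debiased = softmax([v / n for v in content_logp])
    return prior, debiased

def average_prior(priors: Sequence[Sequence[float]]) -> List[float]:
    """Global prior: mean of the per-sample priors from the estimation subset."""
    return [sum(p[i] for p in priors) / len(priors) for i in range(len(priors[0]))]

def debias(observed: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Divide out the ID prior from one default-order prediction and renormalize."""
    return softmax([math.log(o + EPS) - math.log(p + EPS) for o, p in zip(observed, prior)])
```

In use, `estimate_on_sample` would run on the small estimation subset (n forward passes per sample), and `debias` would be applied to the single default-order prediction of every remaining sample, which is why the extra cost stays small.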

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • PriDe assumes the debiased content distribution is invariant to option order; this may not hold for all prompts or models.
  • Transfer of estimated priors can degrade under large domain shifts; re-estimation may be needed.
  • For closed APIs that do not return token probabilities, PriDe must approximate priors by sampling outputs.
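
For the closed-API case in the last limitation, one possible approximation is to sample several answers per prompt and turn the counts into a smoothed distribution over option IDs. This is a sketch of that idea only; the additive smoothing constant `alpha` and the choice to ignore unparseable outputs are assumptions, and the summary does not specify how the authors handle this case.

```python
from collections import Counter
from typing import List, Sequence

def id_probs_from_samples(
    sampled_ids: Sequence[str],
    ids: Sequence[str] = ("A", "B", "C", "D"),
    alpha: float = 1.0,  # additive smoothing so no ID gets probability 0
) -> List[float]:
    """Estimate P(ID) from repeated sampled answers to the same prompt.

    Outputs that are not a valid option ID are ignored.
    """
    counts = Counter(s for s in sampled_ids if s in ids)
    total = sum(counts[i] for i in ids) + alpha * len(ids)
    return [(counts[i] + alpha) / total for i in ids]
```

The smoothing matters because debiasing divides by the estimated prior: an ID that never appears in a small sample would otherwise receive a zero probability.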

When Not To Use

  • When options refer to each other (e.g., 'A and B') or include 'none of the above', since permutations break semantics.
  • When you cannot permute options or change prompts in production (regulatory or UX constraints).
  • If you require the absolute highest accuracy and are unwilling to change the evaluation protocol; PriDe focuses on robustness and fairness across option positions.
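
The first condition above (options that refer to each other or to their position) can be screened for before permuting. The heuristic below is an illustrative assumption, not part of PriDe: the phrase list and the ID-reference pattern are hand-picked and will miss some cases and flag some harmless ones.

```python
import re
from typing import Sequence

# Phrases such as "none of the above" or "all of the following"
_PHRASES = re.compile(
    r"\b(?:none|all|any|both|either|neither)\s+of\s+the\s+(?:above|following|other)",
    re.IGNORECASE,
)
# Explicit references to option IDs such as "A and B" (case-sensitive on purpose)
_ID_REFS = re.compile(r"\b[A-E]\b\s+(?:and|or|nor)\s+\b[A-E]\b")

def is_permutation_safe(options: Sequence[str]) -> bool:
    """Return False if any option looks order-dependent (heuristic)."""
    return not any(_PHRASES.search(o) or _ID_REFS.search(o) for o in options)
```

Questions that fail this check can be kept in their original order and excluded from prior estimation.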

Failure Modes

  • Misestimated priors with too few estimation samples cause under- or over-correction.
  • Permutation during estimation can reduce prompt naturalness and temporarily lower performance on the estimation subset.
  • Position bias not captured by token prior may remain and vary irregularly across models/tasks.

Core Entities

Models

  • gpt-3.5-turbo-0613
  • llama-30B
  • llama-2-70B
  • llama-7B
  • llama-13B
  • llama-65B
  • llama-2-13B
  • vicuna-v1.3-33B
  • vicuna-v1.3-13B
  • falcon-40B
  • falcon-inst-40B

Metrics

  • RStd (std dev of recalls)
  • Accuracy

Datasets

  • MMLU
  • ARC
  • CommonsenseQA

Benchmarks

  • MMLU
  • ARC-Challenge
  • CSQA (CommonsenseQA)

Context Entities

Models

  • vicuna-v1.5
  • falcon-7B
  • llama-2-chat variants