LLMs favor certain option IDs, making multiple-choice evaluation brittle

September 7, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is simple and inference-only, so it is easy to adopt; experiments cover many open models and three standard MCQ benchmarks, but gains vary by model and domain shift.

Citations: 22

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

Why It Matters For Business

MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.

Summary TLDR

Modern LLMs show a strong selection bias in multiple-choice questions: they prefer some option IDs (e.g., 'A' or 'C') regardless of content. As a result, simple changes such as moving the correct answer between A/B/C/D cause large accuracy swings. The authors trace the main cause to token-level prior mass on option ID tokens and propose PriDe, a label-free, inference-time debiasing method. PriDe estimates the model's ID prior by permuting options on a small sample (e.g., 2–5%) and then uses that prior to correct the remaining predictions. Evaluated on 20 LLMs across MMLU, ARC and CSQA, PriDe reduces the imbalance in per-ID recall and often raises accuracy with little extra compute.
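One way to write the idea down, as a sketch only: assume the observed distribution over option IDs factors into an ID prior and a content term that does not depend on option order (the assumption listed under Limitations). The symbols here are illustrative notation and may differ from the paper's.

```latex
% d_i: i-th option ID; o_j: j-th option content;
% I: a permutation that places content o_{I(i)} at ID d_i; Z_I: normalizer
P_{\mathrm{obs}}(d_i \mid q, x^{I})
  = Z_I^{-1}\, P_{\mathrm{prior}}(d_i \mid q)\, P_{\mathrm{content}}(o_{I(i)} \mid q, x)

% Over the n cyclic permutations \mathcal{I}, every content visits every ID once,
% so the content and normalizer terms are the same for all i and cancel in the softmax:
\widetilde{P}_{\mathrm{prior}}(d_i \mid q)
  = \operatorname{softmax}_i\!\left(
      \frac{1}{|\mathcal{I}|} \sum_{I \in \mathcal{I}} \log P_{\mathrm{obs}}(d_i \mid q, x^{I})
    \right)

% Debiasing a new sample shown once, in its default order, with the prior
% averaged over the estimation samples:
\widetilde{P}_{\mathrm{content}}(o_i \mid q, x)
  \propto \frac{P_{\mathrm{obs}}(d_i \mid q, x)}{\widetilde{P}_{\mathrm{prior}}(d_i)}
```

Under this factorization the estimation samples cost n forward passes each, while every other sample costs one pass plus a division.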

Problem Statement

Multiple-choice evaluations assume models pick answers based on content. In practice, many LLMs systematically prefer certain option IDs (selection bias). This makes MCQ scores unstable: moving the golden answer to a favored ID can raise accuracy by more than ten points for some models (e.g., llama-30B rises 53.1→68.2 when it is moved to A), and moving it to a disfavored ID can drop accuracy by several points (e.g., gpt-3.5-turbo drops 67.2→60.9 when it is moved to D).

Main Contribution

Demonstrate widespread selection bias in LLMs across 20 models and three MCQ benchmarks (MMLU, ARC, CSQA).

Pinpoint token bias on option ID tokens (e.g., 'A','B','C','D') as a primary intrinsic cause of selection bias; position bias is present but irregular.

Key Findings

Simple answer-moving changes cause large accuracy swings.

Numbers: gpt-3.5-turbo MMLU: 67.2 → 60.9 (−6.3) when the golden answer is moved to D; llama-30B: 53.1 → 68.2 (+15.2) when moved to A

Practical Use: Do not trust raw MCQ accuracy without testing for option-position sensitivity; run option-move tests to reveal instability.

Evidence Ref: Table 1
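An option-move test of this kind can be sketched in a few lines. The sketch assumes the gold option is swapped into the target slot (the paper's exact moving procedure may differ), and `predict` is a hypothetical hook that returns the index of the option the model picks; `always_a` is a toy stand-in for a maximally biased model.

```python
from string import ascii_uppercase
from typing import Callable, Dict, List, Sequence, Tuple

Item = Tuple[str, List[str], int]              # (question, options, gold index)
Predict = Callable[[str, Sequence[str]], int]  # hypothetical model hook

def move_gold(options: Sequence[str], gold: int, target: int) -> List[str]:
    """Return a copy of `options` with the gold option swapped into slot `target`."""
    moved = list(options)
    moved[gold], moved[target] = moved[target], moved[gold]
    return moved

def answer_moving_report(items: Sequence[Item], predict: Predict) -> Dict[str, float]:
    """Accuracy when the gold answer is forced into each option slot in turn."""
    n_options = len(items[0][1])
    report = {}
    for target in range(n_options):
        hits = sum(
            predict(question, move_gold(options, gold, target)) == target
            for question, options, gold in items
        )
        report[ascii_uppercase[target]] = hits / len(items)
    return report

def always_a(question: str, options: Sequence[str]) -> int:
    """Toy model that ignores content and always answers 'A'."""
    return 0

items = [("q1", ["x", "y", "z", "w"], 2), ("q2", ["x", "y", "z", "w"], 1)]
report = answer_moving_report(items, always_a)
print(report)  # {'A': 1.0, 'B': 0.0, 'C': 0.0, 'D': 0.0}
```

A content-driven model would show roughly equal accuracy in every slot; a wide spread across slots is the instability described above.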

Selection bias is mainly driven by token-level priors on ID tokens, not just ordering.

Numbers: Removing option IDs markedly reduces recall std (RStd) across models (example: RStd drops from 5.5 → 1.0 for gpt-3.5-turbo).

Practical Use: Expect bias to persist when you prompt models to output ID tokens; consider debiasing the ID-token preference rather than only reordering options.

Evidence Ref: Table 2, §2.4
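RStd can be computed directly from gold and predicted option indices. The sketch below assumes recalls are expressed in percentage points and uses the population standard deviation; whether the paper uses the population or sample form is not stated here, so treat that choice as an assumption.

```python
from statistics import pstdev
from typing import List, Sequence

def recall_per_id(golds: Sequence[int], preds: Sequence[int], n_options: int) -> List[float]:
    """Recall (in %) for each option ID that appears at least once as the gold answer."""
    recalls = []
    for option_id in range(n_options):
        total = sum(1 for g in golds if g == option_id)
        if total == 0:
            continue  # recall is undefined when the ID is never gold
        hits = sum(1 for g, p in zip(golds, preds) if g == option_id and p == option_id)
        recalls.append(100.0 * hits / total)
    return recalls

def rstd(golds: Sequence[int], preds: Sequence[int], n_options: int) -> float:
    """Standard deviation of per-ID recalls; 0 means perfectly balanced recalls."""
    return pstdev(recall_per_id(golds, preds, n_options))

# Toy data: the model over-selects ID 0, so recall is 100% for ID 0 and 50% for ID 1.
golds = [0, 0, 1, 1]
preds = [0, 0, 0, 1]
print(recall_per_id(golds, preds, 2), rstd(golds, preds, 2))  # [100.0, 50.0] 25.0
```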

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | gpt-3.5-turbo MMLU 67.2 → 60.9 (−6.3); llama-30B 53.1 → 68.2 (+15.2) | default ordering | example drops/boosts up to ~15 pp | MMLU (0-shot) | Table 1 in paper | Table 1
Recall imbalance (RStd) reduction by removing IDs | Example: gpt-3.5-turbo RStd 5.5 → 1.0 (removing IDs) | default prompt with A/B/C/D | RStd drop up to multiple points | MMLU / ARC (0-shot) | Table 2, Table 3 | Table 2

What To Try In 7 Days

Run an 'answer-moving' test: move gold answers across A/B/C/D and record accuracy swings.

Measure recall balance (RStd) across option IDs to detect selection bias.

Implement PriDe: permute options on ~2–5% of live/test samples to estimate ID priors, then debias remaining predictions at inference.
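The third step can be prototyped as below. This is a minimal sketch, not the authors' released code: it assumes the observed probability over option IDs is proportional to an ID prior times an order-invariant content score, and that cyclic permutations are used for estimation. `get_probs` is a hypothetical hook returning the model's normalized probabilities over option IDs, and `toy_get_probs` is a synthetic model with a known prior, used only to check the arithmetic.

```python
import math
from typing import Callable, List, Sequence

ProbFn = Callable[[str, Sequence[str]], List[float]]  # hypothetical model hook

def softmax(logits: Sequence[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_prior(question: str, options: Sequence[str], get_probs: ProbFn) -> List[float]:
    """Estimate the ID prior for one sample by averaging log-probs over cyclic permutations."""
    n = len(options)
    mean_log = [0.0] * n
    for shift in range(n):
        permuted = [options[(i + shift) % n] for i in range(n)]
        for i, p in enumerate(get_probs(question, permuted)):
            mean_log[i] += math.log(max(p, 1e-12)) / n
    return softmax(mean_log)

def average_priors(priors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-sample priors into one global prior."""
    return [sum(p[i] for p in priors) / len(priors) for i in range(len(priors[0]))]

def debias(obs_probs: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Divide out the ID prior and renormalize."""
    scores = [p / max(q, 1e-12) for p, q in zip(obs_probs, prior)]
    total = sum(scores)
    return [s / total for s in scores]

# Synthetic model (assumption for the demo): strong preference for ID 'A'.
TOY_PRIOR = [0.55, 0.15, 0.15, 0.15]

def toy_get_probs(question: str, options: Sequence[str]) -> List[float]:
    content = [0.4 if o == "right" else 0.2 for o in options]
    joint = [b * c for b, c in zip(TOY_PRIOR, content)]
    total = sum(joint)
    return [j / total for j in joint]

# Estimate the prior on a small calibration subset (no labels needed).
calib = [("q1", ["right", "w1", "w2", "w3"]), ("q2", ["w1", "right", "w2", "w3"])]
prior = average_priors([estimate_prior(q, o, toy_get_probs) for q, o in calib])

# Debias a new sample whose correct option sits at the disfavored ID 'D'.
raw = toy_get_probs("q3", ["w1", "w2", "w3", "right"])
fixed = debias(raw, prior)
raw_pick, fixed_pick = raw.index(max(raw)), fixed.index(max(fixed))
print(raw_pick, fixed_pick)  # 0 3
```

In the toy run the raw prediction is 'A' (index 0) even though the correct content is at 'D'; after dividing out the estimated prior the prediction moves to index 3. Real models will not factor this cleanly, so the recovered prior is only an approximation there.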

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

PriDe assumes the debiased content distribution is invariant to option order; this may not hold for all prompts or models.

Transfer of estimated priors can degrade under large domain shifts; re-estimation may be needed.

When Not To Use

When options refer to each other (e.g., 'A and B') or include 'none of the above', since permutations break semantics.

When you cannot permute options or change prompts in production (regulatory or UX constraints).
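A cheap pre-filter can flag questions whose options should not be permuted. The patterns below are a heuristic sketch chosen for illustration, not something taken from the paper; they will miss some phrasings and can raise false positives (e.g., "vitamin A and D"), so flagged items are best reviewed rather than silently dropped.

```python
import re
from typing import Sequence

# Options such as "None of the above" or "All of the above" depend on option order.
_POSITIONAL = re.compile(
    r"\b(?:none|all|both|neither|any)\s+of\s+the\s+(?:above|below|others?)\b",
    re.IGNORECASE,
)
# Options such as "A and B" or "B or C" refer to other options by ID (case-sensitive).
_ID_REFERENCE = re.compile(r"\b[A-E]\s+(?:and|or)\s+[A-E]\b")

def is_permutation_safe(options: Sequence[str]) -> bool:
    """Return False if any option appears to refer to other options or to their order."""
    return not any(
        _POSITIONAL.search(option) or _ID_REFERENCE.search(option)
        for option in options
    )

print(is_permutation_safe(["Paris", "London", "Rome", "Berlin"]))             # True
print(is_permutation_safe(["Paris", "London", "A and B", "None of the above"]))  # False
```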

Failure Modes

Misestimated priors with too few estimation samples cause under- or over-correction.

Permutation during estimation can reduce prompt naturalness and temporarily lower performance on the estimation subset.

Core Entities

Models

gpt-3.5-turbo-0613, llama-30B, llama-2-70B, llama-7B, llama-13B, llama-65B, llama-2-13B, vicuna-v1.3-33B, vicuna-v1.3-13B, falcon-40B, falcon-inst-40B

Metrics

RStd (std dev of recalls), Accuracy

Datasets

MMLU, ARC, CommonsenseQA

Benchmarks

MMLU, ARC-Challenge, CSQA (CommonsenseQA)

Context Entities

Models

vicuna-v1.5, falcon-7B, llama-2-chat variants