LLMs favor certain option IDs, making multiple-choice evaluation brittle

Overview

Decision Snapshot: Ready For Pilot

The method is simple and inference-only, so it is easy to adopt; experiments cover many open models and three standard MCQ benchmarks, but gains vary by model and domain shift.

Citations: 22

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

Why It Matters For Business

MCQ-format evaluation and automated grading can be unstable: models may pick "A" or "C" by habit, producing misleading scores. Fixing this improves model reliability with minimal compute.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

Modern LLMs show a strong selection bias in multiple-choice questions: they prefer some option IDs (e.g., 'A' or 'C') regardless of content. As a result, simply moving the correct answer between A/B/C/D can cause large accuracy swings. The authors trace the main cause to token-level prior probability mass on option ID tokens and propose PriDe, a label-free, inference-time debiasing method. PriDe estimates the model's ID prior by permuting options on a small sample (e.g., 2–5% of the data) and then uses that prior to correct the remaining predictions. Evaluated on 20 LLMs across MMLU, ARC and CSQA, PriDe reduces the imbalance in per-ID recalls and often raises accuracy with little extra compute.
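Why permuting options can isolate the ID prior is easier to see in symbols. The following is a sketch, not a quotation from the paper: the notation is introduced here for illustration, and it assumes that the observed probability of ID token d_i factorizes into an ID prior and a content term that does not depend on option order. It also assumes a permutation set (for example the n cyclic shifts) in which every option content appears at every ID position equally often; f_I(i) denotes the index of the content placed at position i under permutation I.

```latex
% Assumed factorization: ID prior times an order-invariant content term.
P_{\text{obs}}(d_i \mid q, x^{I})
  = Z_{q,x^{I}}^{-1}\, P_{\text{prior}}(d_i)\, P_{\text{content}}(o_{f_I(i)} \mid q)

% Average the log-probabilities over the n permutations in \mathcal{I}.
% Every content o_j visits position i exactly once, so the content and
% normalizer terms collapse into a constant that does not depend on i:
\frac{1}{n} \sum_{I \in \mathcal{I}} \log P_{\text{obs}}(d_i \mid q, x^{I})
  = \log P_{\text{prior}}(d_i)
  + \underbrace{\frac{1}{n} \sum_{j=1}^{n} \log P_{\text{content}}(o_j \mid q)
  - \frac{1}{n} \sum_{I \in \mathcal{I}} \log Z_{q,x^{I}}}_{\text{independent of } i}

% A softmax over i therefore recovers the prior; dividing it out debiases.
\hat{P}_{\text{prior}}(d_i)
  = \operatorname{softmax}_i \Big( \tfrac{1}{n} \sum_{I \in \mathcal{I}} \log P_{\text{obs}}(d_i \mid q, x^{I}) \Big),
\qquad
\hat{P}_{\text{content}}(o_i \mid q) \propto \frac{P_{\text{obs}}(d_i \mid q, x)}{\hat{P}_{\text{prior}}(d_i)}
```

If the factorization does not hold for a given prompt or model, the constant term is no longer constant in i and the estimated prior absorbs content effects.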

Problem Statement

Multiple-choice evaluations assume models pick answers based on content. In practice, many LLMs systematically prefer certain option IDs (selection bias), which makes MCQ scores unstable. Moving the golden answer to a favored ID can raise accuracy by about 15 points for some models (e.g., llama-30B rises 53.1 → 68.2 when it is moved to A), and moving it to a disfavored ID can drop accuracy by several points (e.g., gpt-3.5-turbo drops 67.2 → 60.9 when it is moved to D).
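An answer-moving check of this kind can be sketched in a few lines. This is an illustrative harness rather than the authors' evaluation script: `predict` is a hypothetical callback that returns the index of the option the model picks, the dataset layout `(question, options, gold_index)` is an assumption, and so is the choice to keep the remaining options in their original relative order.

```python
from typing import Callable, Sequence

IDS = "ABCD"  # assumes four options per question

Sample = tuple[str, Sequence[str], int]          # (question, options, gold index)
Predict = Callable[[str, Sequence[str]], int]    # hypothetical model wrapper


def move_gold(options: Sequence[str], gold: int, target: int) -> list[str]:
    """Reorder options so the gold option sits at index `target`.

    The other options keep their original relative order.
    """
    rest = [o for i, o in enumerate(options) if i != gold]
    rest.insert(target, options[gold])
    return rest


def accuracy_when_gold_at(dataset: Sequence[Sample], target: int, predict: Predict) -> float:
    """Accuracy when every gold answer is moved to position `target`."""
    correct = 0
    for question, options, gold in dataset:
        moved = move_gold(options, gold, target)
        correct += predict(question, moved) == target
    return correct / len(dataset)


def answer_moving_report(dataset: Sequence[Sample], predict: Predict) -> dict[str, float]:
    """Accuracy per target ID; a wide spread indicates selection bias."""
    return {IDS[t]: accuracy_when_gold_at(dataset, t, predict) for t in range(len(IDS))}
```

A model that reads content and ignores IDs should score roughly the same for every target position. A model that always answers "A" scores 1.0 for target A and 0.0 elsewhere.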

Main Contribution

Demonstrate widespread selection bias in LLMs across 20 models and three MCQ benchmarks (MMLU, ARC, CSQA).

Pinpoint token bias on option ID tokens (e.g., 'A','B','C','D') as a primary intrinsic cause of selection bias; position bias is present but irregular.

Key Findings

Simple answer-moving changes cause large accuracy swings.

Numbers: gpt-3.5-turbo MMLU: 67.2 → 60.9 (−6.3) when the golden answer is moved to D; llama-30B: 53.1 → 68.2 (+15.2) when it is moved to A

Practical Use: Do not trust raw MCQ accuracy without testing for option-position sensitivity; run option-move tests to reveal instability.

Evidence Ref: Table 1

Selection bias is mainly driven by token-level priors on ID tokens, not just ordering.

Numbers: Removing option IDs markedly reduces the standard deviation of recalls (RStd) across models (e.g., RStd drops from 5.5 to 1.0 for gpt-3.5-turbo).

Practical Use: Expect bias to persist when you prompt models to output ID tokens; consider debiasing the ID-token preference rather than only reordering options.

Evidence Ref: Table 2, §2.4
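RStd, the recall-imbalance metric used in these findings, is straightforward to compute from gold and predicted IDs. The sketch below assumes recalls are expressed in percent and uses the population standard deviation; whether the paper uses the population or sample form is not stated in this summary, so treat that choice as an assumption.

```python
from collections import Counter
from statistics import pstdev
from typing import Sequence


def recall_per_id(golds: Sequence[str], preds: Sequence[str], ids: str = "ABCD") -> dict[str, float]:
    """Recall (in percent) for each option ID that appears as a gold label."""
    total = Counter(golds)
    hits = Counter(g for g, p in zip(golds, preds) if g == p)
    return {i: 100.0 * hits[i] / total[i] for i in ids if total[i]}


def rstd(golds: Sequence[str], preds: Sequence[str], ids: str = "ABCD") -> float:
    """Standard deviation of per-ID recalls (population form, an assumption)."""
    return pstdev(list(recall_per_id(golds, preds, ids).values()))
```

An unbiased model has similar recall for every ID and an RStd near zero; a model that over-selects one ID has high recall there and low recall elsewhere, which inflates RStd.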

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	gpt-3.5-turbo MMLU 67.2 → 60.9 (−6.3); llama-30B 53.1 → 68.2 (+15.2)	default ordering	−6.3 pp (gold moved to D); +15.2 pp (gold moved to A)	MMLU (0-shot)	Table 1 in paper	Table 1
Recall imbalance (RStd) reduction by removing IDs	Example: gpt-3.5-turbo RStd 5.5 → 1.0 (removing IDs)	default prompt with A/B/C/D	−4.5 in this example	MMLU / ARC (0-shot)	Table 2, Table 3	Table 2

What To Try In 7 Days

Run an 'answer-moving' test: move gold answers across A/B/C/D and record accuracy swings.

Measure recall balance (RStd) across option IDs to detect selection bias.

Implement PriDe: permute options on ~2–5% of live/test samples to estimate ID priors, then debias remaining predictions at inference.
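The third step can be sketched as follows. This is a minimal illustration, not the authors' released code (see the repository under Code URLs for that). It assumes cyclic permutations of the options, a hypothetical `probs_fn` callback that returns the model's probabilities over the ID tokens for a given option order, and a prior obtained by averaging per-sample estimates over the estimation subset.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: (question, ordered option texts) -> probabilities
# over the ID tokens A, B, C, ... in that order.
ProbFn = Callable[[str, Sequence[str]], Sequence[float]]


def softmax(xs: Sequence[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_prior(question: str, options: Sequence[str], probs_fn: ProbFn) -> list[float]:
    """Estimate the ID prior on one sample from all cyclic shifts of its options."""
    n = len(options)
    mean_log = [0.0] * n
    for shift in range(n):
        rotated = list(options[shift:]) + list(options[:shift])
        probs = probs_fn(question, rotated)
        for i, p in enumerate(probs):
            # Clamp to avoid log(0); averaging logs cancels the content term.
            mean_log[i] += math.log(max(p, 1e-12)) / n
    return softmax(mean_log)


def estimate_prior(samples: Sequence[tuple[str, Sequence[str]]], probs_fn: ProbFn) -> list[float]:
    """Average per-sample priors over the small estimation subset (no labels needed)."""
    priors = [sample_prior(q, opts, probs_fn) for q, opts in samples]
    n = len(priors[0])
    return [sum(p[i] for p in priors) / len(priors) for i in range(n)]


def debias(probs: Sequence[float], prior: Sequence[float]) -> list[float]:
    """Divide out the estimated ID prior and renormalize."""
    scores = [p / max(pr, 1e-12) for p, pr in zip(probs, prior)]
    total = sum(scores)
    return [s / total for s in scores]
```

Estimation costs one forward pass per cyclic shift on the estimation subset only; every remaining sample needs a single pass in its default order followed by `debias`. If accuracy or RStd degrades after a domain change, re-running `estimate_prior` on a fresh subset is the natural first remedy.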

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/chujiezheng/LLM-MCQ-Bias

Data URLs

https://github.com/chujiezheng/LLM-MCQ-Bias (preprocessed benchmarks and scripts)

Risks & Boundaries

Limitations

PriDe assumes the debiased content distribution is invariant to option order; this may not hold for all prompts or models.

Transfer of estimated priors can degrade under large domain shifts; re-estimation may be needed.

When Not To Use

When options refer to each other (e.g., 'A and B') or include 'none of the above', since permutations break semantics.

When you cannot permute options or change prompts in production (regulatory or UX constraints).

Failure Modes

Misestimated priors with too few estimation samples cause under- or over-correction.

Permutation during estimation can reduce prompt naturalness and temporarily lower performance on the estimation subset.

Core Entities

Models

gpt-3.5-turbo-0613, llama-30B, llama-2-70B, llama-7B, llama-13B, llama-65B, llama-2-13B, vicuna-v1.3-33B, vicuna-v1.3-13B, falcon-40B, falcon-inst-40B

Metrics

RStd (standard deviation of recalls), Accuracy

Datasets

MMLU, ARC, CommonsenseQA

Benchmarks

MMLU, ARC-Challenge, CSQA (CommonsenseQA)

Context Entities

Models

vicuna-v1.5, falcon-7B, llama-2-chat variants
