LLM graders prefer an answer's position — simple calibration and a little human help fix it

May 29, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

Simple, low-cost calibration steps improved agreement with humans on an 80-example benchmark, but the methods were validated on limited data and need broader testing.

Citations: 29

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 45%

Authors

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you auto-grade or compare models with LLMs, order effects can flip results and mislead decisions. Applying Multiple Evidence Calibration and Balanced Position Calibration (MEC+BPC) together with targeted human checks improves reliability and cuts annotation cost.

Who Should Care

Summary TLDR

The paper shows that LLMs used as automatic evaluators (GPT-4, ChatGPT) are biased by the order in which candidate answers are presented. Swapping the answer order often flips the outcome (high conflict rates). The authors propose three fixes: Multiple Evidence Calibration (MEC: ask the LLM to produce evidence before scoring, and sample k such evaluations), Balanced Position Calibration (BPC: average scores across swapped positions), and Human-in-the-Loop calibration, which uses a diversity score (BPDE) to select a small subset of examples for human review. On the 80-question Vicuna benchmark, these methods raise agreement with human majority labels from ~53%/44% (vanilla GPT-4/ChatGPT) to ~74%/71% at a 20% human annotation cost, and cut annotation cost by ~39%.

Problem Statement

People often use LLMs as automatic judges of model answers. But these LLM-evaluators show strong positional bias: their scores and pairwise choices change when you swap the order of the two candidate responses. This makes comparisons unreliable and easy to manipulate.

Main Contribution

Empirical demonstration that GPT-4 and ChatGPT have strong positional bias when used as pairwise evaluators.

A lightweight calibration pipeline: Multiple Evidence Calibration (MEC), Balanced Position Calibration (BPC), and Human-in-the-Loop with BPDE.
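The human-in-the-loop step can be sketched as follows. This summary does not spell out the BPDE formula, so the sketch assumes it is the entropy of the win/tie/lose verdicts pooled across all sampled evaluations in both answer orders; the function names `bpde` and `select_for_human_review` and the 20% default budget are illustrative choices, not the authors' code.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bpde(verdicts_original: Sequence[str], verdicts_swapped: Sequence[str]) -> float:
    """Entropy of pooled verdicts ("win" / "tie" / "lose"), all expressed for
    the same underlying answer regardless of which slot it occupied.

    ASSUMPTION: this entropy form is our reading of "diversity score".
    """
    pooled = list(verdicts_original) + list(verdicts_swapped)
    if not pooled:
        raise ValueError("need at least one verdict")
    total = len(pooled)
    probabilities = [count / total for count in Counter(pooled).values()]
    return sum(-p * math.log(p) for p in probabilities)


def select_for_human_review(
    bpde_by_example: Dict[str, float], budget_fraction: float = 0.2
) -> List[str]:
    """Return the ids of the highest-BPDE examples, up to the human budget."""
    # The small epsilon guards against float noise pushing ceil() up by one.
    n_pick = math.ceil(budget_fraction * len(bpde_by_example) - 1e-9)
    ranked = sorted(bpde_by_example, key=bpde_by_example.get, reverse=True)
    return ranked[:n_pick]
```

An example whose verdicts all agree scores 0 and is left to the LLM, while one that flips between "win" and "lose" when the order is swapped scores log 2 and is sent to a human first.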

Key Findings

LLM evaluators frequently conflict when candidate order is swapped.

Numbers: GPT-4 conflict rate 46.3% (Vicuna vs ChatGPT); ChatGPT 82.5% (Table 2)

Practical Use: Don't trust a single run with a fixed answer order; test by swapping the order, or average across both orders.

Evidence Ref: Table 2
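A minimal sketch of a swap-order conflict check, assuming a conflict means the evaluator names a different winning answer once the two candidates trade places. The `judge` callable and its "first" / "second" / "tie" return values are hypothetical stand-ins for whatever LLM call you use.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge: (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def verdict(judge: Judge, question: str, answer_a: str, answer_b: str, a_first: bool) -> str:
    """Run the judge in one order and report the winner as "A", "B", or "tie"."""
    if a_first:
        slot = judge(question, answer_a, answer_b)
        return {"first": "A", "second": "B", "tie": "tie"}[slot]
    slot = judge(question, answer_b, answer_a)
    return {"first": "B", "second": "A", "tie": "tie"}[slot]


def conflict_rate(judge: Judge, examples: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (question, answer_a, answer_b) examples whose winner changes on swap."""
    if not examples:
        raise ValueError("need at least one example")
    conflicts = sum(
        verdict(judge, q, a, b, a_first=True) != verdict(judge, q, a, b, a_first=False)
        for q, a, b in examples
    )
    return conflicts / len(examples)
```

A judge that always picks the first slot yields a conflict rate of 1.0, and a judge whose choice depends only on the answers themselves yields 0.0.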

Evaluators prefer a fixed slot: GPT-4 favors the first answer and ChatGPT favors the second.

Numbers: Vicuna-13B win rates as Assistant1 vs as Assistant2: 51.3% vs 23.8% with GPT-4 as evaluator; 2.5% vs 82.5% with ChatGPT as evaluator (Table 2)

Practical Use: When comparing models, average scores across positions (BPC) to remove the position advantage.

Evidence Ref: Table 2
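Slot preference can be measured by tallying how often the same model wins in each position. The sketch below assumes each record is a hypothetical `(model_slot, winning_slot)` pair, with `winning_slot` set to 0 for a tie; the counts in the usage note are illustrative and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def win_rate_by_slot(records: Iterable[Tuple[int, int]]) -> Dict[int, float]:
    """Win rate of one model, broken down by the slot (1 or 2) it was shown in.

    Each record is (model_slot, winning_slot); winning_slot == 0 means a tie.
    """
    wins: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for model_slot, winning_slot in records:
        totals[model_slot] += 1
        if winning_slot == model_slot:
            wins[model_slot] += 1
    return {slot: wins[slot] / totals[slot] for slot in sorted(totals)}
```

For example, a model that wins 41 of 80 comparisons when shown first but only 19 of 80 when shown second has slot win rates of 0.5125 and 0.2375, a gap that points to evaluator position bias rather than answer quality.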

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Conflict Rate (order sensitivity) | GPT-4: 46.3% (Vicuna vs ChatGPT); GPT-4: 5.0% (Vicuna vs Alpaca) | - | - | Vicuna benchmark (80 examples) | Table 2 shows per-pair conflict rates when swapping assistant positions | Table 2
Positional win-rate skew | Vicuna-13B win rate (GPT-4): 51.3% as Assistant1 vs 23.8% as Assistant2 | - | - | Vicuna benchmark (80 examples) | Table 2 demonstrates slot-dependent win rates | Table 2

What To Try In 7 Days

Run a swap-order sensitivity test: compare evaluator outputs after swapping candidate positions.

Add evidence-first prompts (MEC): ask the LLM to list reasons then score; sample k=3.

Apply Balanced Position Calibration (BPC): evaluate both orders and average scores before deciding a winner.
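The last two steps can be combined in one small loop: sample k evidence-first evaluations in each order, then average each answer's scores over both positions. This is a sketch under stated assumptions: the prompt wording, the `judge` signature (returning one score per slot), and the deliberately biased stub judge are all hypothetical, and a real judge would send `MEC_PROMPT` to an LLM at a non-zero temperature.

```python
from statistics import mean
from typing import Callable, Tuple

# Hypothetical evidence-first prompt; a real judge would fill it in and call an LLM.
MEC_PROMPT = (
    "Question: {question}\n"
    "Assistant 1: {first}\nAssistant 2: {second}\n"
    "First list the evidence for your judgment, then give each assistant a score from 1 to 10."
)

# Hypothetical judge: (question, first_answer, second_answer) -> (first_score, second_score)
ScoreJudge = Callable[[str, str, str], Tuple[float, float]]


def calibrated_scores(
    judge: ScoreJudge, question: str, answer_a: str, answer_b: str, k: int = 3
) -> Tuple[float, float]:
    """Average k sampled scores per order (MEC) across both orders (BPC)."""
    a_scores, b_scores = [], []
    for _ in range(k):
        first, second = judge(question, answer_a, answer_b)  # A shown first
        a_scores.append(first)
        b_scores.append(second)
        first, second = judge(question, answer_b, answer_a)  # B shown first
        b_scores.append(first)
        a_scores.append(second)
    return mean(a_scores), mean(b_scores)


def pick_winner(score_a: float, score_b: float) -> str:
    """Name the answer with the higher calibrated score, or "tie"."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Stub judge for illustration only: adds a fixed bonus to whichever answer is first.
TRUE_QUALITY = {"good": 7.0, "ok": 6.0}


def biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    return TRUE_QUALITY[first] + 2.0, TRUE_QUALITY[second]
```

With the stub, a single run that shows "ok" first scores it 8.0 against 7.0 and picks the weaker answer. Averaging over both orders gives 8.0 for "good" and 7.0 for "ok", so the first-slot bonus cancels out.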

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluation is on 80 Vicuna questions; results may not generalize to all datasets or languages.

Human annotations were done by three authors, risking subtle bias in the gold labels.

When Not To Use

When you already have full, reliable human evaluation for all examples.

When candidate answers differ greatly in quality (large score gap), since positional bias matters less in that case.

Failure Modes

High positional sensitivity remains when score gaps are small; calibration reduces but may not eliminate flips.

Too large a k or a poorly chosen sampling temperature can reduce the benefit of MEC and increase cost.

Core Entities

Models

GPT-4, ChatGPT, Vicuna-13B, Alpaca-13B

Metrics

Accuracy, kappa, conflict rate, BPDE (Balanced Position Diversity Entropy)

Datasets

Vicuna Benchmark (80 questions)

Benchmarks

Vicuna evaluation pipeline