Overview
Simple, low-cost calibration steps improved agreement with humans on an 80-example benchmark, but the methods were validated on limited data and need broader testing.
Citations: 29
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
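If you auto-grade or compare models with LLMs, the order in which candidate answers are presented can flip results and mislead decisions. Combining Multiple Evidence Calibration (MEC) and Balanced Position Calibration (BPC) with targeted human checks improves reliability and cuts annotation cost.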
Summary TLDR
The paper shows that LLMs (GPT-4, ChatGPT) used as automatic evaluators are biased by the order in which candidate answers are presented. Swapping the answer order often flips the outcome, producing high conflict rates. The authors propose three fixes: Multiple Evidence Calibration (MEC), which asks the LLM to write out evidence before scoring and samples k such responses; Balanced Position Calibration (BPC), which averages scores across both answer orders; and Human-in-the-Loop calibration, which uses a diversity score (BPDE) to select a small subset of examples for human review. On the 80-question Vicuna benchmark, these methods raise alignment with human majority labels from ~53%/44% (vanilla GPT-4/ChatGPT) to ~74%/71% at a 20% human annotation cost, and cut annotation cost by ~39%.
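The alignment figures above compare evaluator verdicts against the majority label of several human annotators. A minimal sketch of that comparison follows. The function names, the label strings, and the rule that a split with no strict plurality counts as a tie are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label among annotator votes.

    Assumption: when the top two labels have equal counts, the example
    is treated as a "tie".
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]

def agreement_rate(evaluator_labels, human_votes):
    """Fraction of examples where the evaluator matches the human majority."""
    gold = [majority_label(v) for v in human_votes]
    matches = sum(e == g for e, g in zip(evaluator_labels, gold))
    return matches / len(gold)

# Toy data: three annotators per example, labels name the preferred answer.
human_votes = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "B", "tie"],
    ["tie", "tie", "A"],
]
evaluator_labels = ["A", "A", "tie", "B"]
rate = agreement_rate(evaluator_labels, human_votes)
```

On this toy data the evaluator matches the majority on two of four examples.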
Problem Statement
People often use LLMs as automatic judges of model answers. But these LLM-evaluators show strong positional bias: their scores and pairwise choices change when you swap the order of the two candidate responses. This makes comparisons unreliable and easy to manipulate.
Main Contribution
Empirical demonstration that GPT-4 and ChatGPT have strong positional bias when used as pairwise evaluators.
A lightweight calibration pipeline: Multiple Evidence Calibration (MEC), Balanced Position Calibration (BPC), and Human-in-the-Loop with BPDE.
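The human-in-the-loop step sends only the most contested examples to annotators. The sketch below illustrates the idea with a plain Shannon entropy over the verdicts collected for each example (across samples and both answer orders). It is a stand-in and not necessarily the paper's exact BPDE formula; the function names, the verdict strings, and the `budget` parameter are hypothetical.

```python
import math
from collections import Counter

def verdict_entropy(verdicts):
    """Shannon entropy (in nats) of the verdicts for one example.

    Verdicts are expressed from one candidate's point of view
    ("win", "tie", "lose"), regardless of the slot it appeared in.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_for_human_review(verdicts_by_example, budget=0.2):
    """Return the IDs of the highest-entropy examples, up to the budget."""
    n_review = math.ceil(budget * len(verdicts_by_example))
    ranked = sorted(
        verdicts_by_example,
        key=lambda ex: verdict_entropy(verdicts_by_example[ex]),
        reverse=True,  # most contested examples first
    )
    return ranked[:n_review]

# Toy data: six verdicts per example (e.g. k=3 samples in each order).
verdicts_by_example = {
    "q1": ["win"] * 6,
    "q2": ["win", "lose", "win", "lose", "tie", "tie"],
    "q3": ["win"] * 5 + ["lose"],
    "q4": ["win"] * 6,
    "q5": ["lose"] * 6,
}
to_review = select_for_human_review(verdicts_by_example, budget=0.2)
```

With a 20% budget over five examples, only the most contested one (`q2`) is routed to a human; the rest keep their calibrated LLM verdicts.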
Key Findings
LLM evaluators frequently conflict when candidate order is swapped.
Evaluators prefer a fixed slot: GPT-4 favors the first answer and ChatGPT favors the second.
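Both findings can be measured from two evaluation passes over the same pairs, one in each order. The following is a minimal sketch, assuming verdicts are recorded by slot ("1", "2", or "tie") and that a conflict means the evaluator names a different winner once the swap is undone. The function names are hypothetical.

```python
# Maps a verdict from the swapped pass back to the original slot numbering.
FLIP = {"1": "2", "2": "1", "tie": "tie"}

def conflict_rate(original_order, swapped_order):
    """Fraction of pairs whose winner changes when the answers are swapped."""
    conflicts = sum(o != FLIP[s] for o, s in zip(original_order, swapped_order))
    return conflicts / len(original_order)

def slot_preference(original_order, swapped_order):
    """Share of verdicts going to each slot, pooled over both passes.

    For an unbiased evaluator, slots "1" and "2" should get equal shares,
    because every answer appears once in each slot.
    """
    pooled = list(original_order) + list(swapped_order)
    return {slot: pooled.count(slot) / len(pooled) for slot in ("1", "2", "tie")}

# Toy data: four pairs judged in both orders.
original_order = ["1", "1", "2", "tie"]
swapped_order = ["2", "1", "1", "tie"]

rate = conflict_rate(original_order, swapped_order)
shares = slot_preference(original_order, swapped_order)
```

Here one of four pairs conflicts, and slot 1 collects half of all verdicts against a quarter for slot 2, which would suggest a first-slot preference.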
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Conflict Rate (order sensitivity) | GPT-4: 46.3% (Vicuna vs ChatGPT); GPT-4: 5.0% (Vicuna vs Alpaca) | — | — | Vicuna benchmark (80 examples) | Table 2 shows per-pair conflict rates when swapping assistant positions | Table 2 |
| Positional win-rate skew | Vicuna-13B win rate (GPT-4): 51.3% as Assistant1 vs 23.8% as Assistant2 | — | — | Vicuna benchmark (80 examples) | Table 2 demonstrates slot-dependent win rates | Table 2 |
What To Try In 7 Days
Run a swap-order sensitivity test: compare evaluator outputs after swapping candidate positions.
Add evidence-first prompts (MEC): ask the LLM to list reasons then score; sample k=3.
Apply Balanced Position Calibration (BPC): evaluate both orders and average scores before deciding a winner.
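The MEC and BPC steps above can be combined in a few lines. This is a sketch under stated assumptions, not the authors' implementation: `judge` is a hypothetical callable that takes `(question, first, second)` and returns one score per slot, the prompt wording is invented for illustration, and `biased_stub_judge` is a toy stand-in for a real LLM call.

```python
from statistics import mean

# Hypothetical evidence-first prompt (MEC): reasons first, scores last.
# A real judge would format this, call the LLM, and parse the two scores.
EVIDENCE_FIRST_PROMPT = (
    "Question: {question}\n"
    "Assistant 1: {first}\n"
    "Assistant 2: {second}\n"
    "First explain the evidence for your judgment, "
    "then give each assistant a score from 1 to 10."
)

def calibrated_scores(judge, question, answer_a, answer_b, k=3):
    """Average scores over k samples (MEC) and both answer orders (BPC)."""
    scores_a, scores_b = [], []
    for _ in range(k):
        # Original order: A in slot 1, B in slot 2.
        first, second = judge(question, answer_a, answer_b)
        scores_a.append(first)
        scores_b.append(second)
        # Swapped order: B in slot 1, A in slot 2.
        first, second = judge(question, answer_b, answer_a)
        scores_b.append(first)
        scores_a.append(second)
    return mean(scores_a), mean(scores_b)

def pick_winner(score_a, score_b):
    """Decide the winner from calibrated scores."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"

def biased_stub_judge(question, first, second):
    """Toy judge: scores by answer length and adds a bonus to slot 1."""
    return len(first) + 2, len(second)

single_pass = pick_winner(*biased_stub_judge("q", "abc", "abcd"))
score_a, score_b = calibrated_scores(biased_stub_judge, "q", "abc", "abcd", k=3)
calibrated = pick_winner(score_a, score_b)
```

With the toy judge, a single pass picks A only because A sits in the favored first slot; after averaging both orders, the slot bonus cancels out and B wins. The cost is 2k judge calls per pair instead of one.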
Risks & Boundaries
Limitations
Evaluation is on 80 Vicuna questions; results may not generalize to all datasets or languages.
Human annotations were done by three authors, risking subtle bias in the gold labels.
When Not To Use
When you already have full, reliable human evaluation for all examples.
When candidate answers differ vastly in quality (a large score gap), since positional bias matters less in that case.
Failure Modes
High positional sensitivity remains when score gaps are small; calibration reduces but may not eliminate flips.
Too large a k or a poorly chosen sampling temperature can reduce MEC's benefits and increase cost.

