Use strong LLMs (e.g., GPT-4) as scalable judges for human preference with checks for bias and math errors

June 9, 2023, 8 min

Overview

Decision Snapshot: Needs Validation

The paper gives strong empirical evidence (expert and crowd votes) that GPT-4 approximates human preference judgments, but recommends simple defenses and human checks for known failure modes.

Citations: 433

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 13

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Why It Matters For Business

High-quality LLMs (e.g., GPT-4) can automate preference labeling at ~80–85% human agreement, drastically cutting the time and cost of human evaluations for product iterations while remaining explainable.

Summary TLDR

The paper builds two human-preference benchmarks (MT-bench and Chatbot Arena) and evaluates whether large LLMs can act as automated judges. High-end LLMs like GPT-4 match expert and crowd human preferences at about 80–85% agreement on non-tied votes, making them a scalable surrogate for preference labeling. However, automated judges show predictable failure modes: position bias, verbosity bias, occasional self-preference, and weak grading on some math and reasoning questions. Each can be mitigated with practical fixes: swapping answer positions, few-shot examples, reference-guided or chain-of-thought (CoT) prompts, or fine-tuning an open model. The paper releases 80 MT-bench prompts, 3K expert votes, and ~30K arena votes.

Problem Statement

Standard benchmarks miss the open-ended, multi-turn, instruction-following behaviors that determine human preference for chat assistants. Human labeling is expensive. Can strong LLMs reliably judge which chatbot replies humans prefer, and what are their failure modes?

Main Contribution

Two human-preference resources: MT-bench (80 multi-turn questions) and Chatbot Arena (30K crowd votes).

A systematic study of the 'LLM-as-a-judge' idea, showing GPT-4 closely matches human preferences under safeguards.

Key Findings

GPT-4 judgments align with human experts on non-tied MT-bench votes.

Numbers: 85% agreement (MT-bench non-tie, Table 5)

Practical Use: You can use GPT-4 to approximate expert preference labels at scale, but keep human checks for edge cases.

Evidence Ref: Table 5
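To make the agreement number concrete, here is a minimal sketch of how a non-tie agreement rate could be computed from paired judge and human votes. The function name, the vote labels ("A", "B", "tie"), and the rule of dropping a vote when either side called a tie are illustrative assumptions, not the paper's exact procedure.

```python
from typing import Iterable, Tuple

TIE = "tie"  # assumed label for a tied vote


def non_tie_agreement(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of votes where judge and human pick the same winner.

    Each pair is (judge_vote, human_vote). Votes where either side
    called a tie are dropped before counting (an assumption).
    """
    kept = [(j, h) for j, h in pairs if j != TIE and h != TIE]
    if not kept:
        raise ValueError("no non-tied votes to compare")
    matches = sum(1 for j, h in kept if j == h)
    return matches / len(kept)


# Toy data: (judge_vote, human_vote) for five comparisons.
votes = [
    ("A", "A"),
    ("B", "A"),
    ("tie", "B"),
    ("B", "B"),
    ("A", "tie"),
]

# Three votes survive the tie filter; two of them agree.
print(f"non-tie agreement: {non_tie_agreement(votes):.2f}")
```

Running the same function on your own past human votes gives a number you can compare against the reported figure before trusting the judge at scale.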

Position bias causes inconsistent judgments; swapping answers or few-shot prompts reduces it.

Numbers: GPT-4 consistency ~65% (zero-shot) → 77.5% (few-shot), Table 12

Practical Use: Randomize or swap answer order and treat inconsistent pairs as ties to avoid false wins.

Evidence Ref: Table 12
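The swap-and-tie defense can be sketched as follows. This is an illustration under stated assumptions: the `Judge` callable signature, the "first"/"second"/"tie" return values, and both toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: takes (question, first_answer, second_answer)
# and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str,
                    answer_a: str, answer_b: str) -> Tuple[str, bool]:
    """Judge the pair in both orders; return (verdict, consistent).

    The verdict is "A", "B", or "tie". If the two orderings disagree,
    the pair is recorded as a tie.
    """
    first_pass = judge(question, answer_a, answer_b)
    second_pass = judge(question, answer_b, answer_a)
    # Map positional verdicts back to answer identities.
    v1 = {"first": "A", "second": "B", "tie": "tie"}[first_pass]
    v2 = {"first": "B", "second": "A", "tie": "tie"}[second_pass]
    consistent = v1 == v2
    return (v1 if consistent else "tie"), consistent


def consistency_rate(judge: Judge,
                     items: Iterable[Tuple[str, str, str]]) -> float:
    """Share of (question, answer_a, answer_b) items judged the same in both orders."""
    flags = [judge_with_swap(judge, q, a, b)[1] for q, a, b in items]
    if not flags:
        raise ValueError("no items to judge")
    return sum(flags) / len(flags)


# Toy judges for demonstration only.
def first_biased_judge(question: str, first: str, second: str) -> str:
    return "first"  # always prefers whatever is listed first


def length_judge(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


items = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
print(consistency_rate(first_biased_judge, items))  # fully position-biased
print(consistency_rate(length_judge, items))        # order-independent
```

A judge that only looks at position flips its verdict on every swap, so all of its pairs collapse to ties instead of producing false wins.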

Results

Metric: GPT-4 agreement with humans (MT-bench, non-tie)
Value: 85%
Baseline: human-human 81% (experts)
Delta: +4 pp
Split / Dataset: MT-bench (non-tied votes)
Evidence: Table 5 shows 85% agreement between GPT-4 single-answer grading and humans on non-tie votes
Evidence Ref: Table 5

Metric: Judge failure under 'repetitive list' verbosity attack
Value: GPT-4 8.7%; GPT-3.5 91.3%; Claude-v1 91.3%
Baseline: ideal 0%
Delta: large difference vs other judges
Split / Dataset: 23 list-containing answers from MT-bench
Evidence: Table 3 shows failure rates for repetitive-list attack
Evidence Ref: Table 3
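As a quick sanity check on the figures above, the sketch below finds which failure counts out of 23 answers round to the reported percentages, and recomputes the agreement delta. The underlying counts are not stated in this summary, so the results are an inference from the percentages, and the helper name is hypothetical.

```python
from typing import List


def counts_matching(rate_percent: float, total: int, tol: float = 0.05) -> List[int]:
    """Return every failure count k in 0..total whose rate rounds to rate_percent.

    tol=0.05 corresponds to percentages reported to one decimal place.
    """
    return [k for k in range(total + 1)
            if abs(100 * k / total - rate_percent) < tol]


TOTAL_ANSWERS = 23  # list-containing answers used in the attack (from the results above)

gpt4_counts = counts_matching(8.7, TOTAL_ANSWERS)
weaker_counts = counts_matching(91.3, TOTAL_ANSWERS)
print("GPT-4 failures consistent with 8.7%:", gpt4_counts)
print("GPT-3.5 / Claude-v1 failures consistent with 91.3%:", weaker_counts)

# Delta between judge-human and human-human agreement, in percentage points.
delta_pp = 85 - 81
print("agreement delta (pp):", delta_pp)
```

With only 23 answers, each single failure moves the rate by more than four percentage points, which is worth keeping in mind when comparing judges on this attack.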

What To Try In 7 Days

Run GPT-4 single-answer grading on a small sample of your chat outputs and compare with past human votes.

Implement position-swapping: auto-evaluate each pair twice with swapped order and treat mismatches as ties.

Add reference-guided grading for math-like or fact-critical prompts, or use CoT prompts for reasoning items.
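For the reference-guided grading step, a minimal sketch might look like the following. The prompt wording, the 1 to 10 scale, and the `[[n]]` rating format are assumptions chosen for illustration rather than the paper's exact template, and no real LLM API is called here.

```python
import re
from typing import Optional


def build_reference_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a grading prompt that shows the judge a trusted reference answer.

    The wording is a hypothetical template, not the paper's prompt.
    """
    return (
        "You are grading an AI assistant's answer.\n"
        "Compare it against the reference answer, then rate it from 1 to 10.\n"
        'End with the rating in the form "Rating: [[n]]".\n\n'
        f"[Question]\n{question}\n\n"
        f"[Reference Answer]\n{reference}\n\n"
        f"[Assistant Answer]\n{answer}\n"
    )


# Matches a number wrapped in double square brackets, e.g. [[7]] or [[8.5]].
_RATING = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judge_output: str) -> Optional[float]:
    """Extract the last [[n]] rating; return None if missing or out of range."""
    matches = _RATING.findall(judge_output)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 1 <= score <= 10 else None


prompt = build_reference_prompt("What is 17 * 6?", "102", "17 * 6 = 112")
print(prompt)
print(parse_rating("The product is wrong; the reference gives 102. Rating: [[2]]"))
```

Returning `None` for unparseable or out-of-range output lets you route those items to a human check instead of silently recording a bad score.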

Risks & Boundaries

Limitations

Paper focuses on helpfulness (preference) and does not evaluate honesty or safety metrics.

Single combined helpfulness score hides components like accuracy, relevance, and creativity.

When Not To Use

When absolute factual correctness is required and ground truth exists (use reference-based evaluation instead).

For safety or harmfulness audits without retooling prompts to measure honesty/harmlessness.

Failure Modes

Position bias: the judge tends to favor the first-listed answer unless the order is swapped or randomized.

Verbosity bias: longer but repetitive answers can be judged better by weaker judges.

Core Entities

Models

GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Vicuna-7B, LLaMA-13B, Alpaca-13B, Koala-13B, Dolly-12B

Metrics

agreement, consistency (position swap), failure rate (attack), win rate, MT-bench score

Datasets

MT-bench, Chatbot Arena, ShareGPT

Benchmarks

MMLU, TruthfulQA

Context Entities

Models

Vicuna-13B-Fine-Tune (judge), Vicuna-13B-Zero-Shot

Metrics

pairwise agreement, single-answer grading agreement, position bias consistency, verbosity attack failure rate

Datasets

ShareGPT cleaned conversations, 3K expert votes, 30K arena votes

Benchmarks

MT-bench (80 multi-turn), Chatbot Arena (crowdsourced battles)