Use strong LLMs (e.g., GPT-4) as scalable judges for human preference with checks for bias and math errors

June 9, 2023 · 8 min read

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

433

Authors

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

Why It Matters For Business

High-quality LLMs (e.g., GPT-4) can automate preference labeling at ~80–85% human agreement, drastically cutting the time and cost of human evaluations for product iterations while remaining explainable.

Summary TLDR

The paper builds two human-preference resources (MT-bench and Chatbot Arena) and evaluates whether large LLMs can act as automated judges. High-end LLMs like GPT-4 match expert and crowd human preferences at about 80–85% agreement on non-tied votes, making them a scalable surrogate for preference labeling. However, automated judges show predictable failure modes: position bias, verbosity bias, occasional self-preference, and weak grading on some math and reasoning questions. Each can be mitigated with practical fixes (swapping positions, few-shot examples, reference-guided or chain-of-thought (CoT) prompts, or fine-tuning an open model). The paper releases 80 MT-bench prompts, 3K expert votes, and ~30K arena votes.

Problem Statement

Standard benchmarks miss the open-ended, multi-turn, instruction-following behaviors that determine human preference for chat assistants. Human labeling is expensive. Can strong LLMs reliably judge which chatbot replies humans prefer, and what are their failure modes?

Main Contribution

Two human-preference resources: MT-bench (80 multi-turn questions) and Chatbot Arena (30K crowd votes).

A systematic study of the 'LLM-as-a-judge' idea, showing GPT-4 closely matches human preferences under safeguards.

Practical mitigations for judge failure modes: swap positions, few-shot prompts, chain-of-thought and reference-guided grading, and fine-tuning an open model judge (Vicuna).

Key Findings

GPT-4 judgments align with human experts on non-tied MT-bench votes.

Numbers: 85% agreement (MT-bench non-tie, Table 5)

Position bias causes inconsistent judgments; swapping answers or few-shot prompts reduces it.

Numbers: GPT-4 consistency ~65% (zero-shot) → 77.5% (few-shot) (Table 12)

Verbosity (repetitive-list) attack fools many judges but not GPT-4.

Numbers: Failure rate: Claude-v1 91.3%, GPT-3.5 91.3%, GPT-4 8.7% (Table 3)

Grading math/reasoning can fail unless guided by references or internal reasoning.

Numbers: Math judge failure: default 14/20 (70%), CoT 6/20 (30%), reference 3/20 (15%) (Table 4)

An open model fine-tuned on human votes becomes a usable, cheaper judge.

Numbers: Vicuna consistency improved from ~16% (zero-shot) to 65% (fine-tuned); agreement without ties 85.5% (Appendix F)
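The two metrics quoted throughout these findings, agreement on non-tied votes and consistency under a position swap, can be computed with a few lines of code. The sketch below is illustrative only: the function names, the `"tie"` label, and the choice to drop a pair when either side voted a tie are assumptions, not the paper's released code.

```python
from typing import Sequence

TIE = "tie"  # assumed label for a tied vote


def agreement_without_ties(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of votes where judge and human name the same winner,
    counted only over pairs where neither side voted a tie (assumption)."""
    pairs = [(j, h) for j, h in zip(judge, human) if j != TIE and h != TIE]
    if not pairs:
        raise ValueError("no non-tied votes to compare")
    return sum(j == h for j, h in pairs) / len(pairs)


def consistency(first_pass: Sequence[str], swapped_pass: Sequence[str]) -> float:
    """Fraction of comparisons where the judge names the same winning model
    before and after the two answers trade positions.

    Verdicts must be model names (not positions), so equal entries mean
    the judge was unaffected by the swap."""
    if len(first_pass) != len(swapped_pass) or not first_pass:
        raise ValueError("passes must be non-empty and the same length")
    matches = sum(a == b for a, b in zip(first_pass, swapped_pass))
    return matches / len(first_pass)
```

For example, `agreement_without_ties(["a", "b", "tie", "a"], ["a", "a", "b", "tie"])` keeps only the first two votes and reports agreement on one of them.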

Results

GPT-4 agreement with humans (MT-bench, non-tie)

Value: 85%

Baseline: human-human 81% (experts)

Judge failure under 'repetitive list' verbosity attack

Value: GPT-4 8.7%; GPT-3.5 91.3%; Claude-v1 91.3%

Baseline: ideal 0%

Math grading failure rates under prompts

Value: Default 14/20 (70%); CoT 6/20 (30%); Reference 3/20 (15%)

Baseline: Reference-guided best

MT-bench score (GPT-4 judge, 1–10 scale per turn, averaged)

Value: GPT-4: 8.99; GPT-3.5: 7.94; Vicuna-13B: 6.39

Baseline: MT-bench scoring rubric

Vicuna judge fine-tune agreement (excluding ties)

Value: 85.5%

Baseline: random 50%
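The failure rates reported for the verbosity attack are simple tallies: the share of attacked pairs in which the judge prefers the padded answer. A minimal sketch of that tally follows, assuming a hypothetical `judge` callable that returns `"1"`, `"2"`, or `"tie"`; the toy `length_biased_judge` exists only to exercise the function and does not model any real LLM.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: judge(question, answer_1, answer_2) -> "1" | "2" | "tie"
Judge = Callable[[str, str, str], str]


def attack_failure_rate(judge: Judge, cases: Iterable[Tuple[str, str, str]]) -> float:
    """Share of cases where the judge prefers the padded answer.

    Each case is (question, original_answer, padded_answer); the padded
    answer is always shown in position 2."""
    cases = list(cases)
    if not cases:
        raise ValueError("no attack cases supplied")
    failures = sum(judge(q, original, padded) == "2" for q, original, padded in cases)
    return failures / len(cases)


def length_biased_judge(question: str, answer_1: str, answer_2: str) -> str:
    """Toy judge that always prefers the longer answer (worst case)."""
    if len(answer_1) == len(answer_2):
        return "tie"
    return "1" if len(answer_1) > len(answer_2) else "2"
```

Because the padded answer always sits in position 2 here, a judge with position bias would skew the tally; running each case in both orders is one way to separate the two effects.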

Who Should Care

What To Try In 7 Days

Run GPT-4 single-answer grading on a small sample of your chat outputs and compare with past human votes.

Implement position-swapping: auto-evaluate each pair twice with swapped order and treat mismatches as ties.

Add reference-guided grading for math-like or fact-critical prompts, or use CoT prompts for reasoning items.
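The position-swapping step above can be sketched as a thin wrapper around whatever judge call you already have. The `judge` callable and its `"A"`/`"B"`/`"tie"` return values are hypothetical stand-ins for an LLM API call; only the swap-and-compare logic is the point.

```python
from typing import Callable

# Hypothetical interface: judge(question, first_answer, second_answer)
# returns "A" if the first-listed answer wins, "B" if the second, else "tie".
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_x: str, answer_y: str) -> str:
    """Evaluate a pair twice with the order swapped.

    Returns "x" or "y" only when both passes pick the same answer;
    any mismatch (or a tie in either pass) is reported as "tie"."""
    first = judge(question, answer_x, answer_y)   # x shown in position A
    second = judge(question, answer_y, answer_x)  # y shown in position A

    # Translate positional verdicts back to answer identities.
    first_winner = {"A": "x", "B": "y"}.get(first, "tie")
    second_winner = {"A": "y", "B": "x"}.get(second, "tie")

    return first_winner if first_winner == second_winner else "tie"
```

A judge that always picks the first-listed answer yields `"tie"` for every pair under this wrapper, which is the intended behavior: a purely positional preference carries no signal about answer quality.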

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Paper focuses on helpfulness (preference) and does not evaluate honesty or safety metrics.
  • Single combined helpfulness score hides components like accuracy, relevance, and creativity.
  • LLM judges can be biased (position, verbosity, self-preference) and err on math/reasoning without references.
  • Top-performing judges tested are proprietary (cost and access constraints).

When Not To Use

  • When absolute factual correctness is required and ground-truth exists (use reference-based evaluation).
  • For safety or harmfulness audits without retooling prompts to measure honesty/harmlessness.
  • When small performance differences must be measured without human confirmation.
  • Where regulatory or legal decisions require human ratification.
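Where a trusted answer exists, the reference-based evaluation recommended in the first item can be as simple as placing that answer in the judge prompt ahead of the candidates. The sketch below only assembles the prompt text; the wording, the section labels, and the `[[A]]`/`[[B]]`/`[[C]]` verdict convention are illustrative assumptions, not the paper's exact template.

```python
def build_reference_guided_prompt(
    question: str, reference_answer: str, answer_a: str, answer_b: str
) -> str:
    """Assemble a pairwise grading prompt that shows the judge a trusted
    reference answer before the two candidate answers."""
    sections = [
        "You are comparing two assistant answers to the same question.",
        "Treat the reference answer as ground truth when checking correctness.",
        f"[Question]\n{question}",
        f"[Reference Answer]\n{reference_answer}",
        f"[Assistant A]\n{answer_a}",
        f"[Assistant B]\n{answer_b}",
        # Assumed output convention; adapt to whatever your parser expects.
        "Explain briefly, then end with exactly one of: [[A]], [[B]], or [[C]] for a tie.",
    ]
    return "\n\n".join(sections)
```

Showing the reference first gives the judge a fixed point to check candidates against, so it is less likely to be led by a confident but incorrect candidate answer.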

Failure Modes

  • Position bias: the judge tends to favor an answer because of where it is listed (often the first position); swap or randomize the order to detect and offset it.
  • Verbosity bias: longer but repetitive answers can be judged better by weaker judges.
  • Self-enhancement bias: judges may favor their own model outputs slightly.
  • Math/reasoning misgrading: judges can be misled by incorrect candidate answers.
  • Template/output parsing errors: weaker open models often return malformed outputs that do not follow the judge template and cannot be parsed.
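The parsing failure mode above is easier to measure if malformed judge outputs are detected explicitly instead of being silently coerced into a verdict. A minimal sketch follows, assuming the judge was instructed to emit a double-bracketed verdict such as `[[A]]`, `[[B]]`, `[[C]]` (tie), or a numeric `[[7]]`; that convention and both function names are assumptions for illustration.

```python
import re
from typing import Optional

_VERDICT = re.compile(r"\[\[(A|B|C)\]\]")
_RATING = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_pairwise(text: str) -> Optional[str]:
    """Return "A", "B", or "tie", or None if no well-formed verdict is found.

    Uses the last match so a verdict placed after the explanation wins."""
    matches = _VERDICT.findall(text)
    if not matches:
        return None
    return {"A": "A", "B": "B", "C": "tie"}[matches[-1]]


def parse_rating(text: str, low: float, high: float) -> Optional[float]:
    """Return the last bracketed rating if it lies in [low, high], else None."""
    matches = _RATING.findall(text)
    if not matches:
        return None
    value = float(matches[-1])
    return value if low <= value <= high else None
```

Counting the `None` results per judge gives a direct error-format rate, which can be reported alongside agreement and consistency.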

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • Claude-v1
  • Vicuna-13B
  • Vicuna-7B
  • LLaMA-13B
  • Alpaca-13B
  • Koala-13B
  • Dolly-12B

Metrics

  • agreement
  • consistency (position swap)
  • failure rate (attack)
  • win rate
  • MT-bench score

Datasets

  • MT-bench
  • Chatbot Arena
  • ShareGPT

Benchmarks

  • MMLU
  • TruthfulQA

Context Entities

Models

  • Vicuna-13B-Fine-Tune (judge)
  • Vicuna-13B-Zero-Shot

Metrics

  • pairwise agreement
  • single-answer grading agreement
  • position bias consistency
  • verbosity attack failure rate

Datasets

  • ShareGPT cleaned conversations
  • 3K expert votes
  • 30K arena votes

Benchmarks

  • MT-bench (80 multi-turn)
  • Chatbot Arena (crowdsourced battles)