COBBLER shows many LLMs are biased evaluators and disagree with humans

September 29, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is large and systematic, but results focus on QA prompts and use specific prompting templates, so applicability beyond the studied setup requires extra validation.

Citations: 24

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 30%

Novelty: 45%

Authors

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Using LLMs as automatic scorers risks amplifying biases and diverging from human judgments, which can corrupt leaderboards, model selection, or downstream data labeling.

Summary TLDR

The authors introduce COBBLER, a benchmark that tests six cognitive biases when LLMs act as pairwise evaluators on 50 QA prompts. They evaluate 16 models (3B–175B+), producing ~630k pairwise comparisons, and run human studies for comparison. Key findings: LLM evaluators make biased choices in many comparisons (≈40% overall), bandwagon and distraction prompts strongly shift model judgments (>70% for many models), and average agreement with human rankings is low (RBO ≈ 0.44). The paper concludes that LLMs are not yet reliable replacements for human annotators.

Problem Statement

People increasingly use LLMs to judge text quality. But LLMs may amplify human-like cognitive biases and give unreliable rankings. The paper asks: how biased are LLMs when used as automatic evaluators, and how well do their rankings match humans?

Main Contribution

COBBLER: a bias benchmark testing six cognitive biases for LLMs used as pairwise evaluators in QA.

Large-scale evaluation: 16 popular LLMs (3B–175B+) on 50 QA prompts, producing ~630k pairwise evaluation samples.

Key Findings

LLMs show biased evaluation choices in a large fraction of comparisons

Numbers: ≈40% of comparisons across models were labeled biased

Practical Use: Do not trust raw LLM pairwise judgments as unbiased; add human checks or bias tests before auto-labeling.

Evidence Ref: Abstract; Sec. 1; Fig. 2; Table 2

Average agreement between LLM rankings and human rankings is low

Numbers: Average RBO = 0.44 between model and human rankings

Practical Use: If you need human-aligned evaluation, validate LLM judges on human data and expect sizable mismatch.

Evidence Ref: Abstract; Sec. 5.2; Fig. 3
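
For readers who want to reproduce this kind of agreement score, here is a minimal sketch of Rank-Biased Overlap between a model ranking and a human ranking. It assumes the extrapolated RBO form from Webber et al. (2010) for two equal-length rankings without duplicate items, and a persistence parameter p = 0.9. The RBO variant and p value used in the paper are not stated in this summary, so both are assumptions, and the function name `rbo_ext` is illustrative.

```python
from typing import Hashable, Sequence


def rbo_ext(s: Sequence[Hashable], t: Sequence[Hashable], p: float = 0.9) -> float:
    """Extrapolated RBO for two equal-length rankings with no duplicates.

    Assumption: p (top-weightedness) defaults to 0.9; this is not taken
    from the paper.
    """
    if len(s) != len(t) or len(s) == 0:
        raise ValueError("rankings must be non-empty and of equal length")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    k = len(s)
    seen_s, seen_t = set(), set()
    overlap = 0        # X_d: size of the intersection of the two top-d prefixes
    weighted = 0.0     # running sum of (X_d / d) * p**d

    for d in range(1, k + 1):
        a, b = s[d - 1], t[d - 1]
        if a == b:
            overlap += 1
        else:
            if a in seen_t:
                overlap += 1
            if b in seen_s:
                overlap += 1
        seen_s.add(a)
        seen_t.add(b)
        weighted += (overlap / d) * p ** d

    # Tail extrapolation term plus the weighted prefix agreement.
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted
```

With this definition, identical rankings score 1.0 and fully disjoint rankings score 0.0. Averaging the score over per-prompt rankings would give a single agreement number comparable in spirit, though not necessarily in exact value, to the 0.44 reported above.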

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Proportion of pairwise comparisons labeled biased (average across models) | ≈40% | RANDOM threshold per-bias | | 50 QA instructions (ELI5 + strategyQA) | Abstract; Sec. 1; Fig. 2; Table 2 | Abstract; Table 2
Average human–model agreement (RBO) | 0.44 | human–human average RBO = 0.54 | -0.10 vs human–human | N=13 ranking over 50 instructions | Sec. 5.2; Fig. 3; Table 5 | Sec. 5.2

What To Try In 7 Days

Run COBBLER or a subset on your planned LLM-evaluator to measure order, bandwagon, and distraction biases.

Add a quick human spot-check: sample ~100 LLM judgments and compute RBO against human labels.

Remove social-statistic-like text and irrelevant context from evaluation prompts and re-run a small test.
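
As a starting point for the order-bias check in the list above, here is a minimal sketch of a position-swap test. It assumes a hypothetical `judge` callable that takes a prompt and two responses and returns "first" or "second"; that interface, and the rule of flagging a pair when the same slot wins in both orderings, are illustrative assumptions and may not match COBBLER's exact protocol.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: (prompt, first_response, second_response)
# -> "first" or "second". Wrap your own LLM call to match it.
Judge = Callable[[str, str, str], str]


def order_bias_rate(judge: Judge, items: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the judge picks the same slot after swapping."""
    flagged = 0
    total = 0
    for prompt, resp_x, resp_y in items:
        pick_original = judge(prompt, resp_x, resp_y)
        pick_swapped = judge(prompt, resp_y, resp_x)
        total += 1
        # Same slot in both orderings means the choice followed position,
        # not content.
        if pick_original == pick_swapped:
            flagged += 1
    return flagged / total if total else 0.0


# Stub judges for illustration only; they stand in for real model calls.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) > len(second) else "second"


pairs = [
    ("Why is the sky blue?", "Short answer.", "A much longer, detailed answer."),
    ("What is 2 + 2?", "It is 4 because 2 + 2 = 4.", "4"),
]
```

On these stubs, `order_bias_rate(always_first, pairs)` is 1.0 and `order_bias_rate(prefers_longer, pairs)` is 0.0, since the second judge follows content rather than position. The same swap-and-compare pattern could be reused to test prompts with and without bandwagon-style or distracting text.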

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ELI5

BIG-Bench (strategyQA)

Risks & Boundaries

Limitations

Study focuses on QA prompts (ELI5 and strategyQA); results may differ on other tasks.

Some models produced many invalid evaluations; conclusions apply to valid outputs only.
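
Because invalid outputs are excluded from the bias numbers, it helps to report the valid response rate alongside any bias rate. The sketch below shows one way to tally both. The `Judgment` record and its fields are hypothetical, and how a response is deemed valid or biased is left to your own parsing and bias definitions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Judgment:
    valid: bool                    # did the evaluator return a parseable preference?
    biased: Optional[bool] = None  # only meaningful when valid is True


def summarize(judgments: Iterable[Judgment]) -> dict:
    """Return the valid response rate and the biased share among valid outputs."""
    items = list(judgments)
    valid = [j for j in items if j.valid]
    valid_rate = len(valid) / len(items) if items else 0.0
    # Bias is computed over valid outputs only; None signals "no valid outputs".
    biased_rate = (
        sum(1 for j in valid if j.biased) / len(valid) if valid else None
    )
    return {"valid_rate": valid_rate, "biased_rate_among_valid": biased_rate}
```

A model with a low valid rate can look deceptively unbiased simply because few of its outputs are counted, so the two numbers are best read together.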

When Not To Use

As the sole evaluator for high-stakes or production labeling tasks without human oversight.

For tasks outside the QA-style prompts used in the paper without re-validating biases.

Failure Modes

Self-preference (egocentric): models favor their own outputs.

Order bias: consistently favoring the first or last option shown.

Core Entities

Models

GPT-4, ChatGPT, InstructGPT, Llama2, Llama, Cohere, Falcon, Alpaca, Vicuna, OpenAssistant, Mistral, Olmo, Baize, Koala, WizardLM, MPT

Metrics

Rank-Biased Overlap (RBO), BERTScore, valid response rate, proportion of biased evaluations

Datasets

ELI5, BIG-Bench (strategyQA)

Benchmarks

COBBLER (this paper)