Overview
The benchmark is large and systematic, but results focus on QA prompts and use specific prompting templates, so applicability beyond the studied setup requires extra validation.
Citations: 24
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 30%
Novelty: 45%
Why It Matters For Business
Using LLMs as automatic scorers risks amplifying biases and diverging from human judgments, which can corrupt leaderboards, model selection, or downstream data labeling.
Summary TLDR
The authors introduce COBBLER, a benchmark that tests six cognitive biases when LLMs act as pairwise evaluators on 50 QA prompts. They run 16 models (3B–175B+) across ~630k pairwise comparisons and human studies. Key findings: LLM evaluators show biased choices in many comparisons (≈40% overall), bandwagon and distraction prompts strongly shift model judgments (>70% for many models), and average agreement with human rankings is low (RBO ≈ 0.44). The paper concludes LLMs are not yet reliable replacements for human annotators.
Problem Statement
LLMs are increasingly used to judge text quality, but they may amplify human-like cognitive biases and produce unreliable rankings. The paper asks two questions: how biased are LLMs when used as automatic evaluators, and how well do their rankings match human rankings?
Main Contribution
COBBLER: a bias benchmark testing six cognitive biases for LLMs used as pairwise evaluators in QA.
Large-scale evaluation: 16 popular LLMs (3B–175B+) on 50 QA prompts, producing ~630k pairwise evaluation samples.
Key Findings
LLMs show biased evaluation choices in a large fraction of comparisons (≈40% on average across models).
Average agreement between LLM rankings and human rankings is low (RBO ≈ 0.44, versus 0.54 for human–human agreement).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Proportion of pairwise comparisons labeled biased (average across models) | ≈40% | RANDOM threshold per-bias | — | 50 QA instructions (ELI5 + strategyQA) | Abstract; Sec.1; Fig.2; Table2 | Abstract; Table2 |
| Average human–model agreement (RBO) | 0.44 | human–human average RBO = 0.54 | -0.10 vs human–human | N=13 ranking over 50 instructions | Sec.5.2; Fig.3; Table5 | Sec.5.2 |
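The RBO (rank-biased overlap) figures in the table compare two ranked lists, weighting agreement near the top more heavily. The following is a minimal sketch of the extrapolated RBO form for two equal-length rankings without ties. The function name `rbo_ext` and the persistence value `p=0.9` are assumptions chosen for illustration; the summary above does not state which `p` or which RBO variant the paper uses.

```python
from typing import Hashable, Sequence


def rbo_ext(ranking_a: Sequence[Hashable],
            ranking_b: Sequence[Hashable],
            p: float = 0.9) -> float:
    """Extrapolated rank-biased overlap for two equal-length, tie-free rankings.

    p is the persistence parameter (assumed value): smaller p puts more
    weight on the top of the lists.
    """
    if len(ranking_a) != len(ranking_b) or len(ranking_a) == 0:
        raise ValueError("rankings must be non-empty and of equal length")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    k = len(ranking_a)
    seen_a, seen_b = set(), set()
    overlap = 0          # size of the intersection of the two top-d prefixes
    weighted_sum = 0.0   # running sum of (overlap / d) * p**d

    for d in range(1, k + 1):
        a, b = ranking_a[d - 1], ranking_b[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item may match something already seen in the other list.
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        weighted_sum += (overlap / d) * p ** d

    # Extrapolate the agreement at depth k to the unseen tail.
    return (overlap / k) * p ** k + ((1.0 - p) / p) * weighted_sum


# Hypothetical model names, for illustration only.
human = ["model_a", "model_b", "model_c"]
llm = ["model_c", "model_b", "model_a"]
print(round(rbo_ext(human, llm), 3))
```

Identical rankings score 1.0 and fully disjoint rankings score 0.0 under this form, so the reported gap of 0.44 versus 0.54 can be read on that scale.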
What To Try In 7 Days
Run COBBLER or a subset on your planned LLM-evaluator to measure order, bandwagon, and distraction biases.
Add a quick human spot-check: sample ~100 LLM judgments, collect human rankings for the same items, and compute RBO between the two.
Remove social-statistic-like text and irrelevant context from evaluation prompts and re-run a small test.
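One way to approach the order-bias check from the steps above is to present each pair twice with the positions swapped and count how often the evaluator picks the same position both times. This is a sketch under stated assumptions, not the benchmark's own implementation: the `judge` callable, its `"first"`/`"second"` return convention, and the function name `order_bias_rates` are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical judge signature: (instruction, first_response, second_response)
# returns "first" or "second". Any other value counts as an invalid evaluation.
Judge = Callable[[str, str, str], str]
VALID = ("first", "second")


def order_bias_rates(judge: Judge,
                     items: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Run each pair in both orders and report positional-preference rates."""
    counts = {"first": 0, "last": 0, "consistent": 0}
    valid = 0
    for instruction, resp_x, resp_y in items:
        original = judge(instruction, resp_x, resp_y)
        swapped = judge(instruction, resp_y, resp_x)
        if original not in VALID or swapped not in VALID:
            continue  # skip pairs with an invalid evaluation
        valid += 1
        if original == "first" and swapped == "first":
            counts["first"] += 1       # picked position 1 in both orders
        elif original == "second" and swapped == "second":
            counts["last"] += 1        # picked position 2 in both orders
        else:
            counts["consistent"] += 1  # same response preferred both times
    rates = {key: (value / valid if valid else 0.0)
             for key, value in counts.items()}
    rates["valid_pairs"] = float(valid)
    return rates


# Toy judge that always prefers the first position (maximally order-biased).
always_first = lambda instruction, first, second: "first"
pairs = [("Why is the sky blue?", "answer one", "answer two")]
print(order_bias_rates(always_first, pairs))
```

Skipping invalid outputs mirrors the limitation noted for the study, where conclusions apply to valid evaluations only; reporting `valid_pairs` alongside the rates makes that denominator visible.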
Risks & Boundaries
Limitations
Study focuses on QA prompts (ELI5 and strategyQA); results may differ on other tasks.
Some models produced many invalid evaluations; conclusions apply to valid outputs only.
When Not To Use
As the sole evaluator for high-stakes or production labeling tasks without human oversight.
For tasks outside the QA-style prompts used in the paper without re-validating biases.
Failure Modes
Self-preference (egocentric): models favor their own outputs.
Order bias: models consistently favor the first or the last option shown.
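The self-preference failure mode can be estimated from logged pairwise judgments by checking how often an evaluator picks its own output when that output is one of the two candidates. The sketch below assumes a hypothetical log record (`Comparison`) with evaluator, candidate, and winner fields; none of these names come from the paper.

```python
from typing import Dict, Iterable, NamedTuple


class Comparison(NamedTuple):
    """Hypothetical log record for one pairwise judgment."""
    evaluator: str    # model acting as the judge
    candidate_a: str  # model that wrote the first response
    candidate_b: str  # model that wrote the second response
    winner: str       # model whose response the judge preferred


def self_preference_rates(comparisons: Iterable[Comparison]) -> Dict[str, float]:
    """Per evaluator: share of self-involving comparisons it awarded to itself."""
    own_wins: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for c in comparisons:
        candidates = (c.candidate_a, c.candidate_b)
        if c.evaluator not in candidates:
            continue  # evaluator's own output is not in this pair
        if c.winner not in candidates:
            continue  # invalid evaluation, excluded from the rate
        totals[c.evaluator] = totals.get(c.evaluator, 0) + 1
        if c.winner == c.evaluator:
            own_wins[c.evaluator] = own_wins.get(c.evaluator, 0) + 1
    return {model: own_wins.get(model, 0) / n for model, n in totals.items()}


# Hypothetical model names, for illustration only.
log = [
    Comparison("model_a", "model_a", "model_b", "model_a"),
    Comparison("model_a", "model_c", "model_a", "model_c"),
    Comparison("model_b", "model_a", "model_c", "model_a"),
]
print(self_preference_rates(log))
```

A raw rate like this only suggests bias once it is compared against a reference, such as how often other evaluators prefer the same model's outputs, since a strong model may legitimately win most of its comparisons.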

