Replace one big LLM judge with a panel of smaller, diverse LLMs to get cheaper, less biased, and more human-aligned evaluation

April 29, 2024 (8 min)

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

9

Authors

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis

Links

Abstract / PDF

Why It Matters For Business

You can cut automatic evaluation cost by ~7–8x and get evaluations that align better with humans by pooling several smaller, different LLMs instead of calling one expensive judge like GPT-4.

Summary TLDR

The paper proposes PoLL (Panel of LLM Evaluators): score model outputs by pooling judgments from multiple smaller, diverse LLMs instead of using a single large judge (e.g., GPT-4). Across single-hop QA, multi-hop QA, and Chatbot Arena hard prompts, PoLL correlates better with human judgments, reduces per-judge bias, and costs 7–8x less than a single GPT-4 judge. Prompt wording still matters for individual judges (GPT-4 is prompt-sensitive). The method is simple to implement and best used when you need scalable, lower-cost automatic evaluation with reduced evaluator bias.

Problem Statement

Automatic evaluation of open-ended LLM outputs is hard. People increasingly use a single large LLM (like GPT-4) as the judge, but that is costly and can be biased toward its own outputs. The paper asks: can a small, heterogeneous panel of evaluators match or exceed a single big judge while cutting cost and reducing bias?

Main Contribution

Propose PoLL: aggregate independent scores from a small, diverse set of evaluator LLMs via pooling (max or average).

Show PoLL correlates better with human judgments than a single GPT-4 judge across multiple QA and chatbot benchmarks.

Demonstrate large cost and variance benefits: PoLL is ~7–8x cheaper and has lower judge-score spread than single-model judging.

Show GPT-4 judge performance is sensitive to prompt design; a simple 'don't overthink' prompt improved agreement.
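The pooling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `max_vote` reads "max" as "the label with the most votes" (a literal maximum over binary scores would instead accept an answer if any single judge does), and the function names and sample votes are hypothetical.

```python
from collections import Counter
from statistics import mean


def max_vote(votes):
    """Return the label with the most votes (assumed reading of 'max' pooling).

    With an odd number of judges and binary labels there are no ties;
    otherwise ties resolve to the label that appeared first.
    """
    return Counter(votes).most_common(1)[0][0]


def average_pool(scores):
    """Return the arithmetic mean of the judges' numeric ratings."""
    return mean(scores)


# Hypothetical verdicts from a three-judge panel on one model answer.
panel_votes = [True, False, True]   # binary correctness judgments
panel_ratings = [4, 3, 5]           # 1-5 style ratings

binary_verdict = max_vote(panel_votes)
pooled_rating = average_pool(panel_ratings)
```

Each judge scores independently, so the pooling function is the only place where the panel's outputs interact.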

Key Findings

PoLL achieves higher agreement with humans than single large judges on KILT single-hop QA.

Numbers: Cohen's κ on NQ/TQA/HPQA: PoLL 0.763/0.906/0.867 vs GPT-4 0.627/0.841/0.83

PoLL ranks models closer to human leaderboard order on Chatbot Arena Hard.

Numbers: Pearson/Kendall: PoLL 0.917 / 0.778 vs GPT-4 0.817 / 0.667

PoLL is much cheaper than a single GPT-4 judge for the evaluated setup.

Numbers: PoLL list price per million tokens ≈ $1.25 input + $4.25 output vs GPT-4 Turbo ≈ $10 input + $30 output (7–8x cheaper overall)

Pooling reduces evaluator variance and self-preference bias.

Numbers: Judge-score spread SD: PoLL 2.2 vs GPT-3.5 6.1 (lower spread observed vs individual judges)

Individual judge performance is prompt-sensitive (GPT-4 example).

Numbers: GPT-4 κ on NQ varies: zero-shot 0.518 -> few-shot 0.627 -> 'don't overthink' 0.725
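For readers who want to reproduce the agreement metric on their own data, Cohen's κ can be computed with the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The sketch below is a plain-Python illustration; the function name and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: 1 = "correct", 0 = "incorrect" (made-up labels).
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohens_kappa(human, judge)
```

In this toy case the raters agree on 6 of 8 items (p_o = 0.75) while chance agreement is p_e = 34/64, which gives κ = 7/15, roughly 0.47.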

Results

Agreement with humans (Cohen's κ)

Value: PoLL NQ/TQA/HPQA: 0.763 / 0.906 / 0.867

Baseline: GPT-4 NQ/TQA/HPQA: 0.627 / 0.841 / 0.83

Rank correlation with human leaderboard

Value: PoLL Pearson/Kendall: 0.917 / 0.778

Baseline: GPT-4 Pearson/Kendall: 0.817 / 0.667

Judge score spread (stability)

Value: PoLL SD = 2.2

Baseline: GPT-3.5 SD = 6.1

Cost (list price per million input/output tokens)

Value: PoLL ≈ $1.25 input + $4.25 output

Baseline: GPT-4 Turbo ≈ $10 input + $30 output

Prompt sensitivity (GPT-4 kappa on NQ)

Value: Zero-shot 0.518 -> Few-shot 0.627 -> 'don't overthink' 0.725

Baseline: GPT-3.5 few-shot 0.726

Arena Hard scores (PoLL, average pooling)

Value: GPT-4-turbo 68.7; Sonnet 57.6; CMD-R+ 57.1; Haiku 55.9
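The 7–8x cost figure can be checked with simple arithmetic on the quoted prices. The sketch below assumes the dollar amounts are USD per million tokens and that the PoLL figure is already the combined price of the whole panel; the token counts for the example workload are hypothetical.

```python
# Quoted prices, assumed to be USD per one million tokens.
POLL_PRICE = {"input": 1.25, "output": 4.25}        # whole panel combined
GPT4_TURBO_PRICE = {"input": 10.00, "output": 30.00}


def cost_usd(price, input_tokens, output_tokens):
    """Cost of a workload given per-million-token input and output prices."""
    return (price["input"] * input_tokens + price["output"] * output_tokens) / 1_000_000


# The overall ratio depends on the input/output mix and lies between these bounds.
input_ratio = GPT4_TURBO_PRICE["input"] / POLL_PRICE["input"]     # 8.0
output_ratio = GPT4_TURBO_PRICE["output"] / POLL_PRICE["output"]  # about 7.06

# Hypothetical evaluation run: 2M prompt tokens, 200k generated tokens.
poll_cost = cost_usd(POLL_PRICE, 2_000_000, 200_000)
gpt4_cost = cost_usd(GPT4_TURBO_PRICE, 2_000_000, 200_000)
overall_ratio = gpt4_cost / poll_cost
```

Because judge prompts are usually much longer than judge outputs, a typical workload sits nearer the input-side ratio of 8x than the output-side ratio of about 7x.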

Who Should Care

What To Try In 7 Days

Run a 3-model PoLL (e.g., CommandR, Haiku, GPT-3.5) on a held-out validation set and compare Cohen's κ vs your current single-judge results.

Use max pooling for binary correctness tasks and average pooling for 1–5 style ratings; test both.

Tune judge prompts: add few-shot examples and a short 'don't overthink' instruction to reduce over-reasoning.
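A judge prompt that combines few-shot examples with a short 'don't overthink' instruction might look like the sketch below. The wording, the example questions, and the function name are illustrative assumptions; the exact prompts used in the paper are not reproduced here.

```python
# Hypothetical few-shot examples; replace with cases from your own task.
FEW_SHOT = [
    {"question": "What is the capital of France?", "reference": "Paris",
     "answer": "It is Paris.", "verdict": "correct"},
    {"question": "Who wrote Hamlet?", "reference": "William Shakespeare",
     "answer": "Christopher Marlowe", "verdict": "incorrect"},
]


def build_judge_prompt(question, reference, answer, examples=FEW_SHOT):
    """Assemble a binary-correctness judge prompt with few-shot examples."""
    lines = [
        "Decide whether the answer matches the reference.",
        "Reply with exactly one word: correct or incorrect.",
        "Don't overthink it.",  # illustrative wording, not the paper's prompt
        "",
    ]
    for ex in examples:
        lines += [
            f"Question: {ex['question']}",
            f"Reference: {ex['reference']}",
            f"Answer: {ex['answer']}",
            f"Verdict: {ex['verdict']}",
            "",
        ]
    # Leave the final verdict open for the judge model to complete.
    lines += [f"Question: {question}", f"Reference: {reference}",
              f"Answer: {answer}", "Verdict:"]
    return "\n".join(lines)


prompt = build_judge_prompt("Which planet is known as the Red Planet?", "Mars", "Mars")
```

Sending the same prompt to every panel member keeps the comparison between judges clean, though per-judge prompt tuning may still be worth testing.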

Optimization Features

Token Efficiency

  • Prompt tuning of judges reduces unnecessary reasoning tokens

System Optimization

  • Parallel inference across smaller models can be faster than a single large model

Inference Optimization

  • Use smaller models in parallel to lower latency and cost
  • Pool outputs instead of scaling a single judge
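Running the panel in parallel can be sketched with Python's standard `concurrent.futures` module. The judge functions below are hypothetical stubs standing in for real API clients; in practice each would send the prompt to a different model and parse its reply.

```python
from concurrent.futures import ThreadPoolExecutor


def run_panel(judges, prompt):
    """Call every judge on the same prompt concurrently; return verdicts by name."""
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        futures = {name: pool.submit(judge, prompt) for name, judge in judges.items()}
        # result() blocks until that judge finishes and re-raises its exceptions.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for three different model clients.
def judge_a(prompt):
    return "correct"


def judge_b(prompt):
    return "incorrect"


def judge_c(prompt):
    return "correct"


stub_judges = {"judge_a": judge_a, "judge_b": judge_b, "judge_c": judge_c}
verdicts = run_panel(stub_judges, "Question: Which planet is known as the Red Planet?")
```

Threads suit this workload because the calls are network-bound, so wall-clock latency is set by the slowest judge and not by the sum of all three.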

Reproducibility

Data Urls

  • KILT
  • Chatbot Arena (arena-hard repo referenced)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use a limited panel composition (three judges) and model families; generality to other panels is unproven.
  • Evaluations focus on QA and Chatbot Arena hard prompts; results may not hold for math or deep reasoning tasks.
  • Some judge performance depends heavily on prompt design, so PoLL needs careful prompt engineering per task.

When Not To Use

  • For tightly calibrated scientific or math problems where evaluator factual reasoning must be authoritative.
  • If you cannot run multiple models in parallel due to infra limits.
  • When legal or compliance constraints require a single auditable scorer.

Failure Modes

  • Panel members share a blind spot (all mis-evaluate a class of errors), producing a confidently wrong pooled score.
  • Self-preference persists if panel contains many models from a single family.
  • Poor prompt design for judges leads to arbitrary variance in scores.
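One cheap guard against the single-family failure mode is to check panel composition before running an evaluation. The helper below is a hypothetical sketch: the function name, the one-model-per-family threshold, and the second example panel are assumptions for illustration, not part of the paper.

```python
from collections import Counter


def overrepresented_families(panel, max_per_family=1):
    """Return model families that contribute more judges than allowed."""
    counts = Counter(panel.values())
    return sorted(family for family, count in counts.items() if count > max_per_family)


# Judge name -> model family (vendor).
diverse_panel = {"CommandR": "Cohere", "Haiku": "Anthropic", "GPT-3.5": "OpenAI"}
skewed_panel = {"Haiku": "Anthropic", "Sonnet": "Anthropic", "GPT-3.5": "OpenAI"}

diverse_flags = overrepresented_families(diverse_panel)
skewed_flags = overrepresented_families(skewed_panel)
```

A check like this cannot detect a blind spot shared across families; that still requires spot-checking pooled verdicts against human labels.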

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • CommandR (CMD-R)
  • CommandR+ (CMD-R+)
  • Haiku
  • Sonnet
  • Opus
  • Mistral-LG
  • Mistral-MD

Metrics

  • Cohen's kappa
  • Pearson correlation
  • Kendall's tau
  • Exact Match (containment EM)
  • Elo/ranking correlation

Datasets

  • KILT-NaturalQuestions (NQ)
  • TriviaQA (TQA)
  • HotpotQA (HPQA)
  • Bamboogle
  • Chatbot Arena Hard

Benchmarks

  • KILT
  • Chatbot Arena