Replace one big LLM judge with a panel of smaller, diverse LLMs to get cheaper, less biased, and more human-aligned evaluation

April 29, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

Empirical results across multiple QA and chat benchmarks show PoLL improves human alignment and lowers cost, but experiments are limited to a few panel choices and task types.

Citations: 9

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut automatic evaluation cost by ~7–8x and get evaluations that align better with humans by pooling several smaller, different LLMs instead of calling one expensive judge like GPT-4.

Who Should Care

Summary TLDR

The paper proposes PoLL (Panel of LLM Evaluators): score model outputs by pooling judgments from multiple smaller, diverse LLMs instead of using a single large judge (e.g., GPT-4). Across single-hop QA, multi-hop QA, and Chatbot Arena hard prompts, PoLL correlates better with human judgments, reduces per-judge bias, and costs 7–8x less than a single GPT-4 judge. Prompt wording still matters for individual judges (GPT-4 is prompt-sensitive). The method is simple to implement and best used when you need scalable, lower-cost automatic evaluation with reduced evaluator bias.

Problem Statement

Automatic evaluation of open-ended LLM outputs is hard. People increasingly use a single large LLM (like GPT-4) as the judge, but that is costly and can be biased toward its own outputs. The paper asks: can a small, heterogeneous panel of evaluators match or exceed a single big judge while cutting cost and reducing bias?

Main Contribution

Propose PoLL: aggregate independent scores from a small, diverse set of evaluator LLMs via pooling (max or average).

Show PoLL correlates better with human judgments than a single GPT-4 judge across multiple QA and chatbot benchmarks.
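
The pooling step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `pool_scores` and the example scores are hypothetical, and "max" is taken literally as the maximum judge score, which is an assumption about how the pooling is defined.

```python
from statistics import mean


def pool_scores(scores, method="average"):
    """Combine one score per judge into a single panel score.

    `method` is "max" or "average", mirroring the two pooling options
    named in the summary. Treating "max" as the plain maximum is an
    assumption made for this sketch.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    if method == "max":
        return max(scores)
    if method == "average":
        return mean(scores)
    raise ValueError(f"unknown pooling method: {method}")


# Made-up verdicts from three hypothetical judges (1 = correct, 0 = incorrect).
binary_verdicts = [1, 0, 1]
# Made-up 1-5 ratings from the same three hypothetical judges.
ratings = [4, 5, 3]

print(pool_scores(binary_verdicts, "max"))
print(pool_scores(ratings, "average"))
```

Each judge is scored independently before pooling, so the panel score never depends on one judge seeing another's output.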

Key Findings

PoLL achieves higher agreement with humans than single large judges on KILT single-hop QA.

Numbers: Cohen's κ on NQ/TQA/HPQA: PoLL 0.763/0.906/0.867 vs GPT-4 0.627/0.841/0.83

Practical Use: Use a small panel of diverse LLMs to score QA outputs for closer alignment to human labels, instead of relying on one big judge.

Evidence Ref: Table 1

PoLL ranks models closer to human leaderboard order on Chatbot Arena Hard.

Numbers: Pearson/Kendall: PoLL 0.917 / 0.778 vs GPT-4 0.817 / 0.667

Practical Use: For model-to-model ranking tasks, aggregate multiple smaller judges to get rankings that better match human evaluations.

Evidence Ref: Table 2
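
For the ranking comparison, the two correlation measures can be computed as in the sketch below. It is a plain-Python illustration under stated assumptions: the scores are made up, and the Kendall variant shown is tau-a without tie handling, which may differ from the variant used in the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up scores for four hypothetical models: human leaderboard vs. judge.
human_scores = [1200, 1150, 1100, 1050]
judge_scores = [0.70, 0.72, 0.55, 0.40]

print(pearson(human_scores, judge_scores))
print(kendall_tau(human_scores, judge_scores))
```

Pearson compares the score values themselves, while Kendall only looks at pairwise ordering, so a judge can do well on one and worse on the other.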

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Agreement with humans (Cohen's κ) | PoLL NQ/TQA/HPQA: 0.763 / 0.906 / 0.867 | GPT-4 NQ/TQA/HPQA: 0.627 / 0.841 / 0.83 | PoLL +0.136 / +0.065 / +0.037 | KILT single-hop QA | Table 1; Section 4.1 | Table 1 |
| Rank correlation with human leaderboard | PoLL Pearson/Kendall: 0.917 / 0.778 | GPT-4 Pearson/Kendall: 0.817 / 0.667 | +0.100 / +0.111 | Chatbot Arena Hard | Table 2; Section 4.2 | Table 2 |
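
The Delta values above are the PoLL value minus the GPT-4 baseline. The short check below recomputes them from the reported numbers; the helper name `deltas` is hypothetical.

```python
# Values copied from the results above (PoLL vs. GPT-4).
poll_kappa = {"NQ": 0.763, "TQA": 0.906, "HPQA": 0.867}
gpt4_kappa = {"NQ": 0.627, "TQA": 0.841, "HPQA": 0.83}
poll_rank = {"pearson": 0.917, "kendall": 0.778}
gpt4_rank = {"pearson": 0.817, "kendall": 0.667}


def deltas(value, baseline, ndigits=3):
    """Return value - baseline per key, rounded to the table's precision."""
    return {key: round(value[key] - baseline[key], ndigits) for key in value}


kappa_delta = deltas(poll_kappa, gpt4_kappa)
rank_delta = deltas(poll_rank, gpt4_rank)

print(kappa_delta)
print(rank_delta)
```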

What To Try In 7 Days

Run a 3-model PoLL (e.g., CommandR, Haiku, GPT-3.5) on a held-out validation set and compare Cohen's κ vs your current single-judge results.

Use max pooling for binary correctness tasks and average pooling for 1–5 style ratings; test both.

Tune judge prompts: add few-shot examples and a short 'don't overthink' instruction to reduce over-reasoning.
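
A first pilot along these lines could look like the sketch below. It assumes judge verdicts have already been collected as binary labels; every label here is made up for illustration, and the literal-maximum reading of max pooling is an assumption rather than the paper's confirmed definition.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


def max_pool(verdicts_per_judge):
    """Per item, take the maximum verdict across judges (assumed pooling rule)."""
    return [max(item_verdicts) for item_verdicts in zip(*verdicts_per_judge)]


# Made-up validation set: 8 items, 1 = correct, 0 = incorrect.
human = [1, 1, 0, 0, 1, 0, 1, 0]
# Made-up verdicts from three hypothetical small judges.
judge_a = [1, 0, 0, 0, 1, 0, 1, 0]
judge_b = [1, 1, 0, 1, 0, 0, 1, 0]
judge_c = [0, 1, 0, 0, 1, 0, 1, 0]
# Made-up verdicts from a hypothetical single large judge.
single_judge = [1, 0, 0, 1, 1, 0, 1, 0]

panel_labels = max_pool([judge_a, judge_b, judge_c])
print("panel kappa: ", cohen_kappa(human, panel_labels))
print("single kappa:", cohen_kappa(human, single_judge))
```

In this toy data the max-pooled panel inherits one false positive from `judge_b`. That shows why it is worth testing both pooling rules: a literal maximum marks an answer correct as soon as any one judge accepts it.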

Optimization Features

Token Efficiency
Prompt tuning of judges reduces unnecessary reasoning tokens
System Optimization
Parallel inference across smaller models can be faster than a single large model
Inference Optimization
Use smaller models in parallel to lower latency and cost
Pool outputs instead of scaling a single judge
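
Running the judges concurrently can be sketched with the standard library. The judges below are hypothetical stubs that return fixed verdicts; in a real setup each would wrap a call to a model API, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def make_stub_judge(name, verdict):
    """Build a hypothetical judge; a real one would call a model API here."""
    def judge(question, answer, reference):
        return verdict
    judge.__name__ = name
    return judge


def run_panel(judges, question, answer, reference):
    """Query every judge concurrently and return verdicts in panel order."""
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        futures = [pool.submit(j, question, answer, reference) for j in judges]
        return [f.result() for f in futures]


panel = [
    make_stub_judge("judge_a", 1),
    make_stub_judge("judge_b", 0),
    make_stub_judge("judge_c", 1),
]
verdicts = run_panel(panel, "Who wrote Hamlet?", "Shakespeare", "William Shakespeare")
pooled = max(verdicts)  # literal max pooling, an assumption of this sketch
print(verdicts, pooled)
```

Threads fit this workload because each judge call mostly waits on network I/O, so the panel's wall-clock time is roughly that of its slowest member, not the sum of all calls.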

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

KILT

Chatbot Arena (arena-hard repo referenced)

Risks & Boundaries

Limitations

Experiments use a limited panel composition (three judges) and model families; generality to other panels is unproven.

Evaluations focus on QA and Chatbot Arena hard prompts; results may not hold for math or deep reasoning tasks.

When Not To Use

For tightly calibrated scientific or math evaluation, where the evaluator's own factual reasoning must be authoritative.

If you cannot run multiple models in parallel due to infra limits.

Failure Modes

Panel members share a blind spot (all mis-evaluate a class of errors), producing a confidently wrong pooled score.

Self-preference persists if the panel contains many models from a single family.
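
One cheap guard against the single-family failure mode is to check panel composition before running it. The sketch below is illustrative only: the family mapping is written by hand for the models named in this summary, and the one-model-per-family threshold is an assumption, not a rule from the paper.

```python
from collections import Counter

# Hand-written, illustrative mapping from model name to provider family.
MODEL_FAMILY = {
    "GPT-4": "OpenAI",
    "GPT-3.5": "OpenAI",
    "CommandR": "Cohere",
    "CommandR+": "Cohere",
    "Haiku": "Anthropic",
    "Sonnet": "Anthropic",
    "Opus": "Anthropic",
    "Mistral-LG": "Mistral",
    "Mistral-MD": "Mistral",
}


def overrepresented_families(panel, max_per_family=1):
    """Return families with more than `max_per_family` models in the panel."""
    counts = Counter(MODEL_FAMILY[model] for model in panel)
    return {family: n for family, n in counts.items() if n > max_per_family}


diverse_panel = ["CommandR", "Haiku", "GPT-3.5"]
narrow_panel = ["Haiku", "Sonnet", "Opus"]

print(overrepresented_families(diverse_panel))
print(overrepresented_families(narrow_panel))
```

This check does not catch a blind spot shared across families; that still needs periodic human spot checks of pooled scores.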

Core Entities

Models

GPT-4, GPT-3.5, CommandR (CMD-R), CommandR+ (CMD-R+), Haiku, Sonnet, Opus, Mistral-LG, Mistral-MD

Metrics

Cohen's kappa, Pearson correlation, Kendall Tau, Exact Match (containment EM), Elo/ranking correlation

Datasets

KILT-NaturalQuestions (NQ), TriviaQA (TQA), HotpotQA (HPQA), Bamboogle, Chatbot Arena Hard

Benchmarks

KILT, Chatbot Arena