Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
9
Why It Matters For Business
In the paper's setup, pooling several smaller, diverse LLM judges cuts automatic evaluation cost by roughly 7–8x compared with a single expensive judge like GPT-4, while producing scores that align better with human judgments.
Summary TLDR
The paper proposes PoLL (Panel of LLM Evaluators): score model outputs by pooling judgments from multiple smaller, diverse LLMs instead of using a single large judge (e.g., GPT-4). Across single-hop QA, multi-hop QA, and Chatbot Arena Hard, PoLL correlates better with human judgments, reduces per-judge bias, and costs roughly 7–8x less than a single GPT-4 judge. Prompt wording still matters for individual judges (GPT-4 in particular is prompt-sensitive). The method is simple to implement and best used when you need scalable, lower-cost automatic evaluation with reduced evaluator bias.
Problem Statement
Automatic evaluation of open-ended LLM outputs is hard. People increasingly use a single large LLM (like GPT-4) as the judge, but that is costly and can be biased toward its own outputs. The paper asks: can a small, heterogeneous panel of evaluators match or exceed a single big judge while cutting cost and reducing bias?
Main Contribution
Propose PoLL: aggregate independent scores from a small, diverse set of evaluator LLMs via pooling (max or average).
Show PoLL correlates better with human judgments than a single GPT-4 judge across multiple QA and chatbot benchmarks.
Demonstrate large cost and variance benefits: PoLL is ~7–8x cheaper and has lower judge-score spread than single-model judging.
Show GPT-4 judge performance is sensitive to prompt design; a simple 'don't overthink' prompt improved agreement.
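The two pooling rules named above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names `majority_vote` and `average_pool` are hypothetical, and "max" pooling is read here as taking the most frequent label across judges (a majority vote), which is an assumption about the intended rule.

```python
from collections import Counter

def majority_vote(labels):
    """Pool categorical judgments by taking the most frequent label.

    Assumption: this is one reading of "max" pooling for binary verdicts.
    With an odd number of binary judges there can be no tie.
    """
    if not labels:
        raise ValueError("need at least one judgment")
    return Counter(labels).most_common(1)[0][0]

def average_pool(scores):
    """Pool numeric ratings (e.g. 1-5) by their arithmetic mean."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)

# Three hypothetical judges rate the same answer.
binary_verdict = majority_vote([True, False, True])
pooled_rating = average_pool([4, 5, 3])
```

Averaging is the natural choice when three judges on a 1–5 scale may all disagree, since no label would then hold a majority.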
Key Findings
PoLL achieves higher agreement with humans than single large judges on KILT single-hop QA.
PoLL ranks models closer to human leaderboard order on Chatbot Arena Hard.
PoLL is much cheaper than a single GPT-4 judge for the evaluated setup.
Pooling reduces evaluator variance and self-preference bias.
Individual judge performance is prompt-sensitive (GPT-4 example).
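To make "closer to human leaderboard order" concrete, one option is to compare two rankings with Kendall's tau. The sketch below implements the tie-free variant (tau-a) from its definition; the function name `kendall_tau` and the model names and rank positions are made up for illustration and are not results from the paper.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Assumes no tied ranks; returns a value in [-1, 1].
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    items = sorted(rank_a)
    if len(items) < 2:
        raise ValueError("need at least two items to compare")
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical leaderboards: the judge swaps the top two models.
human_rank = {"model_a": 1, "model_b": 2, "model_c": 3}
judge_rank = {"model_a": 2, "model_b": 1, "model_c": 3}
tau = kendall_tau(human_rank, judge_rank)
```

A value of 1 means the judge reproduces the human ordering exactly, and -1 means it reverses it.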
Results
Agreement with humans (Cohen's κ)
Rank correlation with human leaderboard
Judge score spread (stability)
Cost per evaluation (input+output token pricing)
Prompt sensitivity (GPT-4 kappa on NQ)
Arena Hard scores (PoLL, average pooling)
Who Should Care
What To Try In 7 Days
Run a 3-model PoLL (e.g., CommandR, Haiku, GPT-3.5) on a held-out validation set with human labels, then compare its Cohen's κ against those labels with the κ of your current single judge.
Use max pooling for binary correctness tasks and average pooling for 1–5 style ratings; test both.
Tune judge prompts: add few-shot examples and a short 'don't overthink' instruction to reduce over-reasoning.
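For the κ comparison in the first step, Cohen's kappa can be computed directly from two label lists. The following is a minimal sketch; the function name `cohens_kappa` and the toy labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and equally long")
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with their own
    # marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use the same single label
    return (observed - expected) / (1 - expected)

# Toy data: 1 = correct, 0 = incorrect.
human = [1, 1, 0, 0]
panel = [1, 1, 0, 0]
single_judge = [1, 0, 1, 0]
kappa_panel = cohens_kappa(human, panel)
kappa_single = cohens_kappa(human, single_judge)
```

Running the same function on the pooled panel verdicts and on the single judge's verdicts gives two numbers that can be compared directly.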
Optimization Features
Token Efficiency
- Prompt tuning of judges reduces unnecessary reasoning tokens
System Optimization
- Parallel inference across smaller models can be faster than a single large model
Inference Optimization
- Use smaller models in parallel to lower latency and cost
- Pool outputs instead of scaling a single judge
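Querying panel members concurrently might look like the sketch below. It assumes each judge is a plain Python callable taking `(question, reference, answer)` and returning a boolean verdict; the two stub judges stand in for real model API calls and are purely hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_panel(judges, question, reference, answer):
    """Call every judge concurrently and return verdicts in judge order."""
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        futures = [
            pool.submit(judge, question, reference, answer) for judge in judges
        ]
        # result() blocks until that judge finishes; order matches `judges`.
        return [future.result() for future in futures]

# Hypothetical stand-ins for real LLM judge calls.
def strict_stub(question, reference, answer):
    return reference.lower() == answer.lower().strip()

def lenient_stub(question, reference, answer):
    return reference.lower() in answer.lower()

votes = run_panel(
    [strict_stub, lenient_stub, lenient_stub],
    "Who wrote Hamlet?",
    "Shakespeare",
    "William Shakespeare",
)
```

Threads suit this case because real judge calls would mostly wait on network I/O; wall-clock time is then bounded by the slowest panel member rather than the sum of all calls.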
Reproducibility
Data Urls
- KILT
- Chatbot Arena (arena-hard repo referenced)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use a single three-judge panel drawn from a limited set of model families; whether the results generalize to other panels is untested.
- Evaluations cover only QA and Chatbot Arena Hard; results may not hold for math or deep reasoning tasks.
- Some judges' performance depends heavily on prompt design, so PoLL may need careful prompt engineering per task.
When Not To Use
- For scientific or math problems where the evaluator's own factual reasoning must be authoritative and tightly calibrated.
- If you cannot run multiple models in parallel due to infra limits.
- When legal or compliance constraints require a single auditable scorer.
Failure Modes
- Panel members share a blind spot (all mis-evaluate a class of errors), producing a confidently wrong pooled score.
- Self-preference persists if the panel contains several models from a single family.
- Poor prompt design for judges leads to arbitrary variance in scores.
Core Entities
Models
- GPT-4
- GPT-3.5
- CommandR (CMD-R)
- CommandR+ (CMD-R+)
- Haiku
- Sonnet
- Opus
- Mistral-LG
- Mistral-MD
Metrics
- Cohen's kappa
- Pearson correlation
- Kendall's tau
- Exact Match (containment EM)
- Elo/ranking correlation
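As an illustration of the containment EM metric listed above, the sketch below marks a response correct when any reference answer appears inside it after normalization. The normalization steps (lowercasing, stripping punctuation and English articles) follow a common QA convention and are an assumption here, not necessarily the paper's exact procedure; the function names are hypothetical.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def containment_em(prediction, references):
    """True if any normalized reference is a substring of the normalized prediction."""
    pred = normalize(prediction)
    normalized_refs = (normalize(ref) for ref in references)
    # Skip references that normalize to an empty string; they would match anything.
    return any(ref in pred for ref in normalized_refs if ref)

hit = containment_em("The capital is Paris.", ["Paris"])
miss = containment_em("It is Lyon.", ["Paris"])
```

Because the check is a substring match, it accepts long answers that merely mention the reference, which is one reason a string metric like this can disagree with human judgments.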
Datasets
- KILT-NaturalQuestions (NQ)
- TriviaQA (TQA)
- HotpotQA (HPQA)
- Bamboogle
- Chatbot Arena Hard
Benchmarks
- KILT
- Chatbot Arena

