Use many LLM ‘reviewers’ plus one round of discussion to get fairer, cheaper human-aligned evaluations

Overview

Decision Snapshot: Needs Validation

The method shows consistent small-to-moderate gains on three benchmarks and a clear cost and time win in one Chinese labeling case; however, the results depend on the chosen LLMs, prompt design, and dataset composition.

Citations: 17

Evidence Strength: 0.65

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, Yongbin Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WideDeep can cut manual labeling time and cost by pre-labeling outputs with higher human agreement, so teams can scale human evaluation faster and cheaper while keeping quality checks.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors propose WideDeep: treat multiple frozen LLM calls as neurons with automatically generated evaluation 'roles', stack them into a 2-layer network (one discussion round), and aggregate via voting/averaging. On a new, diverse LLMEval 2 benchmark (2,553 pairwise samples, 15 tasks, 8 abilities) WideDeep outperforms prior single-layer ensembling (FairEval) by a few accuracy points and raises kappa agreement. In Chinese LLM labeling, WideDeep cut human checking time ~4.6× and lowered per-sample cost ~60%, while reaching 74% labeling accuracy and high human agreement.

Problem Statement

LLMs can judge generated text, but single-layer ensembles of LLM evaluations are biased (position, verbosity) and unstable. Prior benchmarks for evaluating LLM evaluators are small or narrow. The paper asks whether making the LLM-evaluator network wider (more independent LLM 'neurons') and slightly deeper (one extra integration round) yields fairer, more stable human-aligned judgments.

Main Contribution

WideDeep: a two-layer, wide LLM evaluator that generates per-sample evaluation perspectives (neuron roles), runs multiple independent LLM evaluations, then integrates them in a second layer.

LLMEval 2: a new diverse evaluation benchmark with 2,553 pairwise samples covering 15 tasks and 8 evaluation abilities.
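The two-layer flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Judge` signature, the `stub_judge` rule, and the hand-written role list are all hypothetical stand-ins (in WideDeep the roles are generated per sample by an LLM and each judge is an LLM call), and the paper's actual prompts and aggregation details are not reproduced here.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge signature (an assumption, not from the paper):
# (role, question, answer_a, answer_b, prior_verdicts) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str, str, List[str]], str]

def majority(votes: List[str]) -> str:
    """Return the most common verdict; an exact top-two tie yields 'tie'."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]

def wide_deep_verdict(question: str, answer_a: str, answer_b: str,
                      roles: List[str], judge: Judge) -> str:
    # Layer 1: one independent evaluation per role, with no shared context.
    layer1 = [judge(role, question, answer_a, answer_b, []) for role in roles]
    # Layer 2: each neuron re-evaluates while seeing all layer-1 verdicts
    # (the single "discussion" round).
    layer2 = [judge(role, question, answer_a, answer_b, layer1) for role in roles]
    # Final aggregation by voting over the second layer.
    return majority(layer2)

def stub_judge(role: str, question: str, answer_a: str, answer_b: str,
               prior: List[str]) -> str:
    """Toy stand-in for an LLM call, used only to make the sketch runnable."""
    if prior:  # layer 2: side with the layer-1 majority
        return majority(prior)
    if role == "conciseness":  # layer 1: toy rule keyed on the role name
        return "A" if len(answer_a) <= len(answer_b) else "B"
    return "A" if len(answer_a) >= len(answer_b) else "B"

verdict = wide_deep_verdict(
    "Explain caching.", "long detailed answer", "short",
    ["accuracy", "coverage", "conciseness"], stub_judge,
)
```

With the toy rules above, two of the three layer-1 neurons prefer answer A, so the second layer and the final vote also settle on A. Swapping `stub_judge` for a real LLM wrapper would keep the same control flow.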

Key Findings

A two-layer wide LLM network (WideDeep) raises kappa agreement on LLMEval 2 compared with the prior FairEval baseline.

Numbers: kappa 0.2807 -> 0.3440 (Δ≈+0.0633) on LLMEval 2, Table 1

Practical Use: If you need more agreement with humans on open-ended pairwise judgments, run a 2-layer WideDeep ensemble instead of a one-layer ensemble; expect modest but consistent kappa gains on similar benchmarks.

Evidence Ref: Table 1 (LLMEval 2 Kap.)

WideDeep improves accuracy over FairEval on multiple benchmarks.

Numbers: LLMEval 2 acc 0.5735 -> 0.6036 (+3.01 pts); PandaLM acc 0.7147 -> 0.7568 (+4.21 pts), Table 1

Practical Use: Switching to WideDeep can raise ranking accuracy a few percentage points on evaluated tasks; useful for automated pre-labeling or quick A/B of model outputs.

Evidence Ref: Table 1 (Acc results)
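To reproduce this kind of comparison on your own labels, the two headline metrics can be computed as below. This is a sketch under an assumption: the summary only says "kappa", and the code uses Cohen's kappa between evaluator verdicts and human labels, which may differ from the exact variant in the paper. The label lists are made-up toy data.

```python
from collections import Counter
from typing import Sequence

def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of evaluator verdicts that match the human label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohen_kappa(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_observed = accuracy(pred, gold)
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    labels = set(pred_counts) | set(gold_counts)
    # Chance agreement from the two marginal label distributions.
    p_expected = sum(pred_counts[l] * gold_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:  # both sides used a single identical label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy data (hypothetical): human labels vs. evaluator verdicts.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
acc = accuracy(pred, gold)
kappa = cohen_kappa(pred, gold)
```

On the toy data, observed agreement is 3/4 and chance agreement is 1/2, so kappa comes out lower than raw accuracy. The reported kappa values (0.28 to 0.34) sit below the reported accuracies (0.57 to 0.60) in the same way.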

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
kappa (LLMEval 2)	0.3440 (WideDeep c*2 all)	0.2807 (FairEval)	+0.0633	LLMEval 2	Table 1 reports FairEval Kap.=0.2807, WideDeep c*2 (all)=0.3440	Table 1
Accuracy	0.6036 (WideDeep c*2 all)	0.5735 (FairEval)	+0.0301	LLMEval 2	Table 1 shows accuracy rise from 0.5735 to 0.6036	Table 1

What To Try In 7 Days

Run LLMEval2 mini (300 samples) to benchmark your current evaluator vs WideDeep.

Implement per-sample role-generation prompts and ensemble 5–15 gpt-3.5-turbo calls, then aggregate by voting.

Replace full cross-annotation with sample checks: measure percent of predictions to inspect and estimate savings.
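For the spot-check step, a rough savings estimate can be set up before committing to a pilot. The sketch below is illustrative only: every cost figure and the 20% inspection rate are hypothetical placeholders to be replaced with your own measurements, and the cost model (API cost on every sample plus human checking on a random subset) is an assumption rather than the paper's accounting.

```python
import random
from typing import List

def pick_samples_to_inspect(n_samples: int, check_fraction: float,
                            seed: int = 0) -> List[int]:
    """Choose a reproducible random subset of sample indices for human review."""
    k = round(n_samples * check_fraction)
    return sorted(random.Random(seed).sample(range(n_samples), k))

def estimated_savings(n_samples: int, full_label_cost: float,
                      api_cost: float, check_cost: float,
                      check_fraction: float) -> float:
    """Relative saving of pre-label + spot-check versus full human labeling."""
    baseline = n_samples * full_label_cost
    # Every sample pays the LLM cost; only the inspected share pays a human.
    with_prelabel = n_samples * api_cost + n_samples * check_fraction * check_cost
    return 1.0 - with_prelabel / baseline

# Hypothetical numbers: 1000 samples, 20% inspected by a human.
to_inspect = pick_samples_to_inspect(1000, 0.2)
saving = estimated_savings(1000, full_label_cost=1.0, api_cost=0.1,
                           check_cost=0.5, check_fraction=0.2)
```

With these placeholder inputs the pre-labeled workflow costs 200 units against a baseline of 1000. The real figure depends heavily on how many predictions must be inspected to reach your target quality, which is what the pilot should measure.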

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/WideDeep

Data URLs

https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/WideDeep

Risks & Boundaries

Limitations

Performance gains are moderate and shown on the provided benchmarks; real gains depend on your tasks and LLM quality.

Deeper stacks (more than 2 layers) decreased performance in their tests.

When Not To Use

When LLM API costs prevent multiple independent calls per sample.

When ground-truth labels are objective single answers (not subjective preferences).

Failure Modes

If neuron roles converge to similar perspectives, added width gives little benefit.

Position or verbosity bias may persist if order-robustness checks are not applied.
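One generic way to check for position bias is to query the judge in both answer orders and keep a verdict only when the two calls agree. The sketch below shows that swap check with hypothetical stub judges; it is a common pattern, not necessarily the calibration used in the paper or in FairEval.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "A" if the first answer wins, "B" if the second wins, or "tie".
PairJudge = Callable[[str, str, str], str]

def swap_label(verdict: str) -> str:
    """Map a verdict from the swapped presentation back to the original order."""
    return {"A": "B", "B": "A"}.get(verdict, verdict)

def order_robust_verdict(judge: PairJudge, question: str,
                         answer_a: str, answer_b: str) -> str:
    original = judge(question, answer_a, answer_b)
    swapped = swap_label(judge(question, answer_b, answer_a))
    # Disagreement between the two orders signals position sensitivity.
    return original if original == swapped else "tie"

def position_biased_judge(question: str, first: str, second: str) -> str:
    """Toy judge that always prefers whichever answer is shown first."""
    return "A"

def length_judge(question: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"

biased = order_robust_verdict(position_biased_judge, "q", "short", "a longer answer")
stable = order_robust_verdict(length_judge, "q", "short", "a longer answer")
```

The position-biased stub flips its effective choice when the order changes, so the check falls back to a tie, while the length-based stub gives the same winner both ways. Note that this catches position bias only; the length-based stub still illustrates an uncorrected verbosity bias.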

Core Entities

Models

gpt-3.5-turbo, gpt-4, vicuna-13b, llama-7b, bloom-7b, cerebras-gpt-6.7b, opt-7b, pythia-6.9b

Metrics

Accuracy, Macro-F1, kappa correlation coefficient

Datasets

LLMEval 2, LLMEval2 mini, FairEval, PandaLM

Benchmarks

LLMEval 2, FairEval, PandaLM

