Use many LLM ‘reviewers’ plus one round of discussion to get fairer, cheaper human-aligned evaluations

August 3, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent small-to-moderate gains on three benchmarks and a clear cost and time win in one Chinese LLM labeling case; however, the results depend on the chosen LLMs, prompt design, and dataset composition.

Citations: 17

Evidence Strength: 0.65

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, Yongbin Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

WideDeep can cut manual labeling time and cost by pre-labeling outputs with higher human agreement, so teams can scale human evaluation faster and cheaper while keeping quality checks.

Who Should Care

Summary TLDR

The authors propose WideDeep: treat multiple frozen LLM calls as neurons with automatically generated evaluation 'roles', stack them into a 2-layer network (one discussion round), and aggregate via voting/averaging. On a new, diverse LLMEval 2 benchmark (2,553 pairwise samples, 15 tasks, 8 abilities) WideDeep outperforms prior single-layer ensembling (FairEval) by a few accuracy points and raises kappa agreement. In Chinese LLM labeling, WideDeep cut human checking time ~4.6× and lowered per-sample cost ~60%, while reaching 74% labeling accuracy and high human agreement.
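The two-layer structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `stub_judge` is a hypothetical stand-in for a real LLM call, the role names are made up, and plain majority voting over both layers is an assumed aggregation rule.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge signature: (role, response_a, response_b, context) -> "A", "B", or "tie".
Judge = Callable[[str, str, str, str], str]


def majority_vote(votes: List[str]) -> str:
    """Return the most common vote; report 'tie' when the top two counts are equal."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


def wide_deep(judge: Judge, roles: List[str], a: str, b: str) -> str:
    """Two-layer evaluation: independent role-based verdicts, then one discussion round."""
    # Layer 1 (width): one independent verdict per role, with no shared context.
    layer1 = [judge(role, a, b, "") for role in roles]
    # Layer 2 (depth): each neuron judges again, seeing all layer-1 verdicts.
    context = "; ".join(f"{r}: {v}" for r, v in zip(roles, layer1))
    layer2 = [judge(role, a, b, context) for role in roles]
    # Aggregate every neuron from both layers by voting (an assumed rule).
    return majority_vote(layer1 + layer2)


def stub_judge(role: str, a: str, b: str, context: str) -> str:
    """Toy judge: 'conciseness' prefers the shorter response, other roles the longer one."""
    if role == "conciseness":
        return "A" if len(a) <= len(b) else "B"
    return "A" if len(a) >= len(b) else "B"


roles = ["accuracy", "coverage", "conciseness"]  # illustrative role names
verdict = wide_deep(stub_judge, roles, "Paris is the capital of France.", "Paris.")
print(verdict)
```

In a real setup the stub would be replaced by a prompted model call, and the roles would be generated per sample rather than fixed.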

Problem Statement

LLMs can judge generated text, but single-layer ensembles of LLM evaluations are biased (position, verbosity) and unstable. Prior benchmarks for evaluating LLM evaluators are small or narrow. The paper asks whether making the LLM-evaluator network wider (more independent LLM 'neurons') and slightly deeper (one extra integration round) yields fairer, more stable human-aligned judgments.

Main Contribution

WideDeep: a two-layer, wide LLM evaluator that generates per-sample evaluation perspectives (neuron roles), runs multiple independent LLM evaluations, then integrates them in a second layer.

LLMEval 2: a new diverse evaluation benchmark with 2,553 pairwise samples covering 15 tasks and 8 evaluation abilities.

Key Findings

A two-layer wide LLM network (WideDeep) raises kappa agreement with human labels on LLMEval 2 compared with the prior FairEval baseline.

Numbers: kappa 0.2807 -> 0.3440 (≈ +0.0633) on LLMEval 2, Table 1

Practical Use: If you need more agreement with humans on open-ended pairwise judgments, run a 2-layer WideDeep ensemble instead of a one-layer ensemble; expect modest but consistent kappa gains on similar benchmarks.

Evidence Ref: Table 1 (LLMEval 2 Kap.)
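For readers less familiar with kappa, the sketch below shows how chance-corrected agreement between evaluator verdicts and human labels can be computed. It assumes Cohen's kappa; the source only says "kappa", so the exact variant used in the paper is an assumption here, and the label lists are invented toy data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(x: Sequence[str], y: Sequence[str]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(x)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(x, y)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    cx, cy = Counter(x), Counter(y)
    p_e = sum(cx[k] * cy[k] for k in cx) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (not from the paper): human labels vs. evaluator verdicts.
human = ["A", "A", "B", "B", "tie", "A"]
model = ["A", "B", "B", "B", "tie", "A"]
kappa = cohen_kappa(human, model)
print(round(kappa, 4))
```

Because kappa subtracts chance agreement, it can be much lower than raw accuracy on the same data, which is why a kappa near 0.34 can coexist with accuracy near 0.60.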

WideDeep improves accuracy over FairEval on multiple benchmarks.

Numbers: LLMEval 2 acc 0.5735 -> 0.6036 (+3.01 pts); PandaLM acc 0.7147 -> 0.7568 (+4.21 pts), Table 1

Practical Use: Switching to WideDeep can raise ranking accuracy a few percentage points on evaluated tasks; useful for automated pre-labeling or quick A/B of model outputs.

Evidence Ref: Table 1 (Acc results)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
kappa (LLMEval 2) | 0.3440 (WideDeep c*2 all) | 0.2807 (FairEval) | +0.0633 | LLMEval 2 | Table 1 reports FairEval Kap.=0.2807, WideDeep c*2 (all)=0.3440 | Table 1
Accuracy | 0.6036 (WideDeep c*2 all) | 0.5735 (FairEval) | +0.0301 | LLMEval 2 | Table 1 shows accuracy rise from 0.5735 to 0.6036 | Table 1
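The deltas in the results above can be re-derived from the reported values. The short check below uses only those numbers; the relative-gain column is an added convenience and is not reported in the source.

```python
from typing import Dict, Tuple


def deltas(rows: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Map metric name -> (absolute delta, relative delta) from (baseline, value) pairs."""
    out = {}
    for name, (baseline, value) in rows.items():
        delta = value - baseline
        out[name] = (delta, delta / baseline)
    return out


# (FairEval baseline, WideDeep value) on LLMEval 2, as reported above.
reported = {
    "kappa": (0.2807, 0.3440),
    "accuracy": (0.5735, 0.6036),
}
result = deltas(reported)
for name, (absolute, relative) in result.items():
    print(f"{name}: delta={absolute:+.4f}, relative={relative:+.1%}")
```

The absolute accuracy gain is small, while the relative kappa gain is larger because the kappa baseline starts low.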

What To Try In 7 Days

Run LLMEval2 mini (300 samples) to benchmark your current evaluator vs WideDeep.

Implement per-sample role-generation prompts and ensemble 5–15 gpt-3.5-turbo calls, then aggregate by voting.

Replace full cross-annotation with sample checks: measure percent of predictions to inspect and estimate savings.
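The last step above, estimating savings from sample checks, can be sketched as a small cost model. Everything here is an assumption for illustration: the function name, the parameters, and the example prices are hypothetical, and the model ignores setup and review overheads.

```python
def estimate_savings(
    n_samples: int,
    human_cost: float,        # assumed cost of one human label
    annotators: int,          # humans per sample under full cross-annotation
    check_fraction: float,    # share of LLM pre-labels a human still inspects
    llm_cost: float,          # assumed total LLM cost per sample (all ensemble calls)
) -> dict:
    """Compare full cross-annotation with LLM pre-labeling plus human spot checks."""
    if not 0.0 <= check_fraction <= 1.0:
        raise ValueError("check_fraction must be between 0 and 1")
    full_cost = n_samples * human_cost * annotators
    if full_cost <= 0:
        raise ValueError("full annotation cost must be positive")
    # Every sample pays the LLM cost; only the checked share pays one human label.
    assisted_cost = n_samples * (llm_cost + check_fraction * human_cost)
    return {
        "full_cost": full_cost,
        "assisted_cost": assisted_cost,
        "saving_fraction": 1.0 - assisted_cost / full_cost,
    }


# Hypothetical numbers, not taken from the paper.
estimate = estimate_savings(
    n_samples=1000, human_cost=0.5, annotators=3, check_fraction=0.3, llm_cost=0.05
)
print(estimate)
```

Measuring `check_fraction` on a pilot batch, rather than guessing it, is what makes the estimate useful for your own data.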

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Performance gains are moderate and shown on the provided benchmarks; real gains depend on your tasks and LLM quality.

Deeper stacks (more than 2 layers) decreased performance in the authors' tests.

When Not To Use

When LLM API costs prevent multiple independent calls per sample.

When ground-truth labels are objective single answers (not subjective preferences).

Failure Modes

If neuron roles converge to similar perspectives, added width gives little benefit.

Position or verbosity bias may persist if order-robustness checks are not applied.
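One common order-robustness check is to judge each pair twice with the responses swapped and keep the verdict only when both runs agree. The sketch below illustrates that idea under stated assumptions: the judge functions are hypothetical stubs, and falling back to "tie" on disagreement is one possible policy, not necessarily the one used in the paper.

```python
from typing import Callable

# Hypothetical pairwise judge: (first_shown, second_shown) -> "A" (first), "B" (second), or "tie".
PairJudge = Callable[[str, str], str]

FLIP = {"A": "B", "B": "A", "tie": "tie"}


def order_robust_verdict(judge: PairJudge, a: str, b: str) -> str:
    """Judge (a, b) and (b, a); return the verdict only if it survives the swap."""
    forward = judge(a, b)          # a shown first
    backward = FLIP[judge(b, a)]   # b shown first, mapped back to a/b labels
    return forward if forward == backward else "tie"


def position_biased_judge(first: str, second: str) -> str:
    """Toy judge that always prefers whichever response is shown first."""
    return "A"


def length_judge(first: str, second: str) -> str:
    """Toy judge that prefers the longer response regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


biased = order_robust_verdict(position_biased_judge, "short", "a longer answer")
stable = order_robust_verdict(length_judge, "short", "a longer answer")
print(biased, stable)
```

The share of pairs that fall back to "tie" doubles as a rough measure of how position-sensitive a given judge is. This check does not address verbosity bias, as the length-based stub shows.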

Core Entities

Models

gpt-3.5-turbo, gpt-4, vicuna-13b, llama-7b, bloom-7b, cerebras-gpt-6.7b, opt-7b, pythia-6.9b

Metrics

Accuracy, Macro-F1, kappa correlation coefficient

Datasets

LLMEval 2, LLMEval2 mini, FairEval, PandaLM

Benchmarks

LLMEval 2, FairEval, PandaLM