Use many LLM ‘reviewers’ plus one round of discussion to get fairer, cheaper human-aligned evaluations

August 3, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

17

Authors

Xinghua Zhang, Bowen Yu, Haiyang Yu, Yangyu Lv, Tingwen Liu, Fei Huang, Hongbo Xu, Yongbin Li

Why It Matters For Business

WideDeep can cut manual labeling time and cost by pre-labeling outputs with higher human agreement, so teams can scale human evaluation faster and cheaper while keeping quality checks.

Summary TLDR

The authors propose WideDeep: treat multiple frozen LLM calls as neurons with automatically generated evaluation 'roles', stack them into a 2-layer network (one discussion round), and aggregate via voting/averaging. On the new, diverse LLMEval 2 benchmark (2,553 pairwise samples, 15 tasks, 8 abilities), WideDeep outperforms prior single-layer ensembling (FairEval) by a few accuracy points and raises kappa agreement. In Chinese LLM labeling, WideDeep made human checking ~4.6× faster and lowered per-sample cost by ~60%, while reaching 74% labeling accuracy and high human agreement.

Problem Statement

LLMs can judge generated text, but single-layer ensembles of LLM evaluations are biased (position, verbosity) and unstable. Prior benchmarks for evaluating LLM evaluators are small or narrow. The paper asks whether making the LLM-evaluator network wider (more independent LLM 'neurons') and slightly deeper (one extra integration round) yields fairer, more stable human-aligned judgments.

Main Contribution

WideDeep: a two-layer, wide LLM evaluator that generates per-sample evaluation perspectives (neuron roles), runs multiple independent LLM evaluations, then integrates them in a second layer.
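
The two-layer flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `generate_roles`, `judge`, and `wide_deep` names, and the final majority vote over second-layer verdicts are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a text reply.
LLM = Callable[[str], str]

def generate_roles(llm: LLM, question: str, k: int) -> List[str]:
    """Ask the model for up to k evaluation perspectives (neuron roles)."""
    prompt = f"List {k} distinct perspectives for judging answers to:\n{question}"
    lines = [ln.strip("-• ").strip() for ln in llm(prompt).splitlines()]
    roles = [ln for ln in lines if ln][:k]
    return roles or ["overall quality"]  # fallback if the reply is empty

def judge(llm: LLM, role: str, question: str, a: str, b: str, notes: str = "") -> str:
    """One 'neuron': a single LLM call that returns '1', '2', or 'tie'."""
    prompt = (f"Perspective: {role}\nQuestion: {question}\n"
              f"Answer 1: {a}\nAnswer 2: {b}\n{notes}\n"
              "Reply with exactly one of: 1, 2, tie")
    reply = llm(prompt).strip().lower()
    return reply if reply in {"1", "2", "tie"} else "tie"

def wide_deep(llm: LLM, question: str, a: str, b: str, width: int = 3) -> str:
    roles = generate_roles(llm, question, width)
    # Layer 1: independent judgments, one per role.
    layer1 = [judge(llm, r, question, a, b) for r in roles]
    # Layer 2: each role re-judges after seeing all layer-1 verdicts.
    notes = "Earlier verdicts: " + ", ".join(f"{r}: {v}" for r, v in zip(roles, layer1))
    layer2 = [judge(llm, r, question, a, b, notes) for r in roles]
    # Aggregate by majority vote (ties resolve to the first-seen verdict).
    return Counter(layer2).most_common(1)[0][0]
```

Passing a stub function as `llm` makes the control flow testable before any API calls are paid for; a real client would replace the stub without changing the rest of the sketch.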

LLMEval 2: a new diverse evaluation benchmark with 2,553 pairwise samples covering 15 tasks and 8 evaluation abilities.

Empirical results showing WideDeep improves accuracy and kappa over single-layer baselines and reduces human labeling effort and cost in a Chinese LLM evaluation use case.

Key Findings

A two-layer wide LLM network (WideDeep) raises kappa agreement with human labels on LLMEval 2 compared with the FairEval baseline.

Numbers: kappa 0.2807 -> 0.3440 (Δ≈+0.0633) on LLMEval 2, Table 1

WideDeep improves accuracy over FairEval on multiple benchmarks.

Numbers: LLMEval 2 acc 0.5735 -> 0.6036 (+3.01 pts); PandaLM acc 0.7147 -> 0.7568 (+4.21 pts), Table 1

Using WideDeep reduced human checking time and cost in a Chinese LLM labeling pipeline.

Numbers: 4.6× faster; 60% lower average annotation cost; labeling accuracy 74%; 93% human agreement, Section 5.4 and Table 4

Generating distinct neuron roles helps performance.

Numbers: Removing neuron roles drops Acc by ~1.3–1.95 pts and Macro-F1 by up to ~5.8 pts (Table 3)

Network depth beyond two layers hurts performance on the tested data.

Numbers: Performance declines when l > 2 in ablations (Section 5.3)

Results

kappa (LLMEval 2)

Value: 0.3440 (WideDeep c*2 all)

Baseline: 0.2807 (FairEval)

Accuracy (LLMEval 2)

Value: 0.6036 (WideDeep c*2 all)

Baseline: 0.5735 (FairEval)

Accuracy (PandaLM)

Value: 0.7568 (WideDeep c*2 all)

Baseline: 0.7147 (FairEval)

Accuracy (Chinese LLM labeling)

Value: 0.74 (WideDeep)

Baseline: 0.67 (GPT-4 alone)

Who Should Care

What To Try In 7 Days

Run LLMEval2 mini (300 samples) to benchmark your current evaluator vs WideDeep.

Implement per-sample role-generation prompts and ensemble 5–15 gpt-3.5-turbo calls, then aggregate by voting.

Replace full cross-annotation with sample checks: measure what percentage of predictions humans still need to inspect, and estimate the resulting savings.
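
As a starting point for the voting and sample-check steps above, the sketch below aggregates a set of verdicts by majority vote and flags weak-consensus samples for human inspection. The `min_share` threshold and the idea of routing low-agreement samples to reviewers are assumptions for illustration, not rules taken from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def vote(verdicts: Sequence[str], min_share: float = 0.6) -> Tuple[str, bool]:
    """Return (winner, needs_review).

    needs_review is True when the winning verdict holds less than
    min_share of the votes (an assumed, tunable threshold).
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    winner, top = Counter(verdicts).most_common(1)[0]
    return winner, (top / len(verdicts)) < min_share

def inspection_rate(batches: List[Sequence[str]], min_share: float = 0.6) -> float:
    """Fraction of samples whose ensemble verdict would go to a human."""
    flagged = sum(vote(v, min_share)[1] for v in batches)
    return flagged / len(batches)

# Example: two samples, five ensemble verdicts each.
batches = [
    ["1", "1", "1", "1", "2"],    # strong consensus
    ["1", "2", "tie", "1", "2"],  # weak consensus, flagged
]
print(inspection_rate(batches))
```

Multiplying the inspection rate by your own per-sample review cost gives a first estimate of savings relative to checking every sample.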

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Performance gains are moderate and shown on the provided benchmarks; real gains depend on your tasks and LLM quality.
  • Deeper stacks (more than 2 layers) decreased performance in their tests.
  • Approach requires many LLM calls; costs rise with neuron count unless cheaper models are used.
  • Generated neuron roles and aggregation rules might need task-specific tuning.

When Not To Use

  • When LLM API costs prevent multiple independent calls per sample.
  • When ground-truth labels are objective single answers (not subjective preferences).
  • When the deployed LLMs are much weaker than those used in the paper.

Failure Modes

  • If neuron roles converge to similar perspectives, added width gives little benefit.
  • Position or verbosity bias may persist if order-robustness checks are not applied.
  • Poor prompt templates can produce noisy roles or inconsistent scores.
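
The position-bias failure mode listed above can be probed with a simple order-swap check. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns "1", "2", or "tie" relative to the order shown; treating a disagreement between the two orderings as a tie is one possible policy, chosen here for simplicity.

```python
from typing import Callable

# Hypothetical judge: returns "1", "2", or "tie" for the answers as presented.
Judge = Callable[[str, str, str], str]

def order_robust_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Judge both orderings; keep the verdict only if they agree."""
    forward = judge(question, a, b)
    backward = judge(question, b, a)
    # Map the swapped-order verdict back to the original labelling.
    remapped = {"1": "2", "2": "1", "tie": "tie"}.get(backward, "tie")
    return forward if forward == remapped else "tie"

# A judge that always prefers whichever answer is shown first.
position_biased = lambda q, x, y: "1"
# A judge that prefers the longer answer regardless of position.
length_based = lambda q, x, y: "1" if len(x) > len(y) else "2" if len(y) > len(x) else "tie"

print(order_robust_verdict(position_biased, "q", "short", "much longer"))
print(order_robust_verdict(length_based, "q", "short", "much longer"))
```

The position-biased judge contradicts itself across orderings and is neutralized to a tie, while the length-based judge stays consistent. That also shows the check catches position bias but not verbosity bias.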

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4
  • vicuna-13b
  • llama-7b
  • bloom-7b
  • cerebras-gpt-6.7b
  • opt-7b
  • pythia-6.9b

Metrics

  • Accuracy
  • Macro-F1
  • kappa correlation coefficient
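
For readers who want to reproduce these metrics on their own labels, here is a minimal pure-Python sketch. It assumes the kappa reported is Cohen's kappa between evaluator verdicts and human labels, which the summary does not state explicitly; the function names are illustrative.

```python
from collections import Counter
from typing import Sequence

def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

def cohen_kappa(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Observed agreement corrected for agreement expected by chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[lab] * pred_counts[lab] for lab in true_counts) / n ** 2
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

human = ["1", "2", "tie", "1"]
model = ["1", "2", "1", "1"]
print(accuracy(human, model), macro_f1(human, model), cohen_kappa(human, model))
```

Macro-F1 weights each verdict class equally, so it penalizes an evaluator that rarely predicts "tie" more than accuracy does.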

Datasets

  • LLMEval 2
  • LLMEval2 mini
  • FairEval
  • PandaLM

Benchmarks

  • LLMEval 2
  • FairEval
  • PandaLM