Overview
The method shows consistent small-to-moderate gains on three benchmarks and a clear cost and time win in one Chinese labeling case. However, the results depend on the chosen LLMs, the prompt design, and the dataset composition.
Citations: 17
Evidence Strength: 0.65
Confidence: 0.78
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
WideDeep can cut manual labeling time and cost by pre-labeling outputs with higher human agreement, so teams can scale human evaluation faster and cheaper while keeping quality checks.
Who Should Care
Summary TLDR
The authors propose WideDeep: treat multiple frozen LLM calls as neurons with automatically generated evaluation 'roles', stack them into a 2-layer network (one discussion round), and aggregate the outputs by voting or averaging. On the new, diverse LLMEval 2 benchmark (2,553 pairwise samples, 15 tasks, 8 abilities), WideDeep outperforms prior single-layer ensembling (FairEval) by a few accuracy points and raises kappa agreement. In Chinese LLM labeling, WideDeep cut human checking time by ~4.6× and lowered per-sample cost by ~60%, while reaching 74% labeling accuracy and high human agreement.
Problem Statement
LLMs can judge generated text, but single-layer ensembles of LLM evaluations are biased (position, verbosity) and unstable. Prior benchmarks for evaluating LLM evaluators are small or narrow. The paper asks whether making the LLM-evaluator network wider (more independent LLM 'neurons') and slightly deeper (one extra integration round) yields fairer, more stable human-aligned judgments.
Main Contribution
WideDeep: a two-layer, wide LLM evaluator that generates per-sample evaluation perspectives (neuron roles), runs multiple independent LLM evaluations, then integrates them in a second layer.
LLMEval 2: a new diverse evaluation benchmark with 2,553 pairwise samples covering 15 tasks and 8 evaluation abilities.
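The two-layer structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the prompt wording, the reuse of the same roles in the second layer, and the `parse_verdict` heuristic are all assumptions made for the example. Any function that maps a prompt string to a reply string can be plugged in as `llm`.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


def generate_roles(llm: LLM, question: str, width: int) -> List[str]:
    """Ask the LLM for per-sample evaluation perspectives (neuron roles)."""
    reply = llm(f"List {width} distinct evaluation perspectives, one per line, for: {question}")
    roles = [line.strip() for line in reply.splitlines() if line.strip()]
    return roles[:width]


def judge_prompt(role: str, question: str, a: str, b: str, context: str = "") -> str:
    """Build one neuron's prompt (wording is an assumption, not the paper's)."""
    prior = f"Earlier judgments:\n{context}\n" if context else ""
    return (f"Judge the answers with a focus on: {role}\n{prior}"
            f"Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
            "Reply with A, B, or tie, then a short reason.")


def parse_verdict(reply: str) -> str:
    """Read the leading token of a reply; anything unclear counts as a tie."""
    words = reply.strip().split()
    head = words[0].strip(".,:").lower() if words else ""
    return {"a": "A", "b": "B"}.get(head, "tie")


def majority(verdicts: List[str]) -> str:
    """Majority vote; an empty list or a tied count resolves to 'tie'."""
    counts = Counter(verdicts).most_common()
    if not counts or (len(counts) > 1 and counts[0][1] == counts[1][1]):
        return "tie"
    return counts[0][0]


def widedeep_judge(llm: LLM, question: str, a: str, b: str, width: int = 3) -> str:
    roles = generate_roles(llm, question, width)
    # Layer 1: independent evaluations, one per role.
    first = [llm(judge_prompt(r, question, a, b)) for r in roles]
    # Layer 2: each neuron sees all layer-1 outputs (one discussion round).
    context = "\n".join(first)
    second = [llm(judge_prompt(r, question, a, b, context)) for r in roles]
    return majority([parse_verdict(reply) for reply in second])
```

With `width` roles, this sketch makes one role-generation call plus `2 * width` judging calls per sample, which is the per-sample API cost to budget for.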
Key Findings
A two-layer wide LLM network (WideDeep) raises kappa agreement on LLMEval 2 from 0.2807 (FairEval) to 0.3440 (Table 1).
WideDeep improves accuracy over FairEval on multiple benchmarks; on LLMEval 2, accuracy rises from 0.5735 to 0.6036 (Table 1).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| kappa (LLMEval 2) | 0.3440 (WideDeep c*2 all) | 0.2807 (FairEval) | +0.0633 | LLMEval 2 | Table 1 reports FairEval Kap.=0.2807, WideDeep c*2 (all)=0.3440 | Table 1 |
| Accuracy | 0.6036 (WideDeep c*2 all) | 0.5735 (FairEval) | +0.0301 | LLMEval 2 | Table 1 shows accuracy rise from 0.5735 to 0.6036 | Table 1 |
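To reproduce metrics of this kind on your own evaluator output, accuracy and kappa can be computed from predicted and human labels as sketched below. The source only says "kappa"; this example assumes Cohen's kappa, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of samples where the evaluator matches the human label."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def cohen_kappa(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_o = accuracy(pred, gold)
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    labels = set(pred_counts) | set(gold_counts)
    # Chance agreement from the two marginal label distributions.
    p_e = sum(pred_counts[l] * gold_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sides use a single identical label
    return (p_o - p_e) / (1 - p_e)


# Deltas reported in the table, recomputed from the listed values.
kappa_delta = round(0.3440 - 0.2807, 4)
accuracy_delta = round(0.6036 - 0.5735, 4)
```

Kappa is the more conservative of the two numbers because it discounts agreement that would occur by chance given how often each label is used.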
What To Try In 7 Days
Run the LLMEval 2 mini set (300 samples) to benchmark your current evaluator against WideDeep.
Implement per-sample role-generation prompts and ensemble 5–15 gpt-3.5-turbo calls, then aggregate by voting.
Replace full cross-annotation with sample checks: measure what percentage of predictions humans still need to inspect, then estimate the time and cost savings.
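One possible way to combine the voting and sample-check steps above is to route only low-agreement predictions to human reviewers. The source does not specify a selection rule, so the agreement threshold and the names `vote` and `inspection_plan` are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def vote(verdicts: Sequence[str]) -> Tuple[str, float]:
    """Return the most common verdict and the share of calls that agree with it."""
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)


def inspection_plan(
    all_verdicts: Sequence[Sequence[str]], min_agreement: float = 0.8
) -> Tuple[List[str], List[int], float]:
    """Pre-label every sample and flag low-agreement ones for human checking.

    min_agreement is a hypothetical threshold; tune it on a labeled subset.
    """
    labels, to_inspect = [], []
    for index, verdicts in enumerate(all_verdicts):
        label, agreement = vote(verdicts)
        labels.append(label)
        if agreement < min_agreement:
            to_inspect.append(index)
    return labels, to_inspect, len(to_inspect) / len(all_verdicts)
```

The returned fraction is the share of predictions that still need a human look, which gives a first estimate of the labeling effort saved relative to checking every sample.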
Risks & Boundaries
Limitations
Performance gains are moderate and shown on the provided benchmarks; real gains depend on your tasks and LLM quality.
Deeper stacks (more than 2 layers) decreased performance in the authors' tests.
When Not To Use
When LLM API costs prevent multiple independent calls per sample.
When ground-truth labels are objective single answers (not subjective preferences).
Failure Modes
If neuron roles converge to similar perspectives, added width gives little benefit.
Position or verbosity bias may persist if order-robustness checks are not applied.
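A simple order-robustness check for position bias is to judge each pair twice with the answers swapped and accept the verdict only when both orderings agree. The sketch below is a generic illustration of that idea rather than the paper's procedure; the `Judge` signature and the choice to fall back to "tie" on disagreement are assumptions.

```python
from typing import Callable, Tuple

# Hypothetical interface: (question, first_answer, second_answer) -> "A", "B", or "tie",
# where "A" means the first-listed answer wins.
Judge = Callable[[str, str, str], str]


def flip(verdict: str) -> str:
    """Map a verdict given in swapped order back to the original labeling."""
    return {"A": "B", "B": "A"}.get(verdict, "tie")


def order_robust_verdict(
    judge: Judge, question: str, answer_1: str, answer_2: str
) -> Tuple[str, bool]:
    """Judge in both orders; return (verdict, consistent)."""
    forward = judge(question, answer_1, answer_2)
    backward = flip(judge(question, answer_2, answer_1))
    consistent = forward == backward
    # Disagreement across orderings suggests position bias, so fall back to a tie.
    return (forward if consistent else "tie"), consistent
```

Tracking the share of inconsistent pairs over a batch gives a rough measure of how much position bias remains in the evaluator.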

