Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
17
Why It Matters For Business
WideDeep pre-labels outputs with judgments that agree more closely with human annotators, so teams can cut manual labeling time and cost and scale human evaluation while keeping quality checks in place.
Summary TLDR
The authors propose WideDeep, which treats multiple frozen LLM calls as neurons with automatically generated evaluation 'roles', stacks them into a 2-layer network (one discussion round), and aggregates their outputs via voting or averaging. On a new, diverse LLMEval 2 benchmark (2,553 pairwise samples, 15 tasks, 8 abilities), WideDeep outperforms prior single-layer ensembling (FairEval) by a few accuracy points and raises kappa agreement. In a Chinese LLM labeling pipeline, WideDeep cut human checking time by ~4.6× and lowered per-sample cost by ~60%, while reaching 74% labeling accuracy and high human agreement.
Problem Statement
LLMs can judge generated text, but single-layer ensembles of LLM evaluations are biased (position, verbosity) and unstable. Prior benchmarks for evaluating LLM evaluators are small or narrow. The paper asks whether making the LLM-evaluator network wider (more independent LLM 'neurons') and slightly deeper (one extra integration round) yields fairer, more stable human-aligned judgments.
Main Contribution
WideDeep: a two-layer, wide LLM evaluator that generates per-sample evaluation perspectives (neuron roles), runs multiple independent LLM evaluations, then integrates them in a second layer.
LLMEval 2: a new diverse evaluation benchmark with 2,553 pairwise samples covering 15 tasks and 8 evaluation abilities.
Empirical results showing WideDeep improves accuracy and kappa over single-layer baselines and reduces human labeling effort and cost in a Chinese LLM evaluation use case.
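The two-layer flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `judge` callable, the fixed role list, the prompt strings, and the choice to vote over second-layer outputs only are all hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge: takes a prompt, returns "A", "B", or "tie".
Judge = Callable[[str], str]

def generate_roles(sample: str, n: int) -> List[str]:
    # Placeholder: a real system would ask an LLM for per-sample perspectives.
    base = ["accuracy", "relevance", "fluency", "helpfulness", "safety"]
    return [base[i % len(base)] for i in range(n)]

def layer_one(sample: str, roles: List[str], judge: Judge) -> List[str]:
    # Each neuron judges the sample independently from its own role.
    return [judge(f"Role: {r}\n{sample}") for r in roles]

def layer_two(sample: str, roles: List[str], first: List[str], judge: Judge) -> List[str]:
    # Each neuron judges again after seeing all first-layer verdicts.
    summary = ", ".join(f"{r}={v}" for r, v in zip(roles, first))
    return [judge(f"Role: {r}\nPrevious verdicts: {summary}\n{sample}") for r in roles]

def majority(votes: List[str]) -> str:
    if not votes:
        raise ValueError("no votes to aggregate")
    counts = Counter(votes).most_common()
    # A tie between the two most frequent labels is reported as "tie".
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]

def wide_deep(sample: str, judge: Judge, width: int = 5) -> str:
    roles = generate_roles(sample, width)
    first = layer_one(sample, roles, judge)
    second = layer_two(sample, roles, first, judge)
    # Assumption: only the second-layer verdicts are voted on.
    return majority(second)
```

Passing the judge in as a callable keeps the sketch independent of any specific LLM client; averaging numeric scores instead of voting would be a drop-in replacement for `majority`.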
Key Findings
A two-layer wide LLM network (WideDeep) raises kappa agreement on LLMEval 2 compared with the prior single-layer baseline.
WideDeep improves accuracy over FairEval on multiple benchmarks.
Using WideDeep reduced human checking time and cost in a Chinese LLM labeling pipeline.
Generating distinct neuron roles helps performance.
Network depth beyond two layers hurts performance on the tested data.
Results
kappa (LLMEval 2)
Accuracy
Who Should Care
What To Try In 7 Days
Run LLMEval2 mini (300 samples) to benchmark your current evaluator vs WideDeep.
Implement per-sample role-generation prompts and ensemble 5–15 gpt-3.5-turbo calls, then aggregate by voting.
Replace full cross-annotation with sample checks: measure what fraction of predictions still needs human inspection and estimate the resulting savings.
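One way to estimate the share of predictions to inspect is to flag samples where the ensemble's votes are split. The sketch below assumes vote agreement is a usable proxy for confidence; that assumption, the function names, and the 0.8 threshold are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import List

def vote_agreement(votes: List[str]) -> float:
    # Share of neurons that back the most common verdict.
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)

def inspection_fraction(all_votes: List[List[str]], threshold: float = 0.8) -> float:
    # Fraction of samples whose agreement falls below the (hypothetical) threshold.
    flagged = [v for v in all_votes if vote_agreement(v) < threshold]
    return len(flagged) / len(all_votes)

# Example with four samples and five votes each.
votes = [
    ["A", "A", "A", "A", "A"],
    ["A", "A", "A", "B", "B"],
    ["A", "B", "B", "B", "B"],
    ["A", "A", "B", "B", "tie"],
]
fraction = inspection_fraction(votes)
```

The threshold should be tuned against a small human-checked sample before relying on the estimate.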
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Performance gains are moderate and shown on the provided benchmarks; real gains depend on your tasks and LLM quality.
- Deeper stacks (more than 2 layers) decreased performance in their tests.
- Approach requires many LLM calls; costs rise with neuron count unless cheaper models are used.
- Generated neuron roles and aggregation rules might need task-specific tuning.
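To see how call volume grows with neuron count, a rough back-of-the-envelope helper can be useful. This sketch assumes one role-generation call per sample and one judge call per neuron per layer; those counts, the optional order-swap doubling, and the price parameter are hypothetical inputs, not figures from the paper.

```python
def calls_per_sample(width: int, depth: int = 2, role_calls: int = 1,
                     swap_order: bool = False) -> int:
    # width: neurons per layer; depth: number of layers.
    if width < 1 or depth < 1:
        raise ValueError("width and depth must be at least 1")
    judge_calls = width * depth
    if swap_order:
        # Judging both response orders doubles the judge calls.
        judge_calls *= 2
    return role_calls + judge_calls

def total_cost(n_samples: int, price_per_call: float, **kwargs) -> float:
    # price_per_call is whatever your provider charges on average; supply your own value.
    return n_samples * calls_per_sample(**kwargs) * price_per_call
```

For instance, `calls_per_sample(width=5)` gives 11 calls per sample under these assumptions, and `calls_per_sample(width=15, swap_order=True)` gives 61.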
When Not To Use
- When LLM API costs prevent multiple independent calls per sample.
- When ground-truth labels are objective single answers (not subjective preferences).
- When the deployed LLMs are much weaker than those used in the paper.
Failure Modes
- If neuron roles converge to similar perspectives, added width gives little benefit.
- Position or verbosity bias may persist if order-robustness checks are not applied.
- Poor prompt templates can produce noisy roles or inconsistent scores.
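A simple order-robustness check for the position-bias failure mode is to judge each pair twice with the responses swapped and keep a verdict only when both orderings agree. The sketch below is an assumption-laden illustration: the three-argument `judge` signature and the fallback to "tie" on disagreement are hypothetical choices.

```python
from typing import Callable, List, Tuple

# Hypothetical judge: (question, first_response, second_response) -> "A", "B", or "tie",
# where "A" means the first-shown response wins.
PairJudge = Callable[[str, str, str], str]

def swap_label(verdict: str) -> str:
    # Map a verdict from the swapped ordering back to the original ordering.
    return {"A": "B", "B": "A"}.get(verdict, verdict)

def both_orders(judge: PairJudge, question: str, r1: str, r2: str) -> Tuple[str, str]:
    original = judge(question, r1, r2)
    swapped = swap_label(judge(question, r2, r1))
    return original, swapped

def order_robust_verdict(judge: PairJudge, question: str, r1: str, r2: str) -> str:
    original, swapped = both_orders(judge, question, r1, r2)
    # Disagreement across orderings is treated as a tie (an illustrative policy).
    return original if original == swapped else "tie"

def position_consistency(judge: PairJudge, samples: List[Tuple[str, str, str]]) -> float:
    # Fraction of samples where the verdict survives swapping the order.
    agree = sum(1 for q, r1, r2 in samples
                if len(set(both_orders(judge, q, r1, r2))) == 1)
    return agree / len(samples)
```

A judge that always prefers the first-shown response scores 0.0 on `position_consistency`, which makes the metric a quick diagnostic before trusting ensemble output.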
Core Entities
Models
- gpt-3.5-turbo
- gpt-4
- vicuna-13b
- llama-7b
- bloom-7b
- cerebras-gpt-6.7b
- opt-7b
- pythia-6.9b
Metrics
- Accuracy
- Macro-F1
- kappa correlation coefficient
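For reference, accuracy and kappa can be computed from paired label lists as sketched below. This assumes the kappa in question is Cohen's kappa between evaluator verdicts and human labels; the source names only a "kappa correlation coefficient", so treat that choice as an assumption.

```python
from collections import Counter
from typing import List

def accuracy(pred: List[str], gold: List[str]) -> float:
    # Fraction of predictions that match the human label.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohen_kappa(a: List[str], b: List[str]) -> float:
    # kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement
    # and p_e the agreement expected by chance from the label marginals.
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

pred = ["A", "A", "B", "B"]
gold = ["A", "B", "B", "B"]
acc = accuracy(pred, gold)       # 3 of 4 match
kappa = cohen_kappa(pred, gold)  # p_o = 0.75, p_e = 0.5
```

Kappa discounts agreement that would occur by chance, which is why it can be much lower than accuracy when one label dominates.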
Datasets
- LLMEval 2
- LLMEval2 mini
- FairEval
- PandaLM
Benchmarks
- LLMEval 2
- FairEval
- PandaLM

