Use synthetic crowd comparisons to make LLM judges give deeper, more reliable chain-of-thought evaluations

February 18, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma

Why It Matters For Business

CCE makes automated evaluation more reliable by surfacing subtle errors and producing richer rationales. That reduces the amount of human re-checking needed and improves training-data selection for SFT.

Summary TLDR

The paper introduces Crowd-based Comparative Evaluation (CCE). Instead of judging two responses in isolation, CCE generates many synthetic "crowd" responses and asks the judge LLM to compare the candidates against those crowd responses. That extra context makes the judge produce longer, more detailed chain-of-thought (CoT) rationales and improves evaluation accuracy. On five pairwise preference benchmarks, CCE raises average judge accuracy by 6.7%. CCE also yields better distilled small judges (gains of roughly 1.9–5.6% on the tested setups) and improves rejection sampling for SFT, giving consistent gains on MTBench and AlpacaEval-2.

Problem Statement

LLM-as-a-Judge often gives incomplete or shallow chain-of-thought (CoT) judgments that miss nuanced errors. Common fixes, such as majority voting or adding fixed criteria, either cost a lot or fail to adapt to the specific details of each response. The paper asks: how can we guide judge LLMs to find deeper, response-specific details without blowing up compute?

Main Contribution

CCE: a runtime method that generates diverse synthetic "crowd" responses and uses comparisons to surface fine-grained differences, then conditions the judge on those crowd judgments.

A practical selection pipeline (Criticizing Selection + Outcome Removal) that keeps critical judgments and strips explicit verdicts to reduce bias.

Evidence that CCE improves judge accuracy (avg +6.7% on five benchmarks), enables better distillation to smaller judge models, and yields more effective rejection sampling for SFT.

Open-source code and prompts to reproduce the pipeline and selection/processing steps.
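
The contributions above can be sketched end to end. The following is a minimal illustration, not the paper's released code: the `Judge` callable, the prompt wording, the `Verdict: Candidate` / `Verdict: Crowd` format, and the reading of "critical" as "the candidate lost to the crowd response" are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a judge maps a prompt to a chain-of-thought string.
Judge = Callable[[str], str]

# Assumed verdict format for crowd comparisons; real prompts may differ.
VERDICT_RE = re.compile(r"^\s*Verdict:\s*(Candidate|Crowd)\s*$", re.MULTILINE)


@dataclass
class CrowdJudgment:
    label: str            # which candidate was compared: "A" or "B"
    text: str             # full CoT, verdict line included
    candidate_lost: bool  # True when the judge preferred the crowd response


def judge_against_crowd(judge: Judge, instruction: str, label: str,
                        candidate: str, crowd: List[str]) -> List[CrowdJudgment]:
    """Compare one candidate with every crowd response."""
    judgments = []
    for crowd_response in crowd:
        prompt = (
            f"Instruction: {instruction}\n"
            f"Candidate response: {candidate}\n"
            f"Crowd response: {crowd_response}\n"
            "Compare them in detail, then end with "
            "'Verdict: Candidate' or 'Verdict: Crowd'."
        )
        text = judge(prompt)
        match = VERDICT_RE.search(text)
        if match:  # skip judgments with no parseable verdict
            judgments.append(
                CrowdJudgment(label, text, match.group(1) == "Crowd"))
    return judgments


def criticizing_selection(judgments: List[CrowdJudgment], k: int) -> List[CrowdJudgment]:
    """Keep up to k judgments in which the candidate lost (assumed meaning of 'critical')."""
    return [j for j in judgments if j.candidate_lost][:k]


def remove_outcome(text: str) -> str:
    """Strip the explicit verdict line so only the reasoning remains."""
    return VERDICT_RE.sub("", text).strip()


def build_final_prompt(instruction: str, resp_a: str, resp_b: str,
                       selected: List[CrowdJudgment]) -> str:
    """Condition the final A-vs-B judgment on the verdict-free crowd notes."""
    notes = "\n\n".join(
        f"[Note on Response {j.label}]\n{remove_outcome(j.text)}" for j in selected
    )
    return (
        f"Instruction: {instruction}\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}\n\n"
        f"Crowd notes:\n{notes}\n\n"
        "Using the notes as context, decide which response is better. "
        "End with 'Verdict: A' or 'Verdict: B'."
    )


def cce_judge(judge: Judge, instruction: str, resp_a: str, resp_b: str,
              crowd: List[str], k: int = 4) -> str:
    """Run crowd comparisons, select and clean them, then ask for the final verdict."""
    pool = (judge_against_crowd(judge, instruction, "A", resp_a, crowd)
            + judge_against_crowd(judge, instruction, "B", resp_b, crowd))
    selected = criticizing_selection(pool, k)
    return judge(build_final_prompt(instruction, resp_a, resp_b, selected))
```

Passing the judge in as a plain callable keeps the sketch independent of any particular API client, so a stub can stand in for the model while the selection and verdict-stripping logic is checked.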

Key Findings

CCE improves LLM-as-a-Judge accuracy across five pairwise benchmarks.

Numbers: Average accuracy: Vanilla 73.6% → CCE 80.3% (gain 6.7%)

CCE yields better distilled small judges when their training data contains CCE CoTs.

Numbers: Qwen-2.5-7B: Vanilla-distill 61.1% → CCE-distill 63.0% (gain 1.9%)

Crowd rejection sampling picks better SFT training responses and improves finetuned model scores.

Numbers: Example: Llama 3.1-8B, AlpacaEval-2 (TULU3 10K): Vanilla 19.92/17.17 → Crowd 22.23/19.74 (+2.31/+2.57)

Scaling the number of crowd judgments tends to increase accuracy and CoT length.

Numbers: Performance and CoT length increase as crowd judgments grow from 0 to 16

CCE CoTs contain more key points and higher coverage of the candidate responses.

Results

Accuracy
Value: Vanilla 73.6% → CCE 80.3% (avg gain 6.7%)
Baseline: Vanilla LLM-as-a-Judge

Accuracy
Value: Vanilla avg 74.0% → CCE avg 82.7%
Baseline: Vanilla

Accuracy
Value: Vanilla synthetic judgments avg 61.1% → CCE synthetic judgments avg 63.0%
Baseline: Distillation from Vanilla CoTs

SFT
Value: Vanilla Rejection 19.92/17.17 → Crowd Rejection 22.23/19.74
Baseline: Vanilla Rejection Sampling

Who Should Care

What To Try In 7 Days

Run CCE at test time: generate 8–16 crowd responses per case and feed selected crowd judgments into your judge prompt.

Use Criticizing Selection + Outcome Removal: keep loss-side judgments and strip verdicts before final inference.

Apply crowd rejection sampling to a small SFT pool and compare downstream metrics (MTBench / AlpacaEval).
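
For the rejection-sampling step, one possible shape is a best-of-N selection driven by a pairwise judge. This is a sketch under assumptions: the round-robin scheme, the `PairwiseJudge` signature, and the function names are illustrative and are not taken from the paper. In a CCE setup the `judge` argument would be a crowd-conditioned judge.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical signature: (instruction, response_a, response_b) -> "A" or "B".
PairwiseJudge = Callable[[str, str, str], str]


def select_best(instruction: str, responses: List[str], judge: PairwiseJudge) -> str:
    """Round-robin over all pairs and return the response with the most wins."""
    if not responses:
        raise ValueError("need at least one sampled response")
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = judge(instruction, responses[i], responses[j])
        wins[i if verdict == "A" else j] += 1
    # Ties resolve to the earliest response, since max() keeps the first maximum.
    best = max(range(len(responses)), key=lambda idx: wins[idx])
    return responses[best]


def build_sft_pool(samples: Dict[str, List[str]], judge: PairwiseJudge) -> List[Dict[str, str]]:
    """Keep one judged-best response per instruction as an SFT training pair."""
    return [
        {"instruction": instruction, "response": select_best(instruction, responses, judge)}
        for instruction, responses in samples.items()
    ]
```

A round-robin costs N(N-1)/2 judge calls per instruction, so for a small pilot it helps to keep N low and compare the resulting pool against one selected by a vanilla judge on the same samples.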

Optimization Features

Training Optimization

  • CoT distillation from CCE judgments to small judges
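
As a rough illustration of the distillation step, CCE judgments can be turned into prompt/completion pairs for fine-tuning a small judge. The record schema, prompt wording, and function names below are assumptions made for this sketch. The paper's actual training format may differ.

```python
import json
from typing import Dict, Iterable


def to_distill_record(instruction: str, response_a: str, response_b: str,
                      cot: str, verdict: str) -> Dict[str, str]:
    """Pair a vanilla judging prompt with a CCE-produced CoT and verdict."""
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Explain which response is better, then end with 'Verdict: A' or 'Verdict: B'."
    )
    # The student is trained to reproduce the richer CoT without seeing any crowd.
    completion = f"{cot.strip()}\nVerdict: {verdict}"
    return {"prompt": prompt, "completion": completion}


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Note that the student's prompt contains no crowd notes: the aim in this sketch is for the small judge to internalize the more detailed reasoning rather than depend on crowd generation at inference time.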

Inference Optimization

  • Inference-time scaling via multiple crowd judgments (0–16)
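
To check whether more crowd judgments pay off on your own data, a small sweep over the crowd size can track accuracy and CoT length together. This is a sketch under assumptions: `judge_fn`, the `"gold"` key, and the word-count length measure are hypothetical choices for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: (example, n_crowd_judgments) -> (verdict, cot_text).
JudgeFn = Callable[[dict, int], Tuple[str, str]]


def sweep_crowd_size(examples: List[dict], judge_fn: JudgeFn,
                     sizes: Sequence[int] = (0, 2, 4, 8, 16)) -> Dict[int, Dict[str, float]]:
    """Measure accuracy and average CoT length for each crowd-judgment budget."""
    if not examples:
        raise ValueError("need at least one labeled example")
    results = {}
    for n in sizes:
        hits = 0
        total_words = 0
        for example in examples:
            verdict, cot = judge_fn(example, n)
            hits += int(verdict == example["gold"])  # "gold" is the human preference
            total_words += len(cot.split())          # crude length proxy: whitespace tokens
        results[n] = {
            "accuracy": hits / len(examples),
            "avg_cot_words": total_words / len(examples),
        }
    return results
```

Setting `n = 0` gives the vanilla baseline in the same run, which makes it easier to judge whether the extra calls at 8 or 16 are worth their latency.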

Reproducibility

Data Urls

  • RewardBench (cited)
  • HelpSteer2 (cited)
  • MTBench-Human (cited)
  • JudgeBench (cited)
  • EvalBias (cited)
  • LIMA (HuggingFace link in appendix)
  • TULU3-SFT (HuggingFace link in appendix)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No iterative self-refinement (paper does not study repeated self-iteration).
  • Unclear which crowd LLMs contribute most: the paper uses many models but does not ablate each model's influence.
  • Approach adds inference-time cost and latency due to generating crowd responses and extra judgments.
  • Quality depends on synthetic crowd responses; noisy crowd judgments can mislead the final judge.

When Not To Use

  • When strict low-latency or minimal inference cost is required (CCE needs extra generation and judge calls).
  • If you cannot generate diverse synthetic crowd responses due to API or licensing limits.
  • When dataset size is tiny and added selection complexity offers little benefit.

Failure Modes

  • Crowd responses replicate the same error or bias and reinforce a wrong judgment.
  • Selection heuristics pick uninformative judgments if outcomes correlate with verbosity rather than correctness.
  • Outcome-Removal may remove useful summary signals if misapplied.
  • Higher compute budgets may still fail to help if the judge LLM is poorly calibrated.

Core Entities

Models

  • GPT-4o
  • Qwen 2.5-7B-Instruct
  • Qwen 2.5-32B-Instruct
  • Qwen 2.5-72B-Instruct
  • Llama 3.3-70B-Instruct
  • Llama 3.1-8B-Base
  • Mistral-Nemo

Metrics

  • Accuracy
  • MTBench score
  • AlpacaEval-2 score
  • CoT key point count
  • CoT coverage rate
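
The Accuracy metric here is pairwise agreement with human preference labels. A minimal sketch of that computation follows, with a helper for the absolute gain between two accuracy figures. The function names are illustrative, and the key-point count and coverage rate are not covered because their exact definitions are not given in this summary.

```python
from typing import Sequence


def pairwise_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of judge verdicts that match the human preference labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labeled pair")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


def gain_in_points(baseline_pct: float, method_pct: float) -> float:
    """Absolute difference between two accuracy percentages, to one decimal."""
    return round(method_pct - baseline_pct, 1)
```

For example, `gain_in_points(73.6, 80.3)` returns `6.7`, the average gain quoted in this summary. The figure is an absolute difference between two accuracy percentages, not a relative improvement.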

Datasets

  • RewardBench
  • HelpSteer2
  • MTBench-Human
  • JudgeBench
  • EvalBias
  • LIMA
  • TULU3-SFT
  • TULU3-Preference

Benchmarks

  • RewardBench
  • HelpSteer2
  • MTBench-Human
  • JudgeBench
  • EvalBias

Context Entities

Models

  • GPT-4o-mini
  • Qwen2.5-0.5B-Instruct
  • Qwen2.5-1.5B-Instruct
  • Qwen2.5-3B-Instruct
  • Llama-3.2-3B-Instruct
  • Mistral10-Instruct
  • Claude-3.5-Sonnet
  • DeepSeek-v3