Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
CCE makes automated evaluation more reliable by surfacing subtle errors and richer rationales, cutting the need for as much human re-checking and improving training-data selection for SFT.
Summary TLDR
The paper introduces Crowd-based Comparative Evaluation (CCE). Instead of judging two responses in isolation, CCE generates many synthetic "crowd" responses and asks the judge LLM to compare the candidates against those crowd responses. That extra context makes the judge produce longer, more detailed chain-of-thought (CoT) rationales and improves evaluation accuracy. On five pairwise preference benchmarks CCE raises average judge accuracy by 6.7%. CCE also yields better distilled small judges (+~1.9–5.6% on tested setups) and improves rejection sampling for SFT, giving consistent gains on MTBench and AlpacaEval-2.
Problem Statement
LLM-as-a-Judge often gives incomplete or shallow chain-of-thought (CoT) judgments that miss nuanced errors. Common fixes—majority voting or adding fixed criteria—either cost a lot or fail to adapt to the specific details of each response. The paper asks: how can we guide judge LLMs to find deeper, response-specific details without blowing up compute?
Main Contribution
CCE: a runtime method that generates diverse synthetic "crowd" responses and uses comparisons to surface fine-grained differences, then conditions the judge on those crowd judgments.
A practical selection pipeline (Criticizing Selection + Outcome Removal) that keeps critical judgments and strips explicit verdicts to reduce bias.
Show that CCE improves judge accuracy (avg +6.7% on five benchmarks), enables better distillation to smaller judge models, and yields more effective rejection sampling for SFT.
Open-source code and prompts to reproduce the pipeline and selection/processing steps.
Key Findings
CCE improves LLM-as-a-Judge accuracy across five pairwise benchmarks.
CCE yields better distilled small judges when their training data contains CCE CoTs.
Crowd rejection sampling picks better SFT training responses and improves finetuned model scores.
Scaling the number of crowd judgments tends to increase accuracy and CoT length.
CCE CoTs contain more key points and higher coverage of the candidate responses.
Results
Accuracy
Accuracy
Accuracy
SFT
Who Should Care
What To Try In 7 Days
Run CCE at test time: generate 8–16 crowd responses per case and feed selected crowd judgments into your judge prompt.
Use Criticizing Selection + Outcome Removal: keep loss-side judgments and strip verdicts before final inference.
Apply crowd rejection sampling to a small SFT pool and compare downstream metrics (MTBench / AlpacaEval).
Optimization Features
Training Optimization
- CoT distillation from CCE judgments to small judges
Inference Optimization
- Inference-time scaling via multiple crowd judgments (0–16)
Reproducibility
Code Urls
Data Urls
- RewardBench (cited)
- HelpSteer2 (cited)
- MTBench-Human (cited)
- JudgeBench (cited)
- EvalBias (cited)
- LIMA (HuggingFace link in appendix)
- TULU3-SFT (HuggingFace link in appendix)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No iterative self-refinement (paper does not study repeated self-iteration).
- Unclear which crowd LLMs contribute most—they use many models but do not ablate influence per-model.
- Approach adds inference-time cost and latency due to generating crowd responses and extra judgments.
- Quality depends on synthetic crowd responses; noisy crowd judgments can mislead the final judge.
When Not To Use
- When strict low-latency or minimal inference cost is required (CCE needs extra generation and judge calls).
- If you cannot generate diverse synthetic crowd responses due to API or licensing limits.
- When dataset size is tiny and added selection complexity offers little benefit.
Failure Modes
- Crowd responses replicate the same error or bias and reinforce a wrong judgment.
- Selection heuristics pick uninformative judgments if outcomes correlate with verbosity rather than correctness.
- Outcome-Removal may remove useful summary signals if misapplied.
- Higher compute budgets may still fail to help if judge LLM is poorly calibrated.
Core Entities
Models
- GPT-4o
- Qwen 2.5-7B-Instruct
- Qwen 2.5-32B-Instruct
- Qwen 2.5-72B-Instruct
- Llama 3.3-70B-Instruct
- Llama 3.1-8B-Base
- Mistral-Nemo
Metrics
- Accuracy
- MTBench score
- AlpacaEval-2 score
- CoT key point count
- CoT coverage rate
Datasets
- RewardBench
- HelpSteer2
- MTBench-Human
- JudgeBench
- EvalBias
- LIMA
- SFT
- TULU3-Preference
Benchmarks
- RewardBench
- HelpSteer2
- MTBench-Human
- JudgeBench
- EvalBias
Context Entities
Models
- GPT-4o-mini
- Qwen2.5-0.5B-Instruct
- Qwen-2.5-1.5B-Instruct
- Qwen2.5-3B-Instruct
- Llama-3.2-3B-Instruct
- Mistral10-Instruct
- Claude-3.5-Sonnet
- DeepSeek-v3

