Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
Automatic leaderboards and internal evaluations can overstate model quality when the same or related LLMs generate training data and judge models; this risks bad product decisions and misallocated resources.
Summary TLDR
The paper identifies and measures "preference leakage": a bias that appears when an LLM used to generate synthetic training data (the generator) is related to the LLM used to evaluate models (the judge). This relatedness (same model, inheritance, or same family) causes judges to prefer student models trained on that synthetic data. The authors define a preference leakage score (PLS), run controlled experiments across multiple LLMs and benchmarks, show that leakage grows with closer relatedness and with more synthetic data, and test mitigation steps, of which contextual calibration works best.
Problem Statement
Using the same or related LLMs to synthesize training data and to judge model outputs can bias automatic evaluations. This "preference leakage" inflates scores for student models that inherit stylistic or formatting cues from the generator, undermining fair model comparison.
Main Contribution
Define "preference leakage": evaluators favor student models when generator and judge are related.
Introduce a measurable metric, Preference Leakage Score (PLS), for pairwise judge bias.
Run extensive experiments across multiple LLMs, benchmarks, and conditions, showing that PLS > 0 is common.
Diagnose mechanisms: stylistic/format cues drive leakage and smaller students are more affected.
Benchmark mitigation methods; contextual calibration reduces bias most effectively.
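To make the idea of a pairwise leakage metric concrete, here is a minimal Python sketch. It is a simplified illustration and not necessarily the paper's exact PLS formula: the function name `leakage_score`, the layout of the win-rate table, and the relative-gap definition are all assumptions, and the win rates are made up.

```python
from statistics import mean

def leakage_score(win_rate, a, b):
    """Simplified, hypothetical leakage score for the generator/judge pair (a, b).

    win_rate[(student, judge)] is the win rate of the student trained on
    `student`'s synthetic data, as scored by `judge`.
    """
    gaps = []
    for own, other in ((a, b), (b, a)):
        related = win_rate[(own, own)]      # judge related to the student's generator
        unrelated = win_rate[(own, other)]  # judge unrelated to the student's generator
        baseline = (related + unrelated) / 2
        # Relative advantage the student gets from the related judge.
        gaps.append((related - baseline) / baseline)
    return mean(gaps)

# Made-up win rates, for illustration only.
rates = {
    ("A", "A"): 0.60, ("A", "B"): 0.40,
    ("B", "B"): 0.55, ("B", "A"): 0.45,
}
score = leakage_score(rates, "A", "B")
```

Under this sketch, a score above zero means each student does better with the judge related to its own generator than with the other judge; a score near zero means the two judges treat the students alike.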
Key Findings
Preference leakage creates measurable bias in LLM judges.
Degree of relatedness predicts leakage strength.
More synthetic data increases leakage linearly.
Learning method changes leakage magnitude.
Surface-level cues drive much of the leakage.
Contextual calibration best mitigates leakage on human-labeled data.
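The general idea behind contextual calibration can be sketched as follows: estimate the judge's built-in preference on a neutral or content-free comparison, then divide that prior out of its verdict on a real comparison. This summary does not specify how the paper applies the method to LLM judges, so the function name `calibrate`, the two-verdict setup, and the numbers below are assumptions.

```python
def calibrate(probs, prior):
    """Divide out a judge's estimated prior and renormalize.

    probs: judge probabilities for each verdict on a real comparison.
    prior: judge probabilities on a content-free or neutral comparison,
           used here as an estimate of its built-in preference.
    """
    if len(probs) != len(prior):
        raise ValueError("probs and prior must have the same length")
    scaled = [p / q for p, q in zip(probs, prior)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Made-up numbers: the judge leans 60/40 toward option A even on neutral input.
prior = [0.6, 0.4]
raw = [0.6, 0.4]  # the same lean on a real pair, so no evidence beyond the prior
calibrated = calibrate(raw, prior)
```

In this example the calibrated verdict is an even split, because the raw preference is fully explained by the prior. The Failure Modes section's caution about overcorrection applies: if the prior estimate is off, a legitimately better response can be penalized.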
Results
Preference Leakage Score (example)
PLS by learning method
Mitigation (Error Bias)
Who Should Care
What To Try In 7 Days
Check evaluator vs generator lineage: avoid same-family judges for models trained on synthetic data.
Run a small PLS check: score the same output pairs with a generator-related judge and with an unrelated judge, then compare their verdicts.
Paraphrase or normalize candidate outputs before automated judging to cut stylistic bias quickly.
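The normalization step can be approximated with a few regular expressions. The sketch below only strips common Markdown cues (headings, list markers, bold markers, extra whitespace) and does not paraphrase; the function name `normalize_surface` and the choice of cues are assumptions, not something the paper prescribes.

```python
import re

def normalize_surface(text):
    """Strip common formatting cues so a judge sees plainer text."""
    # Markdown headings such as "## Title"
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Bullet and numbered list markers at the start of a line
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+[.)])[ \t]+", "", text, flags=re.MULTILINE)
    # Bold markers: **text** or __text__
    text = re.sub(r"(\*\*|__)(.+?)\1", r"\2", text)
    # Runs of spaces or tabs
    text = re.sub(r"[ \t]+", " ", text)
    # Blank lines
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()

sample = "## Answer\n\n- **First** point\n- Second   point\n"
plain = normalize_surface(sample)
```

Apply the same normalization to every candidate in a comparison so that no response is advantaged by the preprocessing itself.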
Reproducibility
Data Urls
- https://github.com/llm-as-a-judge (resources referenced)
- AlpacaEval 2.0 (public benchmark)
- Arena-Hard (public benchmark)
- Ultrafeedback dataset (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use a subset of judge families and pairwise benchmarks; other judges may behave differently.
- PLS focuses on pairwise settings; multi-judge aggregation effects need more study.
- Real-world leaderboards lack full provenance metadata, limiting large-scale correction tests.
When Not To Use
- When all evaluations are human-only and not automated.
- When generator and judge models are provably independent and vetted.
- When the application tolerates stylistic preference (e.g., branded voice checks).
Failure Modes
- Calibration may overcorrect and penalize legitimately better responses.
- Detectors for stylistic leakage can miss subtle semantic alignment that still biases judges.
- Mitigations tuned on one benchmark may not generalize to different tasks or languages.
Core Entities
Models
- GPT-4o-2024-11-20
- Gemini-1.5-flash
- LLaMA-3.3-70B-Instruct-Turbo
- Mistral-7B-v0.1
- Qwen-2.5-14B
- Claude-3.5-Sonnet
- Qwen-3-8B
Metrics
- Preference Leakage Score (PLS)
- Error Bias
Datasets
- Ultrafeedback
- OASST
- LIMA
- MOSS
Benchmarks
- Arena-Hard
- AlpacaEval 2.0
- PPE
- MTBench
- Human Preference
Context Entities
Models
- Vicuna
- Alpaca
- GPT-3.5-turbo
- Claude-3.5
- Gemini-2.0
Metrics
- win-rate
- Spearman correlation (reported for Arena-Hard)
Datasets
- Arena-Hard (m-ARENAHARD Chinese)
- XALPACAEVAL Chinese
Benchmarks
- LMArena
- leaderboards referenced in Section 5

