Overview
The paper defines a clear metric (PLS), reports results that hold up across benchmarks and statistical tests, and proposes a practical mitigation. The results are reproducible but limited to specific judge and generator families and to pairwise benchmarks.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Automatic leaderboards and internal evaluations can overstate model quality when the same or related LLMs generate training data and judge models; this risks bad product decisions and misallocated resources.
Summary TLDR
The paper identifies and measures "preference leakage": a bias that appears when the LLM used to generate synthetic training data (the generator) is related to the LLM used to evaluate models (the judge). This relatedness (same model, inheritance, or same family) causes judges to prefer student models trained on that synthetic data. The authors define a preference leakage score (PLS), run controlled experiments across multiple LLMs and benchmarks, show that leakage is stronger with greater relatedness and more synthetic data, and test mitigation steps, of which contextual calibration works best.
Problem Statement
Using the same or related LLMs to synthesize training data and to judge model outputs can bias automatic evaluations. This "preference leakage" inflates scores for student models that inherit stylistic or formatting cues from the generator, undermining fair model comparison.
Main Contribution
Define "preference leakage": evaluators favor student models when generator and judge are related.
Introduce a measurable metric, Preference Leakage Score (PLS), for pairwise judge bias.
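As a rough illustration of how a pairwise leakage score could be computed, the sketch below measures how far each judge's win rate for its own related student sits above the average of both judges' win rates for that student, then averages the two relative gaps. The function name `leakage_score`, the `win_rate` dictionary layout, and the numbers are assumptions made for illustration; this is not presented as the paper's exact definition or data.

```python
def leakage_score(win_rate, a, b):
    """Illustrative pairwise leakage score (an assumption, not the paper's exact formula).

    win_rate[(judge, student)] is the fraction of pairwise comparisons in which
    `judge` prefers `student`. Student `a` is assumed to be trained on data
    generated by judge `a`, and student `b` on data generated by judge `b`.
    """
    def relative_gap(own, other):
        # Average rating of student `own` across its related judge and the other judge.
        avg = (win_rate[(own, own)] + win_rate[(other, own)]) / 2
        # How far the related judge sits above that average, relative to the average.
        return (win_rate[(own, own)] - avg) / avg

    return (relative_gap(a, b) + relative_gap(b, a)) / 2


# Hypothetical win rates: each judge favors its own related student.
win_rate = {
    ("A", "A"): 0.60, ("B", "A"): 0.40,
    ("A", "B"): 0.40, ("B", "B"): 0.60,
}
score = leakage_score(win_rate, "A", "B")
print(f"leakage score: {score:.3f}")

# With no favoritism, every win rate is equal and the score is zero.
neutral = {key: 0.5 for key in win_rate}
neutral_score = leakage_score(neutral, "A", "B")
```

A positive score means each judge rates its related student higher than the other judge does; a score near zero suggests no measurable favoritism under this simplified definition.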
Key Findings
Preference leakage creates measurable bias in LLM judges.
Degree of relatedness predicts leakage strength.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Preference Leakage Score (example) | Mistral (GPT-4o & Gemini) avg PLS = 23.6% | — | — | Arena-Hard & AlpacaEval 2.0 | Table 1; Section 4.2 | Table 1 |
| Preference Leakage Score (example) | Qwen-2.5 (GPT-4o & Gemini) avg PLS = 27.9% | — | — | Arena-Hard & AlpacaEval 2.0 | Table 1; Section 4.2 | Table 1 |
What To Try In 7 Days
Check evaluator vs generator lineage: avoid same-family judges for models trained on synthetic data.
Run a small PLS check: compare judge choices when generator-related vs unrelated judges.
Paraphrase or normalize candidate outputs before automated judging to cut stylistic bias quickly.
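The first two checks in the list above can be sketched in a few lines. The lineage table, the model names, and the helper names `relatedness` and `preference_gap` are all hypothetical; a real check would need lineage information gathered from model cards or vendor documentation, and verdicts collected from actual judge runs.

```python
# Hypothetical lineage table: model name -> (family, parent model or None).
LINEAGE = {
    "judge-x": ("family-1", None),
    "gen-x-finetuned": ("family-1", "judge-x"),
    "gen-x-small": ("family-1", None),
    "judge-y": ("family-2", None),
}


def relatedness(generator, judge, lineage=LINEAGE):
    """Classify a generator/judge pair as same model, inheritance, same family, or unrelated."""
    if generator == judge:
        return "same model"
    g_family, g_parent = lineage[generator]
    j_family, j_parent = lineage[judge]
    if g_parent == judge or j_parent == generator:
        return "inheritance"
    if g_family == j_family:
        return "same family"
    return "unrelated"


def preference_gap(related_votes, unrelated_votes):
    """Difference in how often each judge picks the student model.

    Each argument is a non-empty list of booleans, True when the judge
    preferred the student model in a pairwise comparison.
    """
    def rate(votes):
        return sum(votes) / len(votes)

    return rate(related_votes) - rate(unrelated_votes)


# Hypothetical verdicts on the same four comparisons.
gap = preference_gap([True, True, True, False], [True, False, True, False])
print(relatedness("gen-x-finetuned", "judge-x"), gap)
```

A large positive gap on the same comparison set is a signal to re-run the evaluation with an unrelated judge before trusting the ranking.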
Risks & Boundaries
Limitations
Experiments use a subset of judge families and pairwise benchmarks; other judges may behave differently.
PLS focuses on pairwise settings; multi-judge aggregation effects need more study.
When Not To Use
When all evaluations are performed by humans rather than by automated judges.
When generator and judge models are provably independent and vetted.
Failure Modes
Calibration may overcorrect and penalize legitimately better responses.
Detectors for stylistic leakage can miss subtle semantic alignment that still biases judges.

