When synthetic training data and LLM evaluators are related, evaluators unfairly favor the student models

February 3, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper presents a clear metric (PLS), stable experiments across benchmarks and statistical tests, and practical mitigation; results are reproducible but focused on specific judge/generator families and pairwise benchmarks.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu

Why It Matters For Business

Automatic leaderboards and internal evaluations can overstate model quality when the same or related LLMs generate training data and judge models; this risks bad product decisions and misallocated resources.

Summary TLDR

The paper identifies and measures "preference leakage": a bias that appears when the LLM used to generate synthetic training data (the generator) is related to the LLM used to evaluate models (the judge). Relatedness can mean the same model, an inheritance relationship, or membership in the same model family, and it causes judges to prefer student models trained on that synthetic data. The authors define a preference leakage score (PLS), run controlled experiments across multiple LLMs and benchmarks, show that leakage grows with closer relatedness and with more synthetic data, and test mitigation steps, of which contextual calibration works best.

Problem Statement

Using the same or related LLMs to synthesize training data and to judge model outputs can bias automatic evaluations. This "preference leakage" inflates scores for student models that inherit stylistic or formatting cues from the generator, undermining fair model comparison.

Main Contribution

Define "preference leakage": evaluators favor student models when generator and judge are related.

Introduce a measurable metric, Preference Leakage Score (PLS), for pairwise judge bias.
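As a rough illustration of how a pairwise leakage score could be computed, the sketch below assumes PLS is the average relative lift in win rate that each student receives from its own related judge, measured against that student's mean win rate across both judges. This summary does not spell out the formula, so the function name, the argument layout, and this particular definition are assumptions to check against the paper before relying on them.

```python
def preference_leakage_score(win_rate, model_a, model_b):
    """Assumed PLS form: mean relative lift each student gets from its related judge.

    win_rate[(student, judge)] is the win rate (0-100) of the student trained on
    synthetic data from `student`, as scored by the judge model `judge`.
    """
    def relative_lift(own, other):
        # Mean win rate of this student across the related and unrelated judge.
        avg = (win_rate[(own, own)] + win_rate[(own, other)]) / 2
        if avg == 0:
            raise ValueError("average win rate is zero; relative lift is undefined")
        return (win_rate[(own, own)] - avg) / avg

    return 100 * (relative_lift(model_a, model_b) + relative_lift(model_b, model_a)) / 2


# Invented win rates, for illustration only (not taken from the paper).
example = {
    ("gpt", "gpt"): 60.0, ("gpt", "gem"): 40.0,
    ("gem", "gem"): 60.0, ("gem", "gpt"): 40.0,
}
pls = preference_leakage_score(example, "gpt", "gem")
print(f"PLS = {pls:.1f}%")
```

Under this definition, a score near zero means each judge treats both students alike, while a positive score means each judge tilts toward the student trained on its own data.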

Key Findings

Preference leakage creates measurable bias in LLM judges.

Numbers: Average PLS reaches 23.6% for Mistral and 27.9% for Qwen-2.5 (GPT-4o & Gemini pair, Table 1)

Practical Use: If you train models on synthetic data from a given LLM, avoid using that same or a closely related LLM as the evaluator; reported improvements may be inflated by ~20–30% on tested benchmarks.

Evidence Ref: Table 1, Section 4.2

Degree of relatedness predicts leakage strength.

Numbers: Same-model avg PLS 23.6%; same-family (same series) avg PLS 8.9%; different-series avg PLS 2.8% (Table 2)

Practical Use: Prefer independent evaluator models (different family/series) to reduce biased judgments; moving from a same-model judge to a different-series judge lowers average PLS from 23.6% to 2.8% in tested cases.

Evidence Ref: Table 2, Section 5.2
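The relatedness levels discussed above (same model, inheritance, same family) can be turned into a simple lineage check when choosing a judge. The sketch below is one hypothetical way to do that; the `ModelInfo` record, the `family` and `parent` fields, and the family labels assigned to the example models are assumptions made for illustration, not metadata from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelInfo:
    name: str
    family: str                   # assumed label, e.g. a vendor or model line
    parent: Optional[str] = None  # model it was fine-tuned or distilled from, if known


def relatedness(generator: ModelInfo, judge: ModelInfo) -> str:
    """Classify the generator/judge relationship from closest to most distant."""
    if generator.name == judge.name:
        return "same_model"
    if generator.parent == judge.name or judge.parent == generator.name:
        return "inheritance"
    if generator.family == judge.family:
        return "same_family"
    return "unrelated"


def independent_judges(generator: ModelInfo, candidates: list) -> list:
    """Keep only the candidate judges with no known relation to the generator."""
    return [c for c in candidates if relatedness(generator, c) == "unrelated"]


# Family labels below are illustrative assumptions.
gpt4o = ModelInfo("GPT-4o", family="GPT")
gemini_flash = ModelInfo("Gemini-1.5-flash", family="Gemini")
gemini_2 = ModelInfo("Gemini-2.0", family="Gemini")

safe = independent_judges(gemini_flash, [gpt4o, gemini_flash, gemini_2])
print([m.name for m in safe])
```

A check like this only catches lineage that is recorded; undisclosed distillation or shared training data would still pass as "unrelated".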

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Preference Leakage Score (example) | Mistral (GPT-4o & Gemini) avg PLS = 23.6% | | | Arena-Hard & AlpacaEval 2.0 | Table 1; Section 4.2 | Table 1 |
| Preference Leakage Score (example) | Qwen-2.5 (GPT-4o & Gemini) avg PLS = 27.9% | | | Arena-Hard & AlpacaEval 2.0 | Table 1; Section 4.2 | Table 1 |

What To Try In 7 Days

Check evaluator vs generator lineage: avoid same-family judges for models trained on synthetic data.

Run a small PLS check: compare judge choices when generator-related vs unrelated judges.

Paraphrase or normalize candidate outputs before automated judging to cut stylistic bias quickly.
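A small leakage check of the kind suggested above could compare how often a related judge and an unrelated judge prefer the same student on the same prompts. The sketch below is a minimal version under stated assumptions: verdicts are already collected as booleans aligned by prompt, and the helper names, the paired bootstrap, and the sample verdicts are all illustrative rather than the paper's procedure.

```python
import random


def win_rate(verdicts):
    """Fraction of comparisons in which the judge preferred the student."""
    return sum(verdicts) / len(verdicts)


def win_rate_gap(related, unrelated, n_boot=2000, seed=0):
    """Gap in student win rate (related judge minus unrelated judge).

    Both lists hold one boolean per prompt, in the same prompt order.
    Returns (gap, lower, upper), where lower/upper is a rough 95% paired
    bootstrap interval for the gap.
    """
    if not related or len(related) != len(unrelated):
        raise ValueError("verdict lists must be non-empty and aligned by prompt")
    rng = random.Random(seed)
    n = len(related)
    gap = win_rate(related) - win_rate(unrelated)
    boots = []
    for _ in range(n_boot):
        # Resample prompts with replacement, keeping each pair of verdicts together.
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(sum(related[i] - unrelated[i] for i in idx) / n)
    boots.sort()
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return gap, lower, upper


# Invented verdicts for ten prompts, for illustration only.
related_judge = [True] * 8 + [False] * 2
unrelated_judge = [True] * 5 + [False] * 5
gap, lower, upper = win_rate_gap(related_judge, unrelated_judge)
print(f"gap = {gap:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

A gap whose interval sits clearly above zero suggests the related judge is favoring the student; with only a handful of prompts the interval will be wide, so the result is best read as a screening signal.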

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/llm-as-a-judge (resources referenced)

AlpacaEval 2.0 (public benchmark)

Arena-Hard (public benchmark)

Ultrafeedback dataset (public)

Risks & Boundaries

Limitations

Experiments use a subset of judge families and pairwise benchmarks; other judges may behave differently.

PLS focuses on pairwise settings; multi-judge aggregation effects need more study.

When Not To Use

When all evaluations are human-only and not automated.

When generator and judge models are provably independent and vetted.

Failure Modes

Calibration may overcorrect and penalize legitimately better responses.

Detectors for stylistic leakage can miss subtle semantic alignment that still biases judges.
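To make the calibration risk concrete, the sketch below applies the generic contextual-calibration recipe: divide the judge's preference probabilities by the probabilities it produces on a content-free input, then renormalize. This summary does not describe how the authors implemented calibration, so the function name, the two-option setup, and all numbers are assumptions for illustration.

```python
def calibrate(probs, content_free_probs):
    """Rescale judge probabilities by its content-free prior and renormalize."""
    if len(probs) != len(content_free_probs):
        raise ValueError("probability vectors must have the same length")
    scaled = [p / prior for p, prior in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Invented numbers: the judge slightly prefers response A on the real pair,
# but also leans strongly toward A when shown a content-free placeholder pair.
raw = [0.55, 0.45]    # P(A wins), P(B wins) on the real comparison
prior = [0.70, 0.30]  # same probabilities on a content-free input

calibrated = calibrate(raw, prior)
print([round(p, 3) for p in calibrated])
```

In this example the verdict flips from A to B after calibration. If the content-free prior is estimated poorly, the same rescaling can penalize a response that was legitimately better, which is the overcorrection failure mode noted above.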

Core Entities

Models

GPT-4o-2024-11-20, Gemini-1.5-flash, LLaMA-3.3-70B-Instruct-turbo, Mistral-7B-v0.1, Qwen-2.5-14B, Claude-3.5-Sonnet, Qwen-3-8B

Metrics

Preference Leakage Score (PLS), Error Bias

Datasets

Ultrafeedback, OASST, LIMA, MOSS

Benchmarks

Arena-Hard, AlpacaEval 2.0, PPE, MTBench, Human Preference

Context Entities

Models

Vicuna, Alpaca, GPT-3.5-turbo, Claude-3.5, Gemini-2.0

Metrics

win-rate, Spearman correlation (reported for Arena-Hard)

Datasets

Arena-Hard (m-ARENAHARD Chinese), XALPACAEVAL Chinese

Benchmarks

LMArena, leaderboards referenced in Section 5