When synthetic training data and LLM evaluators are related, evaluators unfairly favor the student models

Overview

Decision Snapshot: Ready For Pilot

The paper presents a clear metric (PLS), stable experiments across benchmarks and statistical tests, and practical mitigation; results are reproducible but focused on specific judge/generator families and pairwise benchmarks.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automatic leaderboards and internal evaluations can overstate model quality when the same or related LLMs generate training data and judge models; this risks bad product decisions and misallocated resources.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

The paper identifies and measures "preference leakage": a bias that appears when the LLM used to generate synthetic training data (the generator) is related to the LLM used to evaluate models (the judge). This relatedness (same model, inheritance, or same family) causes judges to prefer student models trained on that synthetic data. The authors define a preference leakage score (PLS), run controlled experiments across multiple LLMs and benchmarks, show that leakage grows with closer relatedness and with more synthetic data, and test mitigation steps, of which contextual calibration works best.

Problem Statement

Using the same or related LLMs to synthesize training data and to judge model outputs can bias automatic evaluations. This "preference leakage" inflates scores for student models that inherit stylistic or formatting cues from the generator, undermining fair model comparison.

Main Contribution

Define "preference leakage": evaluators favor student models when generator and judge are related.

Introduce a measurable metric, Preference Leakage Score (PLS), for pairwise judge bias.
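As a rough illustration of what a pairwise leakage score can look like, the sketch below compares each student's win rate under its related judge with its average win rate across both judges, then averages the two relative gaps. The formula, the function name `preference_leakage_score`, the model labels `A` and `B`, and the win rates are all assumptions made for illustration; the paper itself gives the authoritative definition of PLS.

```python
def preference_leakage_score(wr, i, j):
    """Illustrative pairwise leakage score (assumed form, not quoted from the paper).

    wr[(s, judge)] is the win rate of the student trained on generator s's
    data, measured against the other student, as decided by `judge`.
    """
    def relative_excess(student, other_judge):
        related = wr[(student, student)]        # verdict of the related judge
        unrelated = wr[(student, other_judge)]  # verdict of the other judge
        avg = (related + unrelated) / 2
        return (related - avg) / avg

    return (relative_excess(i, j) + relative_excess(j, i)) / 2


# Hypothetical win rates: each judge favors the student trained on its own data.
win_rates = {
    ("A", "A"): 0.60, ("A", "B"): 0.40,
    ("B", "B"): 0.60, ("B", "A"): 0.40,
}
score = preference_leakage_score(win_rates, "A", "B")
print(f"PLS = {score:.1%}")  # prints: PLS = 20.0%
```

A score near zero means both judges agree on the students; a positive score means each judge rates the student trained on its own data above what the other judge reports.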

Key Findings

Preference leakage creates measurable bias in LLM judges.

Numbers: Average PLS reaches 23.6% for Mistral and 27.9% for Qwen-2.5 (both with GPT-4o & Gemini, Table 1)

Practical Use: If you train models on synthetic data from a given LLM, avoid using that same or a closely related LLM as the evaluator; reported improvements may be inflated by ~20–30% on tested benchmarks.

Evidence Ref: Table 1, Section 4.2

Degree of relatedness predicts leakage strength.

Numbers: Same-model avg PLS 23.6%; same-family (same series) avg PLS 8.9%; different-series avg PLS 2.8% (Table 2)

Practical Use: Prefer independent evaluator models (different family/series) to reduce biased judgments; in tested cases, moving from a same-model judge to a different-series judge lowers avg PLS from ~24% to under ~3%, while a same-series judge still shows ~9%.

Evidence Ref: Table 2, Section 5.2
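The relatedness types named in the summary (same model, inheritance, same family) can be turned into a simple pre-evaluation check. The sketch below is a minimal example, assuming you maintain your own lineage metadata; the model identifiers, the `FAMILY` and `TRAINED_ON_OUTPUTS_OF` tables, and the reading of "inheritance" as "one model was trained on the other's outputs" are hypothetical and not taken from the paper.

```python
from enum import Enum


class Relatedness(Enum):
    SAME_MODEL = "same model"
    INHERITANCE = "inheritance"
    SAME_FAMILY = "same family"
    UNRELATED = "unrelated"


# Hypothetical lineage metadata; replace with entries from your own model registry.
FAMILY = {
    "judge-a-large": "family-a",
    "judge-a-small": "family-a",
    "judge-b": "family-b",
}
# Hypothetical: maps a model to the model whose outputs it was trained on.
TRAINED_ON_OUTPUTS_OF = {"judge-a-distilled": "judge-a-large"}


def relatedness(generator, judge):
    """Classify how closely a data generator and a judge are related."""
    if generator == judge:
        return Relatedness.SAME_MODEL
    if (TRAINED_ON_OUTPUTS_OF.get(judge) == generator
            or TRAINED_ON_OUTPUTS_OF.get(generator) == judge):
        return Relatedness.INHERITANCE
    fam_g, fam_j = FAMILY.get(generator), FAMILY.get(judge)
    if fam_g is not None and fam_g == fam_j:
        return Relatedness.SAME_FAMILY
    # Missing family metadata is treated as unrelated; a stricter check could raise.
    return Relatedness.UNRELATED


for judge in ["judge-a-large", "judge-a-distilled", "judge-a-small", "judge-b"]:
    print(judge, "->", relatedness("judge-a-large", judge).value)
```

Anything other than `UNRELATED` is a signal to pick a different judge or to treat the resulting scores with caution.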

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Preference Leakage Score (example)	Mistral (GPT-4o & Gemini) avg PLS = 23.6%	—	—	Arena-Hard & AlpacaEval 2.0	Table 1; Section 4.2	Table 1
Preference Leakage Score (example)	Qwen-2.5 (GPT-4o & Gemini) avg PLS = 27.9%	—	—	Arena-Hard & AlpacaEval 2.0	Table 1; Section 4.2	Table 1

What To Try In 7 Days

Check evaluator vs generator lineage: avoid same-family judges for models trained on synthetic data.

Run a small PLS check: compare judge choices when generator-related vs unrelated judges.

Paraphrase or normalize candidate outputs before automated judging to cut stylistic bias quickly.
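For the normalization step, a minimal sketch is shown below. It only strips surface Markdown styling (headings, bold markers, bullet characters, blank-line runs) so that two candidates differ in content rather than formatting. The function name `normalize_for_judging` and the choice of rules are assumptions; the paper does not prescribe this procedure, and the regexes do not protect code spans, so `**` inside code would also be altered.

```python
import re


def normalize_for_judging(text):
    """Strip common Markdown styling before sending a response to a judge."""
    # Drop heading markers such as "## " at the start of a line.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Remove bold markers but keep the emphasized words.
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    # Use a single bullet character for all list items.
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "- ", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


styled = "## Answer\n\n**Yes**, it works.\n\n\n\n* item one\n+ item two\n"
print(normalize_for_judging(styled))
```

Surface normalization will not remove deeper stylistic cues such as phrasing or response length; paraphrasing with an unrelated model is the heavier alternative mentioned above.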

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/David-Li0406/Preference-Leakage

Data URLs

https://github.com/llm-as-a-judge (resources referenced)

AlpacaEval 2.0 (public benchmark)

Arena-Hard (public benchmark)

Ultrafeedback dataset (public)

Risks & Boundaries

Limitations

Experiments use a subset of judge families and pairwise benchmarks; other judges may behave differently.

PLS focuses on pairwise settings; multi-judge aggregation effects need more study.

When Not To Use

When all evaluations are human-only and not automated.

When generator and judge models are provably independent and vetted.

Failure Modes

Calibration may overcorrect and penalize legitimately better responses.

Detectors for stylistic leakage can miss subtle semantic alignment that still biases judges.

Core Entities

Models

GPT-4o-2024-11-20, Gemini-1.5-flash, LLaMA-3.3-70B-Instruct-turbo, Mistral-7B-v0.1, Qwen-2.5-14B, Claude-3.5-Sonnet, Qwen-3-8B

Metrics

Preference Leakage Score (PLS), Error Bias

Datasets

Ultrafeedback, OASST, LIMA, MOSS

Benchmarks

Arena-Hard, AlpacaEval 2.0, PPE, MTBench, Human Preference

Context Entities

Models

Vicuna, Alpaca, GPT-3.5-turbo, Claude-3.5, Gemini-2.0

Metrics

win-rate, Spearman correlation (reported for Arena-Hard)

Datasets

Arena-Hard (m-ARENAHARD Chinese), XALPACAEVAL Chinese

Benchmarks

LMArena, leaderboards referenced in Section 5
