When synthetic training data and LLM evaluators are related, evaluators unfairly favor the student models

February 3, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

2

Authors

Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu

Links

Abstract / PDF

Why It Matters For Business

Automatic leaderboards and internal evaluations can overstate model quality when the same or related LLMs generate training data and judge models; this risks bad product decisions and misallocated resources.

Summary TLDR

The paper identifies and measures "preference leakage": a bias that appears when an LLM used to generate synthetic training data (the generator) is related to the LLM used to evaluate models (the judge). This relatedness (same model, inheritance, or same family) causes judges to prefer student models trained on that synthetic data. The authors define a preference leakage score (PLS), run controlled experiments across multiple LLMs and benchmarks, and show that leakage is stronger with greater relatedness and with more synthetic data. They also test mitigation steps, of which contextual calibration works best.

Problem Statement

Using the same or related LLMs to synthesize training data and to judge model outputs can bias automatic evaluations. This "preference leakage" inflates scores for student models that inherit stylistic or formatting cues from the generator, undermining fair model comparison.

Main Contribution

Define "preference leakage": evaluators favor student models when generator and judge are related.

Introduce a measurable metric, Preference Leakage Score (PLS), for pairwise judge bias.

Run extensive experiments across multiple LLMs, benchmarks, and conditions, showing that PLS > 0 is common.

Diagnose mechanisms: stylistic/format cues drive leakage and smaller students are more affected.

Benchmark mitigation methods; contextual calibration reduces bias most effectively.
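The PLS idea above can be sketched in a few lines. This summary does not give the exact formula, so the definition below is an assumption: for each student, take the relative excess of the win rate assigned by its related judge over the average of both judges' win rates, then average the two directions. The names `WinRates`, `relative_gain`, and `preference_leakage_score`, and the win rates in `example`, are hypothetical and illustrative only; they are not numbers from the paper.

```python
from typing import Mapping, Tuple

# wr[(student, judge)] = win rate (in %) of the student trained on data from
# generator `student`, as scored by judge `judge`. Keys are hypothetical labels.
WinRates = Mapping[Tuple[str, str], float]


def relative_gain(wr: WinRates, student: str, other_judge: str) -> float:
    """Relative excess of the related judge's score over the two-judge average."""
    own = wr[(student, student)]        # student judged by its own generator
    cross = wr[(student, other_judge)]  # same student judged by the other model
    avg = (own + cross) / 2
    return (own - avg) / avg


def preference_leakage_score(wr: WinRates, a: str, b: str) -> float:
    """Symmetric PLS for the pair (a, b), in percent (assumed definition)."""
    return 100 * (relative_gain(wr, a, b) + relative_gain(wr, b, a)) / 2


# Made-up win rates for illustration only.
example = {
    ("A", "A"): 60.0, ("A", "B"): 40.0,
    ("B", "B"): 55.0, ("B", "A"): 45.0,
}

if __name__ == "__main__":
    print(round(preference_leakage_score(example, "A", "B"), 1))
```

Under this assumed definition, a score near zero means the related judge and the unrelated judge score each student about the same, and a positive score means each judge rates its own student higher than the other judge does.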

Key Findings

Preference leakage creates measurable bias in LLM judges.

Numbers: PLS averages 23.6% for Mistral with GPT-4o & Gemini (Table 1)

Degree of relatedness predicts leakage strength.

Numbers: Same-model avg PLS 23.6%; same-family (same series) avg PLS 8.9%; different-series 2.8% (Table 2)

More synthetic data increases leakage linearly.

Numbers: PLS rises with synthetic-data fraction (10%→70%), no clear safe threshold (Figure 2b)

Learning method changes leakage magnitude.

Numbers: SFT avg PLS 23.6%; DPO 5.2%; ICL -2.7% (Table 3)

Surface-level cues drive much of the leakage.

Numbers: Removing style/format reduced PLS from 17.5% to ~9% (Table 6)

Contextual calibration best mitigates leakage on human-labeled data.

Numbers: Error Bias fell from 17.8 to 7.3 with contextual calibration (Table 7)
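For readers unfamiliar with contextual calibration, the general idea can be sketched as follows: measure the judge's preference on a content-free input, where neither option should win, and divide that prior out of its real verdicts. This is a minimal sketch of the generic technique, not the paper's exact setup, which this summary does not describe. The function name `calibrate` and the probabilities used below are hypothetical.

```python
from typing import List, Sequence


def calibrate(probs: Sequence[float],
              content_free_probs: Sequence[float]) -> List[float]:
    """Divide out the judge's prior (measured on a content-free input), renormalize."""
    if len(probs) != len(content_free_probs):
        raise ValueError("probability vectors must have the same length")
    # Guard against division by zero for options the judge never picks.
    scaled = [p / max(q, 1e-12) for p, q in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


if __name__ == "__main__":
    # Hypothetical judge output: P(first answer wins), P(second answer wins).
    raw = [0.7, 0.3]
    # Hypothetical verdict on a content-free pair, revealing a tilt toward the first.
    prior = [0.6, 0.4]
    print(calibrate(raw, prior))
```

When the content-free verdict is uniform, the calibrated probabilities equal the raw ones. When the judge leans toward one side by default, that side's calibrated probability shrinks.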

Results

Preference Leakage Score (example)

Value: Mistral (GPT-4o & Gemini) avg PLS = 23.6%

Preference Leakage Score (example)

Value: Qwen-2.5 (GPT-4o & Gemini) avg PLS = 27.9%

PLS by learning method

Value: SFT 23.6% | DPO 5.2% | ICL -2.7%

Baseline: SFT

Mitigation (Error Bias)

Value: Contextual calibration Error Bias = 7.3 (down from 17.8)

Baseline: Base Error Bias = 17.8

Who Should Care

What To Try In 7 Days

Check evaluator vs generator lineage: avoid same-family judges for models trained on synthetic data.

Run a small PLS check: on the same set of response pairs, compare the verdicts of a judge related to the generator against those of an unrelated judge.

Paraphrase or normalize candidate outputs before automated judging to cut stylistic bias quickly.
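The lineage check and the style normalization step above can be prototyped quickly. The sketch below is illustrative only: the `MODEL_FAMILY` table is a hypothetical placeholder to be replaced with your own provenance records, and `normalize_style` is a crude regex-based Markdown stripper, which is one possible stand-in for paraphrasing and not the method used in the paper.

```python
import re

# Hypothetical lineage table; replace with your own provenance records.
MODEL_FAMILY = {
    "gpt-4o": "openai-gpt",
    "gpt-3.5-turbo": "openai-gpt",
    "gemini-1.5-flash": "google-gemini",
    "gemini-2.0": "google-gemini",
}


def is_related(generator: str, judge: str) -> bool:
    """True when generator and judge are the same model or share a known family."""
    if generator == judge:
        return True
    family = MODEL_FAMILY.get(generator)
    return family is not None and family == MODEL_FAMILY.get(judge)


def normalize_style(text: str) -> str:
    """Strip common Markdown styling so the judge sees fewer surface cues."""
    # Remove heading markers such as "## ".
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Remove bullet and numbered-list markers.
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    # Remove bold markers and inline-code backticks.
    text = re.sub(r"\*\*|__|`", "", text)
    # Collapse runs of spaces and blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


if __name__ == "__main__":
    print(is_related("gpt-4o", "gemini-1.5-flash"))
    print(normalize_style("## Title\n\n- **Bold** point\n1. `code` item"))
```

Regex stripping removes formatting but leaves wording and tone untouched, so it addresses only part of the stylistic signal; a paraphrasing pass with an unrelated model would go further at higher cost.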

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use a subset of judge families and pairwise benchmarks; other judges may behave differently.
  • PLS focuses on pairwise settings; multi-judge aggregation effects need more study.
  • Real-world leaderboards lack full provenance metadata, limiting large-scale correction tests.

When Not To Use

  • When all evaluations are human-only and not automated.
  • When generator and judge models are provably independent and vetted.
  • When the application tolerates stylistic preference (e.g., branded voice checks).

Failure Modes

  • Calibration may overcorrect and penalize legitimately better responses.
  • Detectors for stylistic leakage can miss subtle semantic alignment that still biases judges.
  • Mitigations tuned on one benchmark may not generalize to different tasks or languages.

Core Entities

Models

  • GPT-4o-2024-11-20
  • Gemini-1.5-flash
  • LLaMA-3.3-70B-Instruct-Turbo
  • Mistral-7B-v0.1
  • Qwen-2.5-14B
  • Claude-3.5-Sonnet
  • Qwen-3-8B

Metrics

  • Preference Leakage Score (PLS)
  • Error Bias

Datasets

  • UltraFeedback
  • OASST
  • LIMA
  • MOSS

Benchmarks

  • Arena-Hard
  • AlpacaEval 2.0
  • PPE
  • MT-Bench
  • Human Preference

Context Entities

Models

  • Vicuna
  • Alpaca
  • GPT-3.5-turbo
  • Claude-3.5
  • Gemini-2.0

Metrics

  • win-rate
  • Spearman correlation (reported for Arena-Hard)

Datasets

  • Arena-Hard (m-ARENAHARD Chinese)
  • XALPACAEVAL Chinese

Benchmarks

  • LMArena
  • leaderboards referenced in Section 5