Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.45
Citation Count
12
Why It Matters For Business
If you use an LLM to grade or select outputs, it may inflate scores for outputs similar to itself, hurting fairness and enabling feedback loops where models learn from biased judgments.
Summary TLDR
LLM evaluators (GPT-4, GPT-3.5, Llama 2) both identify text they wrote and score it higher. Out of the box, GPT-4 distinguishes its summaries ~73% correctly and shows self-preference; fine-tuning on a few hundred examples raises self-recognition near 90% and increases self-preference. The two properties correlate linearly, suggesting that recognition of an output is a practical driver of biased self-evaluation. This matters for model-based benchmarks, reward models, and any system that uses an LLM to judge LLM outputs.
Problem Statement
When the same LLM is used to generate outputs and to judge them, it can favor its own outputs even when humans rate them the same. The paper asks whether that self-preference comes from the model actually recognizing its own text, and whether changing recognition changes preference.
Main Contribution
Show out-of-the-box LLMs can often identify their own summaries and tend to prefer them when evaluating quality.
Demonstrate fine-tuning boosts self-recognition (to ~90%) and that stronger recognition correlates with stronger self-preference.
Use control fine-tuning tasks and label-manipulation experiments to argue the link is not a simple artifact or reverse causation.
Discuss practical safety risks and simple countermeasures like authorship obfuscation for LLM-based evaluation pipelines.
Key Findings
Frontier LLMs show measurable self-recognition and self-preference without fine-tuning.
Fine-tuning on a small set of examples sharply increases self-recognition.
Self-recognition strength and self-preference strength are linearly correlated across conditions.
Labeling text as 'your own' flips preferences in some models via prompting.
Results
Out-of-the-box self-recognition (pairwise)
Fine-tuned self-recognition (pairwise after 500 examples)
Example-level correlation between recognizing and preferring same summary
Label-based preference flip
Who Should Care
What To Try In 7 Days
Run a quick check: prompt your evaluator to identify whether it wrote sample outputs and record recognition and preference rates.
Avoid using the same model as both generator and judge; swap in a different evaluator or mix multiple evaluators.
Mask or remove authorship signals (labels/metadata) in evaluation prompts to reduce label-driven bias and re-run key evaluations.
Reproducibility
Code Urls
Data Urls
- XSUM dataset (public)
- CNN/DailyMail dataset (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments limited to summarization; unknown generality to other tasks.
- Results provide evidence toward causality but do not prove mechanistic cause.
- Fine-tuning and prompt designs were limited; other prompts or models may behave differently.
When Not To Use
- As the sole judge for high-stakes assessments where neutral, human-calibrated scoring is required.
- When you cannot or do not want to remove authorship labels or control for evaluator-generator similarity.
Failure Modes
- Ordering bias: evaluator flips preference when option order is swapped.
- Ambiguous responses: many pairwise cases are unstable without confidence adjustment.
- Fine-tuning can create degenerate outputs (training instability) or amplify spurious cues.
Core Entities
Models
- GPT-4
- GPT-3.5 Turbo
- Llama-2-7b-chat
Metrics
- self-recognition score
- self-preference score
- Kendall's tau (example-level correlation)
Datasets
- XSUM
- CNN/DailyMail

