Overview
Production Readiness
0.3
Novelty Score
0.35
Cost Impact Score
0.2
Citation Count
0
Why It Matters For Business
If you use LLMs to auto-evaluate content (summaries, answers, stories), simple labels like 'written recently' or 'by an expert' can shift scores substantially. Decisions that rest on these automated judgments can therefore be biased without any visible trace in the model's explanation.
Summary TLDR
When asked to compare two candidate outputs, popular LLMs (GPT-4o and Gemini-2.5-Flash) shift their choices based on simple labels for author identity (Expert/Human/LLM/Unknown) and time (New/Old). These shifts are measurable (up to +30%), yet the models' written justifications never mention the labels, which indicates unfaithful reasoning. As a result, automatic LLM evaluators can be shortcut-prone and unreliable unless their sensitivity to surface cues is checked.
Problem Statement
LLMs are used to judge outputs from other systems. A faithful judge should decide based only on content quality. The paper asks: do simple cue labels (who wrote the response, when it was written) change LLM verdicts, and do the models acknowledge those cues in their explanations?
Main Contribution
Introduce a controlled test that attaches simple provenance (HUMAN/EXPERT/LLM/UNKNOWN) and recency (OLD 1950 / NEW 2025) labels to candidate responses while keeping content fixed.
Measure how labels shift binary pairwise judgments (Verdict Shift Rate, VSR) and whether models mention the labels in their justifications (Cue Acknowledgment Rate, CAR).
Run experiments on two public datasets (ELI5 for factual/explanatory QA and LitBench for creative writing) with two popular judge models (GPT-4o, Gemini-2.5-Flash) under deterministic decoding.
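The two metrics above can be sketched in a few lines. This is only an illustration under stated assumptions: VSR is taken here as the fraction of pairs whose verdict differs between an unlabeled baseline run and a labeled run, and CAR as the fraction of justifications that mention any cue term. The paper's exact definitions may differ, and the function names and the `CUE_TERMS` list are hypothetical.

```python
from typing import Sequence

# Hypothetical cue vocabulary, drawn from the labels named in this summary.
# Plain substring matching is crude: words like "human" can appear for
# unrelated reasons, so treat the result as an upper bound on acknowledgment.
CUE_TERMS = ("expert", "human", "llm", "unknown", "1950", "2025")


def verdict_shift_rate(baseline: Sequence[str], cued: Sequence[str]) -> float:
    """Fraction of pairs whose verdict ("A" or "B") changes once a cue is attached."""
    if len(baseline) != len(cued):
        raise ValueError("baseline and cued verdict lists must have equal length")
    if not baseline:
        return 0.0
    changed = sum(b != c for b, c in zip(baseline, cued))
    return changed / len(baseline)


def cue_acknowledgment_rate(
    justifications: Sequence[str],
    cue_terms: Sequence[str] = CUE_TERMS,
) -> float:
    """Fraction of justifications that mention at least one cue term."""
    if not justifications:
        return 0.0
    hits = sum(
        any(term in text.lower() for term in cue_terms) for text in justifications
    )
    return hits / len(justifications)
```

A high VSR combined with a CAR near zero is the pattern the summary describes: verdicts move with the labels while the explanations stay silent about them.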
Key Findings
Recency labels cause consistent selection shifts toward 'New' responses.
Provenance labels create a consistent trust hierarchy: EXPERT > HUMAN > LLM > UNKNOWN.
Models' written justifications do not acknowledge the injected labels at all.
Sensitivity to cues varies by task and model: GPT-4o is generally more cue-sensitive than Gemini-2.5-Flash.
Results
Recency VSR (ELI5)
Recency VSR (LitBench)
Provenance VSR (Human vs Unknown)
Provenance VSR (Expert vs Unknown)
Cue Acknowledgment Rate (CAR)
Who Should Care
What To Try In 7 Days
Run a quick cue-sensitivity test: take 50 existing pairwise comparisons and swap simple labels (author/time) to measure VSR for your chosen judge model.
Blind provenance and timestamps in evaluation prompts. Re-run evaluations and compare selection rates to detect labeling bias.
Treat model explanations skeptically: add perturbation-based checks (swap labels, shuffle positions) rather than trusting rationales.
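The cue-sensitivity test suggested above might look like the following sketch. Everything here is an assumption made for illustration: the prompt template, the label wording, and the `Judge` interface (a callable that takes a prompt and returns "A" or "B") are hypothetical, and `toy_judge` is a deliberately biased stand-in rather than a real model. In practice you would wrap your own judge model in a function with the same shape.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, str, str]   # (question, response_a, response_b)
Judge = Callable[[str], str]  # hypothetical interface: prompt in, "A" or "B" out


def _tag(label: str) -> str:
    """Render an optional cue label as a bracketed suffix."""
    return f" [{label}]" if label else ""


def build_prompt(task: Task, label_a: str = "", label_b: str = "") -> str:
    """Format a pairwise comparison; content is fixed, only the labels vary."""
    question, resp_a, resp_b = task
    return (
        f"Question: {question}\n"
        f"Response A{_tag(label_a)}: {resp_a}\n"
        f"Response B{_tag(label_b)}: {resp_b}\n"
        "Which response is better? Answer A or B."
    )


def measure_vsr(judge: Judge, tasks: List[Task], label_a: str, label_b: str) -> float:
    """Share of tasks whose verdict changes between unlabeled and labeled prompts."""
    if not tasks:
        return 0.0
    shifted = 0
    for task in tasks:
        baseline = judge(build_prompt(task))
        cued = judge(build_prompt(task, label_a, label_b))
        shifted += baseline != cued
    return shifted / len(tasks)


def toy_judge(prompt: str) -> str:
    """Toy stand-in: prefers B only when B carries a 2025 label, otherwise A."""
    for line in prompt.splitlines():
        if line.startswith("Response B") and "2025" in line:
            return "B"
    return "A"


if __name__ == "__main__":
    tasks = [("Why is the sky blue?", "Short answer.", "Longer answer.")]
    print(measure_vsr(toy_judge, tasks, "Written in 1950", "Written in 2025"))
```

Running the same measurement with the two labels exchanged between A and B helps separate a label effect from a position effect.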
Reproducibility
Data Urls
- ELI5 (ACL 2019)
- https://arxiv.org/abs/2507.00769
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Only two judge models tested (GPT-4o, Gemini-2.5-Flash); results may differ with other models.
- Each dataset subsampled to 100 pairwise tasks; sample size limits generality.
- Cues are simple, short sentences; other cue formulations might produce different effects.
- Deterministic decoding isolates cue effects but does not reflect stochastic judge deployments.
When Not To Use
- Do not rely solely on LLM-as-a-judge outputs for high-stakes decisions without bias checks.
- Avoid using these evaluation prompts unchanged when provenance or timestamp metadata is available to the model.
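One way to act on the second point is to strip provenance and timestamp metadata before a record ever reaches the prompt builder. The sketch below assumes evaluation items are stored as dictionaries; the field names in `SENSITIVE_FIELDS` are hypothetical and should be adapted to your own schema.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical field names; adapt to whatever your evaluation records contain.
SENSITIVE_FIELDS: FrozenSet[str] = frozenset(
    {"author", "author_type", "source", "timestamp", "created_at"}
)


def blind_record(
    record: Dict[str, Any],
    sensitive: FrozenSet[str] = SENSITIVE_FIELDS,
) -> Dict[str, Any]:
    """Return a copy of the record without provenance or recency metadata."""
    return {key: value for key, value in record.items() if key not in sensitive}
```

This only removes structured fields. Cues embedded in the response text itself (for example, a byline or a date in the first sentence) would need a separate check.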
Failure Modes
- Unacknowledged cue-driven bias: verdicts driven by labels, not content.
- Unfaithful rationales: explanations that omit the true drivers of decisions.
- Domain sensitivity: provenance matters more for subjective creative tasks, recency for factual QA.
Core Entities
Models
- GPT-4o
- Gemini-2.5-Flash
Metrics
- Verdict Shift Rate (VSR)
- Cue Acknowledgment Rate (CAR)
Datasets
- ELI5
- LitBench
Benchmarks
- LitBench
Context Entities
Models
- general-purpose conversational LLMs
Metrics
- selection rate
- first-response selection rate
Datasets
- long-form QA (ELI5)
- creative writing (LitBench)

