Overview
Controlled experiments (100 pairs per dataset, two judge models, deterministic decoding) support the claims, but the evidence is limited to two datasets and two judge models.
Citations: 0
Evidence Strength: 0.85
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 20%
Production readiness: 30%
Novelty: 35%
Why It Matters For Business
If you use LLMs to auto-evaluate content (summaries, answers, stories), simple labels like 'written recently' or 'by an expert' can shift scores substantially. Decisions based on such automated judgments can be biased in ways the model's own explanations do not reveal.
Summary TLDR
When asked to compare two candidate outputs, popular LLMs (GPT-4o and Gemini-2.5-Flash) shift their choices based on simple labels for author identity (Expert/Human/LLM/Unknown) and time (New/Old). These shifts are measurable (up to +30% for GPT-4o on ELI5), yet the models' written justifications never mention the labels, which indicates unfaithful reasoning. As a result, automatic LLM evaluators can be shortcut-prone and unreliable unless their sensitivity to surface cues is checked.
Problem Statement
LLMs are used to judge outputs from other systems. A faithful judge should decide based only on content quality. The paper asks: do simple cue labels (who wrote the response, when it was written) change LLM verdicts, and do the models acknowledge those cues in their explanations?
Main Contribution
Introduce a controlled test that attaches simple provenance (HUMAN/EXPERT/LLM/UNKNOWN) and recency (OLD 1950 / NEW 2025) labels to candidate responses while keeping content fixed.
Measure how labels shift binary pairwise judgments (Verdict Shift Rate, VSR) and whether models mention the labels in their justifications (Cue Acknowledgment Rate, CAR).
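The two metrics above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes VSR is the change in how often the judge selects the labeled response relative to the no-cue condition, and that CAR is the share of justifications containing a cue keyword. The `Judgment` record, the function names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    chose_target: bool   # True if the judge picked the response that carries the cue
    justification: str   # the judge's written explanation


def selection_rate(judgments):
    """Fraction of tasks where the judge picked the target response."""
    return sum(j.chose_target for j in judgments) / len(judgments)


def verdict_shift_rate(baseline, cued):
    """Assumed definition: cued selection rate minus no-cue selection rate."""
    return selection_rate(cued) - selection_rate(baseline)


def cue_acknowledgment_rate(cued, cue_terms):
    """Assumed definition: share of justifications mentioning any cue term.

    Plain substring matching is crude (e.g. "new" also matches "renew");
    a real check would need more careful matching.
    """
    terms = [t.lower() for t in cue_terms]
    hits = sum(any(t in j.justification.lower() for t in terms) for j in cued)
    return hits / len(cued)


# Toy data: the target is chosen in 5 of 10 tasks without labels and in
# 8 of 10 tasks with a NEW label, and no justification mentions the label.
baseline = [Judgment(i < 5, "Clearer explanation.") for i in range(10)]
cued = [Judgment(i < 8, "More complete response.") for i in range(10)]

vsr = verdict_shift_rate(baseline, cued)
car = cue_acknowledgment_rate(cued, ["2025", "new", "recent"])
print(f"VSR = {vsr:+.0%}, CAR = {car:.0%}")
```

A large VSR combined with a CAR near zero is the pattern the summary describes: the verdict moves, but the explanation does not say why.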
Key Findings
Recency labels cause consistent selection shifts toward 'New' responses.
Provenance labels create a consistent trust hierarchy: EXPERT > HUMAN > LLM > UNKNOWN.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Recency VSR (ELI5) | +30% (GPT-4o); +16% (Gemini-2.5-Flash) | no-cue condition / swapped labels | +30% (GPT-4o vs OLD), +16% (Gemini) | ELI5 | Table 3; Figure 1 | Table 3, Figure 1 |
| Recency VSR (LitBench) | +16% (GPT-4o); +4% (Gemini-2.5-Flash) | no-cue condition / swapped labels | +16% (GPT-4o), +4% (Gemini) | LitBench | Table 5; Section 3 | Table 5 |
What To Try In 7 Days
Run a quick cue-sensitivity test: take 50 existing pairwise comparisons and swap simple labels (author/time) to measure VSR for your chosen judge model.
Blind provenance and timestamps in evaluation prompts. Re-run evaluations and compare selection rates to detect labeling bias.
Treat model explanations skeptically: add perturbation-based checks (swap labels, shuffle positions) rather than trusting rationales.
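A cue-sensitivity check along these lines could start from a small prompt builder that produces a blind condition and two label assignments for the same pair. The sketch below is only an illustration: the bracketed label format, the prompt wording, and the function names are assumptions, not the prompts used in the paper, and the call to the judge model is left out.

```python
def build_prompt(question, resp_a, resp_b, label_a=None, label_b=None):
    """Format a pairwise comparison; a label of None omits the cue (blind condition)."""
    def block(name, text, label):
        header = f"Response {name}" + (f" [{label}]" if label else "")
        return f"{header}:\n{text}"

    return (
        f"Question: {question}\n\n"
        f"{block('A', resp_a, label_a)}\n\n"
        f"{block('B', resp_b, label_b)}\n\n"
        "Which response is better? Answer with A or B, then explain."
    )


def cue_conditions(question, resp_a, resp_b, cue_pair=("NEW 2025", "OLD 1950")):
    """Return the blind prompt plus both label assignments; content stays fixed."""
    first, second = cue_pair
    return {
        "blind": build_prompt(question, resp_a, resp_b),
        "cue_on_a": build_prompt(question, resp_a, resp_b, first, second),
        "cue_on_b": build_prompt(question, resp_a, resp_b, second, first),
    }


def verdict_flipped(verdicts):
    """True if the judge's choice changed when only the labels were swapped."""
    return verdicts["cue_on_a"] != verdicts["cue_on_b"]


prompts = cue_conditions(
    "Why is the sky blue?",
    "Air molecules scatter short wavelengths more strongly.",
    "It reflects the color of the ocean.",
)
print(prompts["cue_on_a"])
```

Because the response text is identical across the three prompts, any difference in the judge's verdicts can be attributed to the labels. Swapping the A/B positions as well would add a control for position bias.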
Risks & Boundaries
Limitations
Only two judge models tested (GPT-4o, Gemini-2.5-Flash); results may differ with other models.
Each dataset subsampled to 100 pairwise tasks; sample size limits generality.
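To get a feel for what 100 tasks can and cannot show, a rate measured on that sample can be given a confidence interval. The sketch below uses the standard Wilson score interval and treats a shift rate as a simple proportion (30 of 100 verdicts), which is an assumption about how the rate is computed and not something the summary states.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical reading of a 30% rate on 100 pairwise tasks.
low, high = wilson_interval(30, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under this reading the interval spans roughly 0.22 to 0.40, so the direction of a 30% effect is clear, while smaller effects measured on the same sample size are much less certain.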
When Not To Use
Do not rely solely on LLM-as-a-judge outputs for high-stakes decisions without bias checks.
Avoid using these evaluation prompts unchanged when provenance or timestamp metadata is available to the model.
Failure Modes
Unacknowledged cue-driven bias: verdicts driven by labels rather than content.
Unfaithful rationales: explanations that omit the true drivers of decisions.

