LLM judges favor 'new' and 'expert' labels but never admit it.

September 30, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

Strong controlled signals (100 pairs, two models, deterministic decoding) support the claims, but scale is limited to two datasets and two judge models.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 35%

Authors

Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

Why It Matters For Business

If you use LLMs to auto-evaluate content (summaries, answers, stories), simple labels like 'written recently' or 'by an expert' can shift scores substantially. Decisions based on such automated judgments can be biased without visible reasons.

Who Should Care

Summary TLDR

When asked to compare two candidate outputs, popular LLMs (GPT-4o and Gemini-2.5-Flash) shift their choices based on simple labels for author identity (Expert/Human/LLM/Unknown) and time (New/Old). The shifts are measurable (VSR up to +30%, for GPT-4o with recency labels on ELI5), yet the models' written justifications never mention the labels, which indicates unfaithful reasoning. Result: automatic LLM evaluators can be shortcut-prone and unreliable unless their sensitivity to surface cues is checked.

Problem Statement

LLMs are used to judge outputs from other systems. A faithful judge should decide based only on content quality. The paper asks: do simple cue labels (who wrote the response, when it was written) change LLM verdicts, and do the models acknowledge those cues in their explanations?

Main Contribution

Introduce a controlled test that attaches simple provenance (HUMAN/EXPERT/LLM/UNKNOWN) and recency (OLD 1950 / NEW 2025) labels to candidate responses while keeping content fixed.

Measure how labels shift binary pairwise judgments (Verdict Shift Rate, VSR) and whether models mention the labels in their justifications (Cue Acknowledgment Rate, CAR).
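The two metrics above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes VSR is the signed difference in how often a given response is selected with the cue versus in a baseline condition, and it approximates CAR with a crude case-insensitive keyword match. The function names and the "A"/"B" verdict encoding are hypothetical.

```python
from typing import Sequence


def selection_rate(verdicts: Sequence[str], target: str = "A") -> float:
    """Fraction of pairwise verdicts that pick `target`."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    return sum(v == target for v in verdicts) / len(verdicts)


def verdict_shift_rate(cued: Sequence[str], baseline: Sequence[str],
                       target: str = "A") -> float:
    """Assumed VSR: selection rate of `target` with the cue minus the baseline rate."""
    return selection_rate(cued, target) - selection_rate(baseline, target)


def cue_acknowledgment_rate(justifications: Sequence[str],
                            cue_terms: Sequence[str]) -> float:
    """Assumed CAR proxy: share of justifications containing any cue term."""
    if not justifications:
        raise ValueError("justifications must be non-empty")
    terms = [t.lower() for t in cue_terms]
    hits = sum(any(t in j.lower() for t in terms) for j in justifications)
    return hits / len(justifications)


# Toy data: the same four pairs judged without and with a cue on response A.
baseline = ["A", "B", "B", "A"]
cued = ["A", "A", "B", "A"]
print(verdict_shift_rate(cued, baseline))  # 0.25

notes = ["Response A is clearer.", "Response A is more complete."]
print(cue_acknowledgment_rate(notes, ["expert", "2025", "1950"]))  # 0.0
```

Substring matching will over-count short terms (for example "old" inside "bold"), so a real check would likely need word-boundary matching or manual review of the flagged justifications.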

Key Findings

Recency labels cause consistent selection shifts toward 'New' responses.

Numbers: GPT-4o VSR +30% on ELI5; Gemini +16% on ELI5; GPT-4o +16% on LitBench; Gemini +4% on LitBench

Practical Use: Don't trust an LLM judge that prefers newer-labeled outputs; add controls (swap labels, remove timestamps) before using LLM judgments in evaluations.

Evidence Ref: Figure 1; Table 3 (ELI5); Table 5 (LitBench)

Provenance labels create a consistent trust hierarchy: EXPERT > HUMAN > LLM > UNKNOWN.

Numbers (example): GPT-4o Human-Unknown VSR +7% (ELI5), +14% (LitBench); Expert-Unknown VSR +18% (ELI5)

Practical Use: Models favor outputs labeled as from experts or humans. When using LLM judges, blind provenance (hide author/source) or randomize labels to avoid bias.

Evidence Ref: Table 1; Table 2 (ELI5); Table 4 (LitBench)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Recency VSR (ELI5) | +30% (GPT-4o); +16% (Gemini-2.5-Flash) | no-cue condition / swapped labels | +30% (GPT-4o vs OLD), +16% (Gemini) | ELI5 | Table 3; Figure 1 | Table 3, Figure 1
Recency VSR (LitBench) | +16% (GPT-4o); +4% (Gemini-2.5-Flash) | no-cue condition / swapped labels | +16% (GPT-4o), +4% (Gemini) | LitBench | Table 5; Section 3 | Table 5

What To Try In 7 Days

Run a quick cue-sensitivity test: take 50 existing pairwise comparisons and swap simple labels (author/time) to measure VSR for your chosen judge model.

Blind provenance and timestamps in evaluation prompts. Re-run evaluations and compare selection rates to detect labeling bias.

Treat model explanations skeptically: add perturbation-based checks (swap labels, shuffle positions) rather than trusting rationales.
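The label-swap check in the steps above might look like the following sketch. Everything here is an assumption made for illustration: the prompt template, the bracketed label format, and the `Judge` callable (a function that takes a prompt and returns "A" or "B") stand in for whatever judge model and prompt you actually use. The stub judges exist only to show the two extremes.

```python
from typing import Callable, List, Tuple

Item = Tuple[str, str, str]    # (question, response_a, response_b)
Judge = Callable[[str], str]   # hypothetical interface: prompt -> "A" or "B"


def build_prompt(item: Item, label_a: str, label_b: str) -> str:
    """Hypothetical template: content is fixed, only the labels change."""
    question, a, b = item
    return (
        f"Question: {question}\n\n"
        f"Response A [{label_a}]:\n{a}\n\n"
        f"Response B [{label_b}]:\n{b}\n\n"
        "Which response is better? Answer with A or B."
    )


def swap_test(judge: Judge, items: List[Item],
              cue: str = "NEW", counter_cue: str = "OLD") -> float:
    """Rate of picking A when A carries `cue` minus the rate when B carries it."""
    if not items:
        raise ValueError("items must be non-empty")
    picks_cue_on_a = 0
    picks_cue_on_b = 0
    for item in items:
        picks_cue_on_a += int(judge(build_prompt(item, cue, counter_cue)) == "A")
        picks_cue_on_b += int(judge(build_prompt(item, counter_cue, cue)) == "A")
    return (picks_cue_on_a - picks_cue_on_b) / len(items)


# Stub judges for demonstration only; replace with a call to your judge model.
def label_following_judge(prompt: str) -> str:
    """Always picks whichever response is labeled NEW."""
    return "A" if "Response A [NEW]" in prompt else "B"


def label_blind_judge(prompt: str) -> str:
    """Ignores labels entirely and always picks A."""
    return "A"


items = [("Why is the sky blue?", "Rayleigh scattering.", "Because of the ocean.")] * 5
print(swap_test(label_following_judge, items))  # 1.0 (fully label-driven)
print(swap_test(label_blind_judge, items))      # 0.0 (no label sensitivity)
```

A shift near zero suggests the judge is not reacting to the label; a large shift in either direction is a reason to blind or randomize that label before trusting the verdicts.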

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

Risks & Boundaries

Limitations

Only two judge models tested (GPT-4o, Gemini-2.5-Flash); results may differ with other models.

Each dataset subsampled to 100 pairwise tasks; sample size limits generality.
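To get a rough feel for what a 100-pair sample can resolve, the sketch below computes an approximate 95% half-width for a selection rate. It is a simplification and not an analysis from the paper: it treats each verdict as an independent binomial trial and ignores the paired design, for which a paired test such as McNemar's would be more appropriate.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width for a proportion p estimated from n trials."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) at a few sample sizes.
for n in (50, 100, 400):
    print(n, round(wald_half_width(0.5, n), 3))
# 50 0.139
# 100 0.098
# 400 0.049
```

Under this rough approximation, a selection rate measured on 100 pairs carries an uncertainty of about ±10 percentage points, so single-digit shifts may need a larger sample before they can be separated from noise.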

When Not To Use

Do not rely solely on LLM-as-a-judge outputs for high-stakes decisions without bias checks.

Avoid using these evaluation prompts unchanged when provenance or timestamp metadata is available to the model.
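One way to act on the points above is to strip provenance and recency tags before the prompt reaches the judge. The sketch below assumes the labels appear as bracketed tags such as [EXPERT] or [NEW 2025]; that format is hypothetical, so the pattern would need adapting to however your pipeline actually attaches metadata.

```python
import re

# Hypothetical tag format: [EXPERT], [HUMAN], [LLM], [UNKNOWN], [NEW 2025], [OLD 1950]
CUE_TAG = re.compile(
    r"[ \t]*\[(?:EXPERT|HUMAN|LLM|UNKNOWN|NEW|OLD)(?:\s+\d{4})?\]",
    re.IGNORECASE,
)


def blind(prompt: str) -> str:
    """Remove bracketed provenance and recency tags, leaving the content untouched."""
    return CUE_TAG.sub("", prompt)


labeled = "Response A [EXPERT]:\nText one\n\nResponse B [NEW 2025]:\nText two"
print(blind(labeled))
```

Comparing selection rates on the blinded and unblinded prompts gives a direct read on how much the metadata was influencing the judge.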

Failure Modes

Unacknowledged cue-driven bias: verdicts driven by labels, not content.

Unfaithful rationales: explanations that omit the true drivers of decisions.

Core Entities

Models

GPT-4o, Gemini-2.5-Flash

Metrics

Verdict Shift Rate (VSR), Cue Acknowledgment Rate (CAR)

Datasets

ELI5, LitBench

Benchmarks

LitBench

Context Entities

Models

general-purpose conversational LLMs

Metrics

selection rate, first-response selection rate

Datasets

long-form QA (ELI5), creative writing (LitBench)