LLM judges favor 'new' and 'expert' labels but never admit it.

Overview

Decision Snapshot: Needs Validation

Strong controlled signals (100 pairs, two models, deterministic decoding) support the claims, but scale is limited to two datasets and two judge models.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 35%

Authors

Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

Links

Abstract / PDF / Data

Why It Matters For Business

If you use LLMs to auto-evaluate content (summaries, answers, stories), simple labels like 'written recently' or 'by an expert' can shift scores substantially. Decisions based on such automated judgments can be biased without visible reasons.

Who Should Care

Product Manager, ML Engineer, CTO

Summary TLDR

When asked to compare two candidate outputs, popular LLMs (GPT-4o and Gemini-2.5-Flash) shift their choices based on simple labels for author identity (Expert/Human/LLM/Unknown) and time (New/Old). The shifts are measurable (VSR up to +30%), yet the models' written justifications never mention the labels, so the stated reasoning is not faithful to what drove the verdict. Result: automatic LLM evaluators can be shortcut-prone and unreliable unless their sensitivity to surface cues is checked.

Problem Statement

LLMs are used to judge outputs from other systems. A faithful judge should decide based only on content quality. The paper asks: do simple cue labels (who wrote the response, when it was written) change LLM verdicts, and do the models acknowledge those cues in their explanations?

Main Contribution

Introduce a controlled test that attaches simple provenance (HUMAN/EXPERT/LLM/UNKNOWN) and recency (OLD 1950 / NEW 2025) labels to candidate responses while keeping content fixed.

Measure how labels shift binary pairwise judgments (Verdict Shift Rate, VSR) and whether models mention the labels in their justifications (Cue Acknowledgment Rate, CAR).
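The two metrics can be sketched in a few lines. This summary does not give the paper's exact formulas, so the definitions below are assumptions: VSR is taken as the selection rate of a response when it carries the cue minus its selection rate in the baseline condition, and CAR as the share of justifications that mention any cue term. The names `Trial`, `verdict_shift_rate`, and `cue_acknowledgment_rate` are hypothetical, and the data is a toy example, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One pairwise judgment (hypothetical record layout)."""
    picked_target: bool   # True if the judge chose the response that carries the cue
    justification: str    # the judge's written explanation


def selection_rate(trials):
    """Fraction of trials in which the target response was chosen."""
    return sum(t.picked_target for t in trials) / len(trials)


def verdict_shift_rate(baseline, cued):
    """Assumed VSR: selection rate with the cue minus selection rate without it."""
    return selection_rate(cued) - selection_rate(baseline)


def cue_acknowledgment_rate(cued, cue_terms):
    """Assumed CAR: share of justifications that mention any cue term."""
    terms = [term.lower() for term in cue_terms]
    hits = sum(any(term in t.justification.lower() for term in terms) for t in cued)
    return hits / len(cued)


# Toy data for illustration only.
baseline = [Trial(True, "Clearer answer."), Trial(False, "More detail in the other."),
            Trial(False, "Other is more accurate."), Trial(True, "Better structure.")]
cued = [Trial(True, "Clearer answer."), Trial(True, "More complete."),
        Trial(False, "Other is more accurate."), Trial(True, "Better structure.")]

vsr = verdict_shift_rate(baseline, cued)
car = cue_acknowledgment_rate(cued, ["expert", "2025", "newer"])
print(f"VSR = {vsr:+.0%}, CAR = {car:.0%}")  # VSR = +25%, CAR = 0%
```

A positive VSR paired with a CAR of zero is the pattern the summary describes: the verdict moves with the label, but the explanation never says so. Substring matching is a crude way to detect acknowledgment; a stricter check would need manual review or a classifier.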

Key Findings

Recency labels cause consistent selection shifts toward 'New' responses.

Numbers: GPT-4o VSR +30% on ELI5; Gemini +16% on ELI5; GPT-4o +16% on LitBench; Gemini +4% on LitBench

Practical Use: Don't trust an LLM judge that prefers newer-labeled outputs; add controls (swap labels, remove timestamps) before using LLM judgments in evaluations.

Evidence Ref: Figure 1; Table 3 (ELI5); Table 5 (LitBench)

Provenance labels create a consistent trust hierarchy: EXPERT > HUMAN > LLM > UNKNOWN.

Numbers: Example: GPT-4o Human-Unknown VSR +7% (ELI5); +14% (LitBench); Expert-Unknown VSR +18% (ELI5)

Practical Use: Models favor outputs labeled as from experts or humans. When using LLM judges, blind provenance (hide author/source) or randomize labels to avoid bias.

Evidence Ref: Table 1; Table 2 (ELI5); Table 4 (LitBench)
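Blinding can be as simple as stripping metadata before the judge sees a response. The sketch below assumes labels appear as bracketed tags such as `[Author: EXPERT]` or `[Written: 2025]`; that tag format and the `blind` helper are hypothetical, since this summary does not specify the paper's prompt layout.

```python
import re

# Hypothetical tag format: "[Author: EXPERT]" or "[Written: 2025]".
CUE_TAG = re.compile(r"\[(?:Author|Written)\s*:\s*[^\]]*\]\s*", re.IGNORECASE)


def blind(response: str) -> str:
    """Remove provenance and recency tags, leaving the content untouched."""
    return CUE_TAG.sub("", response).strip()


labeled = "[Author: EXPERT] [Written: 2025] Rainbows form when light refracts in droplets."
print(blind(labeled))  # Rainbows form when light refracts in droplets.
```

In a real pipeline the pattern would need to match whatever metadata fields the evaluation prompts actually carry.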

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Recency VSR (ELI5)	+30% (GPT-4o); +16% (Gemini-2.5-Flash)	no-cue condition / swapped labels	+30% (GPT-4o vs OLD), +16% (Gemini)	ELI5	Table 3; Figure 1	Table 3, Figure 1
Recency VSR (LitBench)	+16% (GPT-4o); +4% (Gemini-2.5-Flash)	no-cue condition / swapped labels	+16% (GPT-4o), +4% (Gemini)	LitBench	Table 5; Section 3	Table 5

What To Try In 7 Days

Run a quick cue-sensitivity test: take 50 existing pairwise comparisons and swap simple labels (author/time) to measure VSR for your chosen judge model.

Blind provenance and timestamps in evaluation prompts. Re-run evaluations and compare selection rates to detect labeling bias.

Treat model explanations skeptically: add perturbation-based checks (swap labels, shuffle positions) rather than trusting rationales.
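The checks above can be sketched as a small harness. Here `judge` is any callable that takes a prompt string and returns "A" or "B"; the two stub judges are placeholders, not real model calls. The prompt wording, the bracketed label format, and the flip-rate measure are assumptions for illustration, and the paper's VSR may be computed differently.

```python
def build_prompt(question, resp_a, resp_b, label_a, label_b):
    """Attach a label to each response; the response text itself is unchanged."""
    return (f"Question: {question}\n"
            f"Response A [{label_a}]: {resp_a}\n"
            f"Response B [{label_b}]: {resp_b}\n"
            "Which response is better? Answer A or B.")


def cue_sensitivity(pairs, judge, cue="NEW", control="OLD"):
    """Judge each pair twice with the labels swapped; return the flip rate.

    A judge that only reads content gives the same verdict both times, so any
    flip is attributable to the label.
    """
    flips = 0
    for question, resp_a, resp_b in pairs:
        first = judge(build_prompt(question, resp_a, resp_b, cue, control))
        second = judge(build_prompt(question, resp_a, resp_b, control, cue))
        flips += first != second
    return flips / len(pairs)


def biased_stub_judge(prompt):
    """Placeholder judge that always picks whichever response is tagged NEW."""
    return "A" if "Response A [NEW]" in prompt else "B"


def content_only_stub_judge(prompt):
    """Placeholder judge that ignores labels and always answers A."""
    return "A"


# 50 synthetic pairs stand in for existing pairwise comparisons.
pairs = [(f"question {i}", f"answer a{i}", f"answer b{i}") for i in range(50)]
print(cue_sensitivity(pairs, biased_stub_judge))        # 1.0
print(cue_sensitivity(pairs, content_only_stub_judge))  # 0.0
```

Replacing a stub with a wrapper around a real judge model, run with deterministic decoding, gives a first estimate of label sensitivity. The same two-pass pattern works for position bias: swap the order of the two responses instead of the labels and check whether the verdict follows the content.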

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

ELI5 (ACL 2019): https://arxiv.org/abs/2507.00769

Risks & Boundaries

Limitations

Only two judge models tested (GPT-4o, Gemini-2.5-Flash); results may differ with other models.

Each dataset subsampled to 100 pairwise tasks; sample size limits generality.
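To make the sample-size limitation concrete, the sketch below computes a 95% Wilson score interval for a proportion observed over 100 tasks. Treating a reported rate as a simple binomial proportion is an assumption made here for illustration; the paper may quantify uncertainty differently, and the 30-of-100 input is an example count, not a figure taken from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


low, high = wilson_interval(30, 100)
print(f"{low:.3f} to {high:.3f}")
```

With 100 tasks the interval runs from roughly 0.22 to 0.40, which is wide enough that smaller effects, such as single-digit shifts, are hard to separate from noise at this scale.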

When Not To Use

Do not rely solely on LLM-as-a-judge outputs for high-stakes decisions without bias checks.

Avoid using these evaluation prompts unchanged when provenance or timestamp metadata is available to the model.

Failure Modes

Unacknowledged cue-driven bias: verdicts driven by labels, not content.

Unfaithful rationales: explanations that omit the true drivers of decisions.

Core Entities

Models

GPT-4o, Gemini-2.5-Flash

Metrics

Verdict Shift Rate (VSR), Cue Acknowledgment Rate (CAR)

Datasets

ELI5, LitBench

Benchmarks

LitBench

Context Entities

Models

general-purpose conversational LLMs

Metrics

selection rate, first-response selection rate

Datasets

long-form QA (ELI5), creative writing (LitBench)
