LLM judges favor 'new' and 'expert' labels but never admit it.

September 30, 2025 · 8 min

Overview

Production Readiness

0.3

Novelty Score

0.35

Cost Impact Score

0.2

Citation Count

0

Authors

Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah


Why It Matters For Business

If you use LLMs to auto-evaluate content (summaries, answers, stories), simple labels like 'written recently' or 'by an expert' can shift scores substantially. Decisions based on such automated judgments can be biased without visible reasons.

Summary TLDR

When asked to compare two candidate outputs, popular LLMs (GPT-4o and Gemini-2.5-Flash) shift their choices based on simple labels such as author identity (Expert/Human/LLM/Unknown) and time (New/Old). These shifts are measurable (Verdict Shift Rate up to +30%), yet the models' written justifications never mention the labels, so the stated reasoning is unfaithful to what actually drove the verdict. Result: automatic LLM evaluators can be shortcut-prone and unreliable unless their sensitivity to surface cues is checked.

Problem Statement

LLMs are used to judge outputs from other systems. A faithful judge should decide based only on content quality. The paper asks: do simple cue labels (who wrote the response, when it was written) change LLM verdicts, and do the models acknowledge those cues in their explanations?

Main Contribution

Introduce a controlled test that attaches simple provenance (HUMAN/EXPERT/LLM/UNKNOWN) and recency (OLD 1950 / NEW 2025) labels to candidate responses while keeping content fixed.

Measure how labels shift binary pairwise judgments (Verdict Shift Rate, VSR) and whether models mention the labels in their justifications (Cue Acknowledgment Rate, CAR).

Run experiments on two public datasets (ELI5 for factual/explanatory QA and LitBench for creative writing) with two popular judge models (GPT-4o, Gemini-2.5-Flash) under deterministic decoding.
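The controlled test described above can be sketched as follows: the two candidate responses stay fixed, and only the one-line labels attached to them change between conditions. This is a minimal illustration, not the authors' code. The cue sentences, the prompt wording, and the names `Pair`, `build_prompt`, and `cue_conditions` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cue wording; the paper's exact sentences are not reproduced here.
PROVENANCE_CUES = {
    "HUMAN": "This response was written by a human.",
    "EXPERT": "This response was written by an expert.",
    "LLM": "This response was written by an LLM.",
    "UNKNOWN": "The author of this response is unknown.",
}
RECENCY_CUES = {
    "OLD": "This response was written in 1950.",
    "NEW": "This response was written in 2025.",
}


@dataclass(frozen=True)
class Pair:
    question: str
    response_a: str
    response_b: str


def build_prompt(pair, cue_a=None, cue_b=None):
    """Render a pairwise judging prompt; each cue is an optional one-line label."""
    def block(name, text, cue):
        header = f"Response {name}"
        if cue:
            header += f" [{cue}]"
        return f"{header}:\n{text}"

    return "\n\n".join([
        f"Question: {pair.question}",
        block("A", pair.response_a, cue_a),
        block("B", pair.response_b, cue_b),
        "Which response is better? Answer 'A' or 'B', then justify briefly.",
    ])


def cue_conditions(pair, cues, first, second):
    """Return both label orderings for one cue pair; the content never changes."""
    return {
        f"{first}-{second}": build_prompt(pair, cues[first], cues[second]),
        f"{second}-{first}": build_prompt(pair, cues[second], cues[first]),
    }
```

Calling `cue_conditions(pair, RECENCY_CUES, "NEW", "OLD")` yields the two swapped-label prompts for one item, and `build_prompt(pair)` with no cues gives the no-cue baseline.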

Key Findings

Recency labels cause consistent selection shifts toward 'New' responses.

Numbers: GPT-4o VSR +30% on ELI5; Gemini +16% on ELI5; GPT-4o +16% on LitBench; Gemini +4% on LitBench

Provenance labels create a consistent trust hierarchy: EXPERT > HUMAN > LLM > UNKNOWN.

Numbers: Example: GPT-4o Human-Unknown VSR +7% (ELI5); +14% (LitBench); Expert-Unknown VSR +18% (ELI5)

Models' written justifications do not acknowledge the injected labels at all.

Numbers: Cue Acknowledgment Rate (CAR) = 0% across datasets and models

Sensitivity to cues varies by task and model: GPT-4o is generally more cue-sensitive than Gemini-2.5-Flash.

Numbers: GPT-4o recency VSR +30% (ELI5) vs Gemini +16%; GPT-4o shows larger provenance shifts on LitBench (e.g., +16%)

Results

Recency VSR (ELI5)

Value: +30% (GPT-4o); +16% (Gemini-2.5-Flash)

Baseline: no-cue condition / swapped labels

Recency VSR (LitBench)

Value: +16% (GPT-4o); +4% (Gemini-2.5-Flash)

Baseline: no-cue condition / swapped labels

Provenance VSR (Human vs Unknown)

Value: GPT-4o +7% (ELI5), +14% (LitBench); Gemini +3% (ELI5), +6% (LitBench)

Baseline: Unknown vs Human swap

Provenance VSR (Expert vs Unknown)

Value: GPT-4o +18% (ELI5, Expert-Unknown vs Unknown-Expert)

Baseline: Unknown vs Expert swap

Cue Acknowledgment Rate (CAR)

Value: 0%

Baseline: expected >0% if judges cite cues
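This summary does not give formulas for the two metrics, so the sketch below is one plausible reading rather than the paper's definition. It assumes VSR is the change, in percentage points, in how often response A is selected when the favored label sits on A versus when the labels are swapped. It assumes CAR is the percentage of justifications that mention any cue term, using simple keyword matching. The names `Judgment`, `selection_rate`, `verdict_shift_rate`, and `cue_acknowledgment_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    verdict: str        # "A" or "B"
    justification: str  # free-text rationale returned by the judge


def selection_rate(judgments, choice="A"):
    """Fraction of judgments whose verdict equals `choice`."""
    return sum(j.verdict == choice for j in judgments) / len(judgments)


def verdict_shift_rate(cue_on_a, cue_on_b, choice="A"):
    """Assumed VSR: selection rate of A with the favored cue on A, minus the
    rate with the labels swapped, expressed in percentage points."""
    return 100.0 * (selection_rate(cue_on_a, choice) - selection_rate(cue_on_b, choice))


def cue_acknowledgment_rate(judgments, cue_terms):
    """Assumed CAR: percentage of justifications mentioning any cue term.
    Keyword matching is a crude proxy and can miss paraphrased mentions."""
    terms = [t.lower() for t in cue_terms]
    hits = sum(any(t in j.justification.lower() for t in terms) for j in judgments)
    return 100.0 * hits / len(judgments)
```

Under these assumptions, a judge that picks A in 3 of 4 items when A carries the NEW label and in 1 of 4 items when B carries it has a VSR of +50 points. If none of its justifications contain the cue terms, its CAR is 0%. These numbers are made up to show the arithmetic and are not results from the paper.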

Who Should Care

What To Try In 7 Days

Run a quick cue-sensitivity test: take 50 existing pairwise comparisons and swap simple labels (author/time) to measure VSR for your chosen judge model.

Blind provenance and timestamps in evaluation prompts. Re-run evaluations and compare selection rates to detect labeling bias.

Treat model explanations skeptically: add perturbation-based checks (swap labels, shuffle positions) rather than trusting rationales.
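The three checks above (label swap, blinding, position swap) can share one small harness. The sketch below assumes you wrap your judge model in a callable that takes a question, two responses, and two optional labels, and returns "A" or "B". That interface and every function name here are hypothetical, and no real model API is called.

```python
from typing import Callable, Optional, Sequence, Tuple

Item = Tuple[str, str, str]  # (question, response_a, response_b)
# Hypothetical judge interface: returns "A" or "B".
Judge = Callable[[str, str, str, Optional[str], Optional[str]], str]


def pick_rate_a(judge: Judge, items: Sequence[Item],
                label_a: Optional[str], label_b: Optional[str]) -> float:
    """Fraction of items where the judge selects response A."""
    picks = [judge(q, a, b, label_a, label_b) for q, a, b in items]
    return sum(p == "A" for p in picks) / len(picks)


def label_swap_shift(judge: Judge, items: Sequence[Item],
                     favored: str, other: str) -> float:
    """Percentage-point change in A's selection rate when `favored`
    moves from response B to response A; content stays fixed."""
    with_cue_on_a = pick_rate_a(judge, items, favored, other)
    with_cue_on_b = pick_rate_a(judge, items, other, favored)
    return 100.0 * (with_cue_on_a - with_cue_on_b)


def blinded_rate(judge: Judge, items: Sequence[Item]) -> float:
    """Selection rate of A with provenance and timestamps removed."""
    return pick_rate_a(judge, items, None, None)


def position_flip_rate(judge: Judge, items: Sequence[Item]) -> float:
    """Percentage of items where the judge returns the same letter after the
    two responses trade places, i.e. it followed position, not content."""
    flips = 0
    for q, a, b in items:
        original = judge(q, a, b, None, None)
        swapped = judge(q, b, a, None, None)
        flips += original == swapped
    return 100.0 * flips / len(items)
```

A judge that ignores labels should give a `label_swap_shift` near zero, and a judge that ignores ordering should give a `position_flip_rate` near zero. With deterministic decoding the two runs are directly comparable. With sampling enabled, repeat each call several times before reading anything into a small shift.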

Reproducibility

Data Urls

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Only two judge models tested (GPT-4o, Gemini-2.5-Flash); results may differ with other models.
  • Each dataset subsampled to 100 pairwise tasks; sample size limits generality.
  • Cues are simple, short sentences; other cue formulations might produce different effects.
  • Deterministic decoding isolates cue effects but does not reflect stochastic judge deployments.

When Not To Use

  • Do not rely solely on LLM-as-a-judge outputs for high-stakes decisions without bias checks.
  • Avoid using these evaluation prompts unchanged when provenance or timestamp metadata is available to the model.

Failure Modes

  • Unacknowledged cue-driven bias: verdicts driven by labels rather than content.
  • Unfaithful rationales: explanations that omit the true drivers of decisions.
  • Domain sensitivity: provenance matters more for subjective creative tasks, recency for factual QA.

Core Entities

Models

  • GPT-4o
  • Gemini-2.5-Flash

Metrics

  • Verdict Shift Rate (VSR)
  • Cue Acknowledgment Rate (CAR)

Datasets

  • ELI5
  • LitBench

Benchmarks

  • LitBench

Context Entities

Models

  • general-purpose conversational LLMs

Metrics

  • selection rate
  • first-response selection rate

Datasets

  • long-form QA (ELI5)
  • creative writing (LitBench)