Overview
Production Readiness
0.65
Novelty Score
0.55
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you use LLMs to auto-evaluate models or run leaderboards, protocol choice affects rank integrity: pairwise setups can be gamed by tone or verbosity, while absolute scoring gives more stable signals for instruction-following tasks.
Summary TLDR
The paper defines 'distracted evaluation': LLM evaluators favoring irrelevant style features (assertiveness, verbosity, sycophancy) over the instructed criterion. Across controlled (IF-Eval-TweakSet) and real (MT-Bench) tests, adding a distractor flips pairwise preferences far more often than it flips absolute scores (≈35% vs ≈9% flip rate on the evaluated benchmarks). Pairwise judgments also rarely produce ties when responses are equal in quality, whereas absolute scoring often assigns them identical scores. Practical takeaway: prefer absolute scoring for instruction-following and low-signal tasks; use pairwise comparison carefully and test it for stylistic bias.
Problem Statement
LLM-based evaluation pipelines commonly collect either relative (pairwise) or absolute (pointwise) feedback, but little is known about how that choice itself biases judgments. The paper asks: does the feedback protocol change what LLM judges value, and can generator models exploit protocol weaknesses to inflate rankings?
Main Contribution
Define 'distracted evaluation': LLM judges favor irrelevant stylistic features (assertiveness, prolixity, sycophancy) over the instructed criterion.
Systematic experiments showing pairwise preferences are far more affected by distractors than absolute scoring, across controlled (IF-Eval-TweakSet) and natural (MT-Bench) data.
Demonstrate a concrete attack: simple stylistic edits (assertiveness) to low-quality responses can raise model Elo ranks under pairwise evaluation but not under absolute scoring.
Actionable guidance: recommendations for choosing feedback protocols based on dataset signal and evaluation goals.
Key Findings
When a distractor is added, pairwise comparisons flip preferences much more often than absolute scores do (≈35% vs ≈9% on the evaluated benchmarks).
When two responses are equivalent in quality, absolute scoring produces many identical scores while pairwise judgments almost never tie.
Generator models can climb leaderboard rankings by adding stylistic distractors under pairwise evaluation.
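The flip-rate and tie-rate findings above can be reproduced on your own judge outputs with a few lines of code. The following is a minimal sketch, not the paper's implementation: the function names, the 'A'/'B'/'tie' verdict encoding, and the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple


def flip_rate(before: List[str], after: List[str]) -> float:
    """Fraction of items whose verdict ('A', 'B', or 'tie') changes
    once a distractor is added to one of the responses."""
    if len(before) != len(after) or not before:
        raise ValueError("verdict lists must be non-empty and equal in length")
    return sum(b != a for b, a in zip(before, after)) / len(before)


def preference_from_scores(score_a: float, score_b: float) -> str:
    """Turn two absolute scores into an implied preference."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def absolute_flip_rate(
    before: List[Tuple[float, float]], after: List[Tuple[float, float]]
) -> float:
    """Flip rate for absolute scoring, via the preference implied by each score pair."""
    prefs_before = [preference_from_scores(a, b) for a, b in before]
    prefs_after = [preference_from_scores(a, b) for a, b in after]
    return flip_rate(prefs_before, prefs_after)


def tie_rate(verdicts: List[str]) -> float:
    """Fraction of verdicts that are ties."""
    if not verdicts:
        raise ValueError("verdict list must be non-empty")
    return sum(v == "tie" for v in verdicts) / len(verdicts)


# Toy data (hypothetical, not from the paper).
pairwise_before = ["A", "A", "B", "A"]
pairwise_after = ["B", "A", "B", "B"]  # response B received the distractor
absolute_before = [(6, 3), (5, 5), (2, 6), (7, 4)]
absolute_after = [(6, 4), (5, 5), (2, 6), (7, 5)]

print(flip_rate(pairwise_before, pairwise_after))
print(absolute_flip_rate(absolute_before, absolute_after))
```

Deriving a preference from the two absolute scores keeps the protocols comparable: both reduce to the same three-way verdict, so a single flip-rate definition applies to each.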
Results
Preference flip rate (distractor added): ≈35% pairwise vs ≈9% absolute
Tie rate when responses are identical in quality
Accuracy
Who Should Care
What To Try In 7 Days
Run a small audit: take your top- and bottom-rated responses, add assertive or verbose edits, and measure how often preferences flip under your current evaluator.
Switch a subset of evaluations from pairwise to 1–7 absolute scores and compare rank stability for your models.
Add a stylistic-check filter to leaderboard submissions to detect obvious assertive/verbose edits before re-ranking.
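For the rank-stability comparison suggested above, one simple option is to rank models under each protocol and compute Kendall's tau between the two rankings. This is a sketch under stated assumptions: the model names, win rates, and mean scores are invented, and Kendall's tau is a choice made here rather than a metric taken from the paper.

```python
from itertools import combinations
from typing import Dict, List


def ranking_from_scores(scores: Dict[str, float]) -> List[str]:
    """Order models best-first; ties are broken alphabetically for determinism."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def kendall_tau(rank_x: List[str], rank_y: List[str]) -> float:
    """Kendall's tau between two strict rankings of the same models.
    +1 means identical order, -1 means fully reversed."""
    if set(rank_x) != set(rank_y) or len(rank_x) < 2:
        raise ValueError("rankings must cover the same two or more models")
    pos_x = {m: i for i, m in enumerate(rank_x)}
    pos_y = {m: i for i, m in enumerate(rank_y)}
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        # Same sign means both rankings order the pair the same way.
        if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical aggregates for four models under each protocol.
pairwise_win_rate = {"m1": 0.70, "m2": 0.55, "m3": 0.40, "m4": 0.35}
absolute_mean_score = {"m1": 5.8, "m2": 4.9, "m3": 5.1, "m4": 3.2}  # 1-7 scale

tau = kendall_tau(
    ranking_from_scores(pairwise_win_rate),
    ranking_from_scores(absolute_mean_score),
)
print(round(tau, 3))
```

Running the same comparison before and after adding stylistic edits shows which protocol's ranking moves more; a tau near 1 between the two runs indicates a stable ranking.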
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only studies two feedback protocols (pairwise and absolute), excluding n‑wise ranking and multi-dimension scoring.
- Experiments focus on instruction-following; effects on factuality or reasoning tasks are not measured.
- GPT-family models were excluded from some analyses because of logit-access and dataset-generation concerns.
- Distractors are limited to assertiveness, prolixity, and sycophancy; other stylistic or cultural traits may behave differently.
When Not To Use
- Avoid pairwise evaluations for low-signal tasks or datasets where stylistic differences are common.
- Avoid absolute scoring when fine-grained comparative distinctions are the core evaluation objective and raters are well-calibrated.
Failure Modes
- Pairwise protocols can amplify irrelevant stylistic features and force arbitrary distinctions.
- Intransitive or choice-set-sensitive preferences can arise under relative feedback, producing inconsistent rankings.
- Leaderboards based on pairwise comparisons can be gamed by optimizing tone rather than substance.
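Intransitive preferences can be checked for directly in a log of pairwise verdicts. The sketch below looks for three-way cycles (A beats B, B beats C, C beats A); the `(winner, loser)` encoding and the response IDs are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def find_three_cycles(beats: Set[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return every triple (x, y, z) where x beats y, y beats z, and z beats x.
    `beats` holds (winner, loser) pairs from pairwise judgments."""
    items = sorted({name for pair in beats for name in pair})
    cycles = []
    for a, b, c in combinations(items, 3):
        if (a, b) in beats and (b, c) in beats and (c, a) in beats:
            cycles.append((a, b, c))
        elif (a, c) in beats and (c, b) in beats and (b, a) in beats:
            cycles.append((a, c, b))
    return cycles


# Hypothetical verdicts: r1 > r2 > r3 > r1 form a cycle; r4 loses to everyone.
verdicts = {
    ("r1", "r2"), ("r2", "r3"), ("r3", "r1"),
    ("r1", "r4"), ("r2", "r4"), ("r3", "r4"),
}
print(find_three_cycles(verdicts))
```

Any non-empty result means no single ranking is consistent with all of the judge's pairwise verdicts, which is one sign that the ordering depends on the choice set.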
Core Entities
Models
- LLaMA3.2-3B-Instruct
- LLaMA3.3-70B-Instruct
- Qwen2.5-3B-Instruct
- Qwen2.5-72B-Instruct
- gpt-4-0613
- o3-mini-2025-01-31
- GPT-3.5-Mini
- GPT-4
- GPT-3.5
- Vicuna-13B
- Alpaca-13B
- LLaMA-13B
- Claude-v1
Metrics
- preference flip rate (%)
- tie rate (%)
- Accuracy
- Elo score shifts
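Elo score shifts can be computed by replaying pairwise outcomes through the standard Elo update and comparing the ratings from two runs. This summary does not state the paper's Elo configuration, so the K-factor of 32, the initial rating of 1000, and the match data below are assumptions.

```python
from typing import Dict, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(
    matches: List[Tuple[str, str, float]],
    initial: float = 1000.0,  # assumed starting rating
    k: float = 32.0,          # assumed K-factor
) -> Dict[str, float]:
    """Replay matches in order. Each match is (model_a, model_b, score_a),
    where score_a is 1.0 for an A win, 0.0 for a B win, and 0.5 for a tie."""
    ratings: Dict[str, float] = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return ratings


def elo_shift(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-model rating change between two evaluation runs."""
    return {m: after[m] - before[m] for m in before if m in after}


# Hypothetical outcomes: the weak model loses cleanly, then wins once restyled.
clean = [("strong", "weak", 1.0), ("strong", "weak", 1.0)]
restyled = [("strong", "weak", 0.0), ("strong", "weak", 0.0)]

shift = elo_shift(run_elo(clean), run_elo(restyled))
print({m: round(d, 1) for m, d in shift.items()})
```

Because each update adds to one rating exactly what it subtracts from the other, the shifts across the two models sum to zero, which makes a gain for the restyled model directly visible as a loss for its opponent.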
Datasets
- IF-Eval-TweakSet
- MT-Bench
- UltraFeedback
- HelpSteer2
- IF-Eval
Benchmarks
- MT-Bench
- IF-Eval

