Alignment makes LLM evaluators overuse certain scores; prompt score range is a cheap fix

Overview

Decision Snapshot: Needs Validation

Solid empirical analysis across multiple open models and datasets. Uses clear metrics (kurtosis, Pearson r). Limited by public-model unknowns and task scope.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Ayako Sato, Hwichan Kim, Zhousi Chen, Masato Mita, Mamoru Komachi

Why It Matters For Business

If you use LLMs to score content, alignment can make scores cluster on a few values and hide real quality differences. That reduces trust and can mislead model selection, A/B tests, and automated pipelines that depend on numeric scores.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager, CTO

Summary TLDR

When LLMs are aligned (instruction- or preference-tuned), they concentrate numeric evaluation scores on a few values (numerical bias). This hurts evaluator accuracy in many cases. Three simple fixes reduce the bias: tuning temperature, calibrating distributions, and especially changing the prompt score range. Score-range tuning is the most effective in these experiments, but it is heuristic and task-specific.

Problem Statement

LLM-as-a-judge systems output numeric quality scores. After alignment, evaluators overuse specific numeric tokens (numerical bias). That clustering reduces the evaluator's ability to distinguish different-quality inputs and can lower correlation with human judgments.

Main Contribution

Showed alignment substantially increases numerical bias in open LLM evaluators by comparing pre- and post-alignment models across MTQE, GECQE, and LCP tasks.

Quantified the link between bias strength (kurtosis) and evaluation accuracy (Pearson r), reporting a strong negative relationship.

Key Findings

Alignment (instruction/preference tuning) increases numerical bias in evaluator outputs.

Numbers: Example: Gemma En-De kurtosis 0.27 (pre) → 128.17 (post)

Practical Use: Expect aligned evaluator models to cluster scores. If you see concentrated outputs, try a less-aligned model or one of the mitigations (below).

Evidence Ref: Table 3, Figures 2–3

Stronger numerical bias correlates with lower evaluation accuracy on tested tasks.

Numbers: Kurtosis vs Pearson r correlation ≈ -0.60 (MTQE and GECQE)

Practical Use: Use kurtosis as a quick proxy for evaluator reliability when you lack gold labels. Prefer models with lower kurtosis.

Evidence Ref: Tables 3–4 (kurtosis and r), section 3.3
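The two quantities behind this finding can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function names and the toy score lists are made up, and it assumes the excess-kurtosis convention (a normal distribution scores 0), which the summary does not confirm is the paper's convention.

```python
from math import sqrt
from statistics import fmean


def excess_kurtosis(scores):
    """Population excess kurtosis: fourth central moment / variance^2, minus 3."""
    n = len(scores)
    mean = fmean(scores)
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all scores are identical")
    return m4 / m2 ** 2 - 3.0


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / sqrt(var_x * var_y)


# Made-up evaluator outputs, for illustration only (not data from the paper).
clustered = [5] * 18 + [1, 9]             # evaluator that almost always answers 5
spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 5]   # evaluator that uses the whole range

print("clustered kurtosis:", excess_kurtosis(clustered))
print("spread kurtosis:   ", excess_kurtosis(spread))
```

In practice, `pearson_r` would be applied to evaluator scores and human scores on the same items, and `excess_kurtosis` to the evaluator scores alone.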

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Kurtosis increase after alignment (example)	Gemma En-De kurtosis 0.27 (pre) → 128.17 (post)	pre-alignment kurtosis 0.27	+127.90	MTQE En-De	Table 3 shows large kurtosis increases after alignment	Table 3
Correlation between kurtosis and Pearson r	≈ -0.60	n/a	n/a	MTQE and GECQE (dataset-level)	Section 3.3 reports a -0.60 correlation indicating higher bias → lower r	Section 3.3, Tables 3–4

What To Try In 7 Days

Measure kurtosis of your evaluator's numeric outputs on a validation set; high kurtosis suggests numerical bias.

Treat the prompt score range as a tunable hyperparameter. Try at least 0–9, 1–5, and 1–100, and pick the range that lowers kurtosis and improves correlation with human labels on validation data.

Test higher temperature and generative calibration as quick, low-cost fixes and compare accuracy on held-out human labels.
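One way to set up the score-range experiment is to make the range a parameter of the prompt and of the reply parser. This is a minimal sketch under stated assumptions: `judge` is a hypothetical callable wrapping whatever model you use (prompt string in, reply string out), the prompt wording is invented here and is not the paper's prompt, and the parser assumes the first integer in the reply is the score.

```python
import re

# Candidate ranges taken from the suggestion above.
SCORE_RANGES = {"0-9": (0, 9), "1-5": (1, 5), "1-100": (1, 100)}


def build_prompt(text, low, high):
    """Build an evaluation prompt for one score range (illustrative wording)."""
    return (
        f"Rate the quality of the following text on a scale from {low} to {high}. "
        f"Answer with a single integer.\n\nText: {text}\nScore:"
    )


def parse_score(reply, low, high):
    """Return the first integer in the reply, or None if missing or out of range."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def rescale(value, low, high):
    """Map a score from [low, high] onto [0, 1] for downstream consumers."""
    return (value - low) / (high - low)


def score_with_range(judge, texts, low, high):
    """Score each text with the hypothetical `judge` callable; None marks a parse failure."""
    results = []
    for text in texts:
        score = parse_score(judge(build_prompt(text, low, high)), low, high)
        results.append(None if score is None else rescale(score, low, high))
    return results
```

Kurtosis and Pearson r are both unchanged by linear rescaling, so the `rescale` step does not affect the comparison between ranges; it only gives downstream code a common scale. Counting the `None` entries per range is a cheap check for ranges the model fails to follow.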

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-

Data URLs

https://github.com/WMT-QE-Task/wmt-qe-2022-data
https://huggingface.co/datasets/tmu-nlp/tmu_gfm_dataset
https://github.com/tmu-nlp/autoJQE
https://huggingface.co/datasets/MLSP2024/MLSP2024

Risks & Boundaries

Limitations

The study used public models whose alignment details are undisclosed, so it cannot fully trace which aspects of the alignment data cause the bias.

Study limited to numeric-score evaluation; results may not generalize to text or rank outputs.

When Not To Use

If your evaluator produces natural-language labels or ranks, these numeric-focused findings may not apply.

If you cannot run validation tests with human labels, selecting an optimal score range is risky without proxy metrics.

Failure Modes

Score-range tuning can reduce kurtosis yet produce random outputs if the range mismatches the model's behavior.

Calibration or temperature changes may reduce bias but also reduce accuracy for some models.

Core Entities

Models

gemma-7b, gemma-7b-it, Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1, Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Qwen2-7B, Qwen2-7B-Instruct

Metrics

kurtosis (score distribution sharpness); Pearson correlation coefficient r (model vs human scores); mode ratio (example-level bias indicator)
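For reference, the standard textbook definitions of the first two metrics are given below for model scores $s_1,\dots,s_n$ and human scores $h_1,\dots,h_n$. The excess-kurtosis convention (subtracting 3, so a normal distribution scores 0) is an assumption; this summary does not say which convention the paper uses.

```latex
\[
\mathrm{Kurt}(s) \;=\;
\frac{\frac{1}{n}\sum_{i=1}^{n}\,(s_i-\bar{s})^{4}}
     {\Bigl(\frac{1}{n}\sum_{i=1}^{n}\,(s_i-\bar{s})^{2}\Bigr)^{2}} \;-\; 3,
\qquad
r \;=\;
\frac{\sum_{i=1}^{n}(s_i-\bar{s})(h_i-\bar{h})}
     {\sqrt{\sum_{i=1}^{n}(s_i-\bar{s})^{2}}\;\sqrt{\sum_{i=1}^{n}(h_i-\bar{h})^{2}}}
\]
```

A distribution that piles most of its mass on one score with a few far outliers has a large fourth moment relative to its variance, which is why clustering shows up as high kurtosis.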

Datasets

WMT QE 2020 (MTQE DA sentence-level); WMT QE 2021 (dev/train for calibration); TMUGFM (English GECQE); autoJQE (FLUTEC and TECJL Japanese GECQE); MLSP2024 (LCP)

Benchmarks

MTQE (machine translation quality estimation); GECQE (grammatical error correction quality estimation); LCP (lexical complexity prediction)

You May Also Want to Read

Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected