Alignment makes LLM evaluators overuse certain scores; prompt score range is a cheap fix

January 23, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

Solid empirical analysis across multiple open models and datasets. Uses clear metrics (kurtosis, Pearson r). Limited by public-model unknowns and task scope.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Ayako Sato, Hwichan Kim, Zhousi Chen, Masato Mita, Mamoru Komachi

Links

Abstract / PDF / Data

Why It Matters For Business

If you use LLMs to score content, alignment can make scores cluster on a few values and hide real quality differences. That reduces trust and can mislead model selection, A/B tests, and automated pipelines that depend on numeric scores.

Who Should Care

Summary TLDR

When LLMs are aligned (instruction- or preference-tuned), they concentrate numeric evaluation scores on a few values (numerical bias). This hurts evaluator accuracy in many cases. Simple fixes (tuning temperature, calibrating distributions, and especially changing the prompt score range) reduce bias. Score-range tuning is the most effective in these experiments, but it is heuristic and task-specific.

Problem Statement

LLM-as-a-judge systems output numeric quality scores. After alignment, evaluators overuse specific numeric tokens (numerical bias). That clustering reduces the evaluator's ability to distinguish different-quality inputs and can lower correlation with human judgments.

Main Contribution

Showed alignment substantially increases numerical bias in open LLM evaluators by comparing pre- and post-alignment models across MTQE, GECQE, and LCP tasks.

Quantified the link between bias strength (kurtosis) and evaluation accuracy (Pearson r), reporting a strong negative relationship.

Key Findings

Alignment (instruction/preference tuning) increases numerical bias in evaluator outputs.

Numbers: Example: Gemma En-De kurtosis 0.27 (pre) → 128.17 (post)

Practical Use: Expect aligned evaluator models to cluster scores. If you see concentrated outputs, try a less-aligned model or mitigation (below).

Evidence Ref: Table 3, Figures 2–3

Stronger numerical bias correlates with lower evaluation accuracy on tested tasks.

Numbers: Kurtosis vs Pearson r correlation ≈ -0.60 (MTQE and GECQE)

Practical Use: Use kurtosis as a quick proxy for evaluator reliability when you lack gold labels. Prefer models with lower kurtosis.

Evidence Ref: Tables 3–4 (kurtosis and r), section 3.3
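As a rough illustration of the kurtosis proxy described above, the sketch below computes population excess kurtosis for two sets of evaluator scores. The digest does not say which kurtosis estimator the paper uses, so the excess (Fisher) definition here is an assumption, and both score lists are made-up examples, not data from the paper.

```python
def excess_kurtosis(scores):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution).

    Assumption: the paper's exact estimator is not given in this summary.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n  # second central moment
    m4 = sum((s - mean) ** 4 for s in scores) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("all scores identical; kurtosis is undefined")
    return m4 / (m2 * m2) - 3.0


# Hypothetical evaluator outputs on a 0-9 scale (not from the paper).
spread_scores = list(range(10)) * 10                           # every score used equally
clustered_scores = [7] * 90 + [0, 1, 2, 3, 4, 5, 6, 8, 9, 9]  # most mass on "7"

if __name__ == "__main__":
    print("spread:   ", round(excess_kurtosis(spread_scores), 2))   # about -1.22
    print("clustered:", round(excess_kurtosis(clustered_scores), 2))
```

A sharply peaked distribution such as `clustered_scores` yields a much larger value than the evenly spread one, which is the pattern the finding associates with lower evaluator accuracy.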

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Kurtosis increase after alignment (example) | Gemma En-De kurtosis 0.27 (pre) → 128.17 (post) | pre-alignment kurtosis 0.27 | +127.90 | MTQE En-De | Table 3 shows large kurtosis increases after alignment | Table 3
Accuracy | Pearson corr between kurtosis and r ≈ -0.60 | - | - | MTQE and GECQE (dataset-level) | Section 3.3 reports a -0.60 correlation indicating higher bias → lower r | Section 3.3, Tables 3–4

What To Try In 7 Days

Measure kurtosis of your evaluator's numeric outputs on a validation set; high kurtosis suggests numerical bias.

Treat the prompt score range as a tunable hyperparameter. Try at least 0–9, 1–5, and 1–100 and pick the one that lowers kurtosis and improves correlation on val data.

Test higher temperature and generative calibration as quick, low-cost fixes and compare accuracy on held-out human labels.
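The score-range step above can be sketched as a small selection loop. This is a minimal sketch, not the authors' procedure: the function names, the tie-breaking rule (highest Pearson r first, then lower kurtosis), and the example scores are all assumptions made for illustration. It assumes you have already run your evaluator once per candidate range on the same validation items and have human scores for those items.

```python
import math


def excess_kurtosis(scores):
    """Population excess kurtosis (m4 / m2**2 - 3)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    if m2 == 0:
        # Sketch convention: total collapse onto one score counts as maximal bias.
        return float("inf")
    return m4 / (m2 * m2) - 3.0


def pearson_r(xs, ys):
    """Pearson correlation between model scores and human scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # Sketch convention: r is undefined for constant input; treat as no signal.
        return 0.0
    return cov / math.sqrt(sxx * syy)


def select_score_range(runs, human_scores):
    """Pick the range with the highest r, breaking ties by lower kurtosis.

    `runs` maps a range label (e.g. "1-5") to the evaluator's scores
    obtained with that range in the prompt. Hypothetical interface.
    """
    report = {
        name: {
            "pearson_r": pearson_r(model_scores, human_scores),
            "kurtosis": excess_kurtosis(model_scores),
        }
        for name, model_scores in runs.items()
    }
    best = max(report, key=lambda k: (report[k]["pearson_r"], -report[k]["kurtosis"]))
    return best, report


if __name__ == "__main__":
    # Made-up validation data, for illustration only.
    human = [1, 2, 3, 4, 5, 6, 7, 8]
    runs = {
        "1-5": [3, 3, 3, 3, 3, 3, 3, 4],   # clustered on one value
        "0-9": [1, 2, 2, 4, 5, 5, 7, 8],   # tracks the human ordering
    }
    best, report = select_score_range(runs, human)
    print(best, report)
```

Both Pearson r and kurtosis are unaffected by rescaling, so results from different prompt ranges can be compared directly without normalizing the scores first.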

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-

Risks & Boundaries

Limitations

The study used public models whose alignment details are undisclosed, so the causes of bias in the alignment data cannot be fully analyzed.

Study limited to numeric-score evaluation; results may not generalize to text or rank outputs.

When Not To Use

If your evaluator produces natural-language labels or ranks, these numeric-focused findings may not apply.

If you cannot run validation tests with human labels, selecting an optimal score range is risky without proxy metrics.

Failure Modes

Score-range tuning can reduce kurtosis but produce random outputs if the range does not match the model's behavior.

Calibration or temperature changes may reduce bias but also reduce accuracy for some models.
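To make the temperature trade-off concrete, the sketch below applies standard temperature scaling to a softmax over score tokens. The logits are invented for illustration and the digest does not describe the paper's decoding setup, so this only shows the general mechanism: a higher temperature flattens the distribution, which spreads probability away from the dominant score but also moves it toward uniform noise.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the score tokens "1".."5" (not from the paper).
score_tokens = ["1", "2", "3", "4", "5"]
score_logits = [0.5, 1.0, 4.0, 1.5, 0.5]

if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(score_logits, t)
        top = max(probs)
        print(f"T={t}: top score '{score_tokens[probs.index(top)]}' has p={top:.2f}")
```

The most likely score keeps its rank at every temperature; only its share of the probability mass shrinks as the temperature rises, which is why validating accuracy against held-out human labels remains necessary.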

Core Entities

Models

gemma-7b, gemma-7b-it, Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1, Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Qwen2-7B, Qwen2-7B-Instruct

Metrics

kurtosis (score distribution sharpness), Pearson correlation coefficient r (model vs human scores), mode ratio (example-level bias indicator)

Datasets

WMT QE 2020 (MTQE DA sentence-level), WMT QE 2021 (dev/train for calibration), TMUGFM (English GECQE), autoJQE (FLUTEC and TECJL Japanese GECQE), MLSP2024 (LCP)

Benchmarks

MTQE (machine translation quality estimation), GECQE (grammatical error correction quality estimation), LCP (lexical complexity prediction)