Alignment makes LLM evaluators overuse certain scores; prompt score range is a cheap fix

January 23, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

0

Authors

Ayako Sato, Hwichan Kim, Zhousi Chen, Masato Mita, Mamoru Komachi

Links

Abstract / PDF

Why It Matters For Business

If you use LLMs to score content, alignment can make scores cluster on a few values and hide real quality differences. That reduces trust and can mislead model selection, A/B tests, and automated pipelines that depend on numeric scores.

Summary TLDR

When LLMs are aligned (instruction- or preference-tuned) they concentrate numeric evaluation scores on a few values (numerical bias). This hurts evaluator accuracy in many cases. Simple fixes—tuning temperature, calibrating distributions, and especially changing the prompt score range—reduce bias. Score-range tuning is the most effective in these experiments, but it is heuristic and task-specific.

Problem Statement

LLM-as-a-judge systems output numeric quality scores. After alignment, evaluators overuse specific numeric tokens (numerical bias). That clustering reduces the evaluator's ability to distinguish different-quality inputs and can lower correlation with human judgments.

Main Contribution

Showed alignment substantially increases numerical bias in open LLM evaluators by comparing pre- and post-alignment models across MTQE, GECQE, and LCP tasks.

Quantified the link between bias strength (kurtosis) and evaluation accuracy (Pearson r), reporting a strong negative relationship.

Tested three mitigation strategies—temperature scaling, distribution calibration, and prompt score-range adjustment—and found score-range tuning most often reduced bias and improved correlation.

Offered practical guidelines: use post-alignment models for accuracy but measure kurtosis to select less-biased evaluators and treat score range as a tunable hyperparameter.
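The two quantities used throughout, kurtosis of the score distribution and Pearson r against human scores, can be computed with the standard library alone. The sketch below makes two assumptions: it uses population excess kurtosis (the summary does not say which kurtosis convention the authors use), and the function names and toy score lists are illustrative only.

```python
import math

def excess_kurtosis(scores):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    if m2 == 0:
        # Every score identical; treated here as maximally peaked (an assumption).
        return float("inf")
    return m4 / m2 ** 2 - 3.0

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant scores carry no ranking signal
    return cov / (sx * sy)

# Toy data (not from the paper): an evenly spread evaluator vs. one stuck on 7.
spread = list(range(10))
peaked = [7] * 18 + [0, 9]
print(excess_kurtosis(spread), excess_kurtosis(peaked))
```

A heavily clustered score list yields a much larger kurtosis than an evenly spread one, which is the signal the guidelines suggest monitoring.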

Key Findings

Alignment (instruction/preference tuning) increases numerical bias in evaluator outputs.

Numbers (example): Gemma En-De kurtosis 0.27 (pre) → 128.17 (post)

Stronger numerical bias correlates with lower evaluation accuracy on tested tasks.

Numbers: Kurtosis vs. Pearson r correlation ≈ -0.60 (MTQE and GECQE)

Distribution calibration and temperature scaling reduce bias but give mixed gains in accuracy.

Numbers: Mistral MTQE kurtosis 52.92 → 3.39 after calibration; r changed 0.198 → 0.147

Changing the prompt score range often reduces bias and can improve correlation more than other fixes.

Numbers: Gemma MTQE r: 0.08 (0–9) → 0.14 (1–100); kurtosis reduced vs. default

Bias strength depends on model and language; high-resource languages show stronger bias.

Numbers: Post-alignment kurtosis notably higher for En-De, En-Zh than for Ne-En, Si-En

Results

Kurtosis increase after alignment (example)

Value: Gemma En-De kurtosis 0.27 (pre) → 128.17 (post)

Baseline: pre-alignment kurtosis 0.27

Accuracy

Value: Pearson correlation between kurtosis and r ≈ -0.60

Effect of calibration on kurtosis (example)

Value: Mistral MTQE kurtosis 52.92 → 3.39 after calibration

Baseline: 52.92

Score-range tuning can raise correlation (example)

Value: Gemma MTQE r: 0.08 (0–9) → 0.14 (1–100)

Baseline: r = 0.08 with 0–9

Temperature tuning (optimal temps found)

Value: Optimal temperature: Gemma 1.0, others 0.7 (default)

Baseline: default temperatures used in experiments
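For readers unfamiliar with why temperature affects score clustering, the sketch below shows the general mechanism of temperature-scaled softmax over score tokens. It is not the authors' code: the logits are hypothetical values for the tokens "0" to "9", chosen only to show that a higher temperature flattens a peaked distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for score tokens "0".."9" (illustrative, not measured).
toy_logits = [0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 1.5, 4.0, 1.2, 0.4]

for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: probability of the most likely score = {max(probs):.3f}")
```

Raising the temperature moves probability mass away from the dominant score token, which can lower kurtosis but, as the findings note, does not guarantee better correlation with human scores.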

Who Should Care

What To Try In 7 Days

Measure kurtosis of your evaluator's numeric outputs on a validation set; high kurtosis suggests numerical bias.

Treat the prompt score range as a tunable hyperparameter. Try at least 0–9, 1–5, and 1–100, and pick the one that lowers kurtosis and improves correlation on validation data.

Test higher temperature and generative calibration as quick, low-cost fixes and compare accuracy on held-out human labels.
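The score-range sweep described above can be sketched as follows. Everything here is an assumption for illustration: `judge` is a hypothetical callable that takes a prompt string and returns the model's reply, the prompt template is invented, and kurtosis is population excess kurtosis. Results are sorted by Pearson r, with kurtosis reported alongside for inspection.

```python
import math
import re

def build_prompt(text, low, high):
    # Hypothetical template; adapt the wording to your own evaluation task.
    return (f"Rate the quality of the following text on a scale from {low} to {high}. "
            f"Answer with a single integer.\n\nText: {text}\nScore:")

def parse_score(reply, low, high):
    """Return the first integer in the reply, or None if missing or out of range."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None

def excess_kurtosis(scores):
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    return float("inf") if m2 == 0 else m4 / m2 ** 2 - 3.0

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def sweep_score_ranges(texts, human_scores, judge,
                       ranges=((0, 9), (1, 5), (1, 100))):
    """Score the validation set once per range and report kurtosis and r."""
    results = []
    for low, high in ranges:
        pairs = []
        for text, human in zip(texts, human_scores):
            score = parse_score(judge(build_prompt(text, low, high)), low, high)
            if score is not None:  # skip unparseable or out-of-range replies
                pairs.append((score, human))
        if len(pairs) < 2:
            continue  # too few usable scores to correlate
        model = [p[0] for p in pairs]
        gold = [p[1] for p in pairs]
        results.append({
            "range": (low, high),
            "kurtosis": excess_kurtosis(model),
            "r": pearson_r(model, gold),
            "n_parsed": len(pairs),
        })
    # Highest correlation first; compare kurtosis manually before choosing.
    return sorted(results, key=lambda row: row["r"], reverse=True)
```

Checking `n_parsed` for each range is worthwhile: a range that lowers kurtosis only because many replies failed to parse is not a real improvement.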

Reproducibility

License

  • Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Used public models whose alignment details are undisclosed, so causes of bias in alignment data are not fully analyzable.
  • Study limited to numeric-score evaluation; results may not generalize to text or rank outputs.
  • Both pre- and post-alignment models must emit numeric tokens, excluding some LLMs from analysis.
  • Mitigations (score range tuning) are heuristic and task-specific; general solutions remain open.

When Not To Use

  • If your evaluator produces natural-language labels or ranks, these numeric-focused findings may not apply.
  • If you cannot run validation tests with human labels, selecting an optimal score range is risky without proxy metrics.

Failure Modes

  • Score-range tuning can reduce kurtosis but produce random outputs if the range mismatches model behavior.
  • Calibration or temperature changes may reduce bias but also reduce accuracy for some models.
  • Using kurtosis alone can mislead if dataset-level kurtosis hides important per-example failures.

Core Entities

Models

  • gemma-7b
  • gemma-7b-it
  • Mistral-7B-v0.1
  • Mistral-7B-Instruct-v0.1
  • Meta-Llama-3-8B
  • Meta-Llama-3-8B-Instruct
  • Qwen2-7B
  • Qwen2-7B-Instruct

Metrics

  • kurtosis (score distribution sharpness)
  • Pearson correlation coefficient r (model vs human scores)
  • mode ratio (example-level bias indicator)
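The summary does not define mode ratio. One plausible reading, assumed here, is the share of repeated samples for a single input that land on the most frequent score; the function name and the sample lists are illustrative.

```python
from collections import Counter

def mode_ratio(sampled_scores):
    """Fraction of samples for one example that equal the most common score.

    Assumed definition: the source only calls this an example-level
    bias indicator and gives no formula.
    """
    if not sampled_scores:
        raise ValueError("need at least one sampled score")
    counts = Counter(sampled_scores)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(sampled_scores)

# Toy samples (not from the paper): the same input scored several times.
print(mode_ratio([7, 7, 7, 3]))
print(mode_ratio([1, 2, 3, 4]))
```

A value near 1.0 means the evaluator returns the same score almost every time for that input, which dataset-level kurtosis alone may not reveal.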

Datasets

  • WMT QE 2020 (MTQE DA sentence-level)
  • WMT QE 2021 (dev/train for calibration)
  • TMUGFM (English GECQE)
  • autoJQE (FLUTEC and TECJL Japanese GECQE)
  • MLSP2024 (LCP)

Benchmarks

  • MTQE (machine translation quality estimation)
  • GECQE (grammatical error correction quality estimation)
  • LCP (lexical complexity prediction)