Overview
Solid empirical analysis across multiple open models and datasets. Uses clear metrics (kurtosis, Pearson r). Limited by public-model unknowns and task scope.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/5
Reproducibility
Status: Partial assets available
Open source: Partial
License: Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
If you use LLMs to score content, alignment can make scores cluster on a few values and hide real quality differences. That reduces trust and can mislead model selection, A/B tests, and automated pipelines that depend on numeric scores.
Who Should Care
Summary TLDR
When LLMs are aligned (instruction- or preference-tuned), they concentrate numeric evaluation scores on a few values (numerical bias). This hurts evaluator accuracy in many cases. Simple fixes (tuning temperature, calibrating distributions, and especially changing the prompt score range) reduce bias. Score-range tuning is the most effective in these experiments, but it is heuristic and task-specific.
Problem Statement
LLM-as-a-judge systems output numeric quality scores. After alignment, evaluators overuse specific numeric tokens (numerical bias). That clustering reduces the evaluator's ability to distinguish different-quality inputs and can lower correlation with human judgments.
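A minimal sketch of how the clustering described above might be measured. It assumes excess (Fisher) kurtosis over population moments; the summary does not say which kurtosis definition the paper uses. The function names and sample scores are hypothetical and exist only for illustration.

```python
from collections import Counter


def excess_kurtosis(scores):
    """Population excess kurtosis (Fisher definition): m4 / m2**2 - 3.

    Assumption: the paper's exact kurtosis variant is not given in the
    summary, so this is one common choice, not necessarily theirs.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n  # second central moment
    if m2 == 0:
        raise ValueError("all scores identical; kurtosis is undefined")
    m4 = sum((s - mean) ** 4 for s in scores) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


def top_value_share(scores):
    """Fraction of outputs taken by the single most frequent score."""
    return Counter(scores).most_common(1)[0][1] / len(scores)


# Hypothetical evaluator outputs on a 0-100 scale (made-up numbers).
spread = [10, 25, 40, 55, 60, 70, 80, 85, 90, 95]
clustered = [85] * 18 + [10, 95]

for name, scores in (("spread", spread), ("clustered", clustered)):
    print(name, round(excess_kurtosis(scores), 2), top_value_share(scores))
```

The share of the most frequent value is included as a second, easier-to-read signal: a heavily peaked distribution shows both high kurtosis and one score dominating the outputs.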
Main Contribution
Showed alignment substantially increases numerical bias in open LLM evaluators by comparing pre- and post-alignment models across MTQE, GECQE, and LCP tasks.
Quantified the link between bias strength (kurtosis) and evaluation accuracy (Pearson r), reporting a strong negative relationship.
Key Findings
Alignment (instruction/preference tuning) increases numerical bias in evaluator outputs.
Stronger numerical bias correlates with lower evaluation accuracy on tested tasks.
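The relationship in the second finding can be checked on your own evaluators along these lines. This is a sketch, not the paper's procedure: it assumes you already have one kurtosis value and one accuracy value (Pearson r against human labels) per dataset, and the numbers below are invented placeholders, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-dataset values (hypothetical, NOT taken from the paper).
kurtosis_per_dataset = [0.3, 2.1, 15.0, 48.0, 120.0]
accuracy_per_dataset = [0.52, 0.50, 0.41, 0.33, 0.30]

# A negative value means datasets with stronger bias tend to score lower.
link = pearson_r(kurtosis_per_dataset, accuracy_per_dataset)
print(f"correlation between kurtosis and accuracy: {link:.2f}")
```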
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Kurtosis increase after alignment (example) | Gemma En-De kurtosis 0.27 (pre) → 128.17 (post) | pre-alignment kurtosis 0.27 | +127.90 | MTQE En-De | Table 3 shows large kurtosis increases after alignment | Table 3 |
| Correlation between bias (kurtosis) and accuracy (Pearson r) | ≈ -0.60 | — | — | MTQE and GECQE (dataset-level) | Section 3.3 reports a correlation of -0.60: stronger bias goes with lower r | Section 3.3, Tables 3–4 |
What To Try In 7 Days
Measure kurtosis of your evaluator's numeric outputs on a validation set; high kurtosis suggests numerical bias.
Treat the prompt score range as a tunable hyperparameter. Try at least 0–9, 1–5, and 1–100, and pick the one that lowers kurtosis and improves correlation on validation data.
Test higher temperature and generative calibration as quick, low-cost fixes and compare accuracy on held-out human labels.
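The steps above can be sketched as a small selection routine. It assumes you have already run the evaluator once per candidate score range on the same validation items and collected human labels for them. The selection rule (highest Pearson r, ties broken by lower kurtosis) is an illustrative assumption, not the paper's procedure, and all names and numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between evaluator scores and human labels."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def excess_kurtosis(xs):
    """Population excess kurtosis (assumed definition): m4 / m2**2 - 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        raise ValueError("all scores identical; kurtosis is undefined")
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0


def pick_score_range(runs, human_labels):
    """Return (best_label, report) for candidate prompt score ranges.

    `runs` maps a range label to the evaluator's scores on the validation
    set. Illustrative rule: prefer the highest correlation with human
    labels, then the lower kurtosis.
    """
    report = {
        label: {
            "pearson_r": pearson_r(scores, human_labels),
            "kurtosis": excess_kurtosis(scores),
        }
        for label, scores in runs.items()
    }
    best = max(
        report,
        key=lambda k: (report[k]["pearson_r"], -report[k]["kurtosis"]),
    )
    return best, report


# Hypothetical validation data: six items, three candidate ranges.
human = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
runs = {
    "1-5": [4, 4, 4, 4, 4, 5],
    "0-9": [2, 3, 4, 6, 7, 8],
    "1-100": [80, 80, 80, 85, 80, 90],
}
best, report = pick_score_range(runs, human)
print(best, report[best])
```

Ranking on correlation first keeps a range from winning only because it flattens the score distribution, since lower kurtosis alone does not guarantee better agreement with human labels.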
Risks & Boundaries
Limitations
Used public models whose alignment details are undisclosed, so causes of bias in alignment data are not fully analyzable.
Study limited to numeric-score evaluation; results may not generalize to text or rank outputs.
When Not To Use
If your evaluator produces natural-language labels or ranks, these numeric-focused findings may not apply.
If you cannot run validation tests with human labels, selecting an optimal score range is risky without proxy metrics.
Failure Modes
Score-range tuning can reduce kurtosis but produce random outputs if the chosen range does not match the model's behavior.
Calibration or temperature changes may reduce bias but also reduce accuracy for some models.

