Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If you use LLMs to score content, alignment can make scores cluster on a few values and hide real quality differences. That reduces trust and can mislead model selection, A/B tests, and automated pipelines that depend on numeric scores.
Summary TLDR
When LLMs are aligned (instruction- or preference-tuned), they concentrate numeric evaluation scores on a few values (numerical bias). This hurts evaluator accuracy in many cases. Three simple fixes reduce the bias: tuning temperature, calibrating distributions, and especially changing the prompt score range. Score-range tuning is the most effective in these experiments, but it is heuristic and task-specific.
Problem Statement
LLM-as-a-judge systems output numeric quality scores. After alignment, evaluators overuse specific numeric tokens (numerical bias). That clustering reduces the evaluator's ability to distinguish different-quality inputs and can lower correlation with human judgments.
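To make the clustering effect concrete, here is a minimal sketch that compares the kurtosis of a spread-out score list with one piled onto a single value. The score lists are made-up illustrations, not data from the study, and the choice of population excess kurtosis (normal distribution = 0) is an assumption; the study may use a different convention.

```python
def excess_kurtosis(scores):
    """Population excess kurtosis: fourth central moment / variance^2 - 3."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; kurtosis is undefined")
    return m4 / (m2 ** 2) - 3.0


# Hypothetical evaluator outputs on a 1-5 scale (illustrative only).
spread_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
clustered_scores = [3, 3, 3, 3, 3, 3, 3, 3, 1, 5]

print("spread:   ", excess_kurtosis(spread_scores))
print("clustered:", excess_kurtosis(clustered_scores))
```

The clustered list yields the higher value, which is the pattern the text describes as numerical bias.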
Main Contribution
Showed alignment substantially increases numerical bias in open LLM evaluators by comparing pre- and post-alignment models across MTQE, GECQE, and LCP tasks.
Quantified the link between bias strength (kurtosis) and evaluation accuracy (Pearson r), reporting a strong negative relationship.
Tested three mitigation strategies—temperature scaling, distribution calibration, and prompt score-range adjustment—and found score-range tuning most often reduced bias and improved correlation.
Offered practical guidelines: use post-alignment models for accuracy but measure kurtosis to select less-biased evaluators and treat score range as a tunable hyperparameter.
Key Findings
Alignment (instruction/preference tuning) increases numerical bias in evaluator outputs.
Stronger numerical bias correlates with lower evaluation accuracy on tested tasks.
Distribution calibration and temperature scaling reduce bias but give mixed gains in accuracy.
Changing the prompt score range often reduces bias and can improve correlation more than other fixes.
Bias strength depends on model and language; high-resource languages show stronger bias.
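The temperature finding can be illustrated with a small sketch of how sampling temperature reshapes the probability mass over score tokens. The logits below are hypothetical, and `softmax_with_temperature` is an illustrative helper, not code from the study; it only shows why a higher temperature spreads mass away from a dominant score.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the score tokens "1" through "5".
score_logits = [0.5, 1.0, 4.0, 1.5, 0.2]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(score_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

A flatter distribution lowers the concentration on one score, but as the findings note, that does not guarantee better agreement with human labels, so accuracy still needs to be checked separately.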
Results
Kurtosis increase after alignment (example)
Accuracy
Effect of calibration on kurtosis (example)
Score-range tuning can raise correlation (example)
Temperature tuning (optimal temps found)
Who Should Care
What To Try In 7 Days
Measure kurtosis of your evaluator's numeric outputs on a validation set; high kurtosis suggests numerical bias.
Treat the prompt score range as a tunable hyperparameter. Try at least 0–9, 1–5, and 1–100, and pick the one that lowers kurtosis and improves correlation with human labels on validation data.
Test higher temperature and generative calibration as quick, low-cost fixes and compare accuracy on held-out human labels.
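The checklist above can be sketched as a small selection routine. It assumes you have already collected evaluator scores for the same validation items under each candidate score range, plus human labels. All numbers are invented for illustration, and `pick_score_range` is a hypothetical helper that ranks by Pearson r and reports kurtosis alongside it; how to weigh the two is a judgment call.

```python
import math


def excess_kurtosis(scores):
    """Population excess kurtosis (normal distribution = 0)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; kurtosis is undefined")
    return m4 / (m2 ** 2) - 3.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("constant input; correlation is undefined")
    return cov / (sx * sy)


def pick_score_range(runs, human_scores):
    """Return the range label with the highest Pearson r, plus a full report."""
    report = {
        label: {
            "kurtosis": excess_kurtosis(scores),
            "pearson_r": pearson_r(scores, human_scores),
        }
        for label, scores in runs.items()
    }
    best = max(report, key=lambda label: report[label]["pearson_r"])
    return best, report


# Hypothetical validation data: human labels and evaluator scores per range.
human = [10, 35, 50, 65, 90, 20, 75, 40]
runs = {
    "1-5": [3, 3, 3, 3, 4, 3, 3, 3],
    "0-9": [1, 3, 5, 6, 8, 2, 7, 4],
    "1-100": [50, 50, 50, 70, 80, 50, 70, 50],
}

best, report = pick_score_range(runs, human)
for label, stats in report.items():
    print(label, round(stats["kurtosis"], 2), round(stats["pearson_r"], 3))
print("selected:", best)
```

Ranking by correlation first and using kurtosis as a diagnostic keeps the selection tied to human labels; a range that lowers kurtosis without improving correlation may simply be adding noise.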
Reproducibility
License
- Mistral and Qwen2: Apache-2.0. Gemma and Llama-3 distributed under provider/lic-
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Used public models whose alignment details are undisclosed, so causes of bias in alignment data are not fully analyzable.
- Study limited to numeric-score evaluation; results may not generalize to text or rank outputs.
- Both pre- and post-alignment models must emit numeric tokens, excluding some LLMs from analysis.
- Mitigations such as score-range tuning are heuristic and task-specific; general solutions remain open.
When Not To Use
- If your evaluator produces natural-language labels or ranks, these numeric-focused findings may not apply.
- If you cannot run validation tests with human labels, selecting an optimal score range is risky without proxy metrics.
Failure Modes
- Score-range tuning can reduce kurtosis yet produce near-random outputs if the chosen range does not match the model's behavior.
- Calibration or temperature changes may reduce bias but also reduce accuracy for some models.
- Using kurtosis alone can mislead if dataset-level kurtosis hides important per-example failures.
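One way to look beneath dataset-level kurtosis is a per-example check. The sketch below assumes each example is scored several times by sampling, and it flags examples whose samples mostly land on the dataset-wide most frequent score. This "mode share" is an illustrative proxy defined here, not necessarily the mode ratio used in the study, and the 0.75 threshold and sample data are arbitrary.

```python
from collections import Counter


def dataset_mode(samples_per_example):
    """Most frequent score across every sample of every example."""
    counts = Counter(
        score
        for samples in samples_per_example.values()
        for score in samples
    )
    return counts.most_common(1)[0][0]


def mode_share(samples, mode):
    """Fraction of one example's sampled scores equal to the dataset mode."""
    return sum(1 for score in samples if score == mode) / len(samples)


# Hypothetical sampled scores: four samples per example on a 1-5 scale.
samples_per_example = {
    "ex1": [3, 3, 3, 3],
    "ex2": [3, 4, 3, 3],
    "ex3": [1, 2, 5, 4],
}

mode = dataset_mode(samples_per_example)
threshold = 0.75  # arbitrary cut-off for this illustration
flagged = [
    example_id
    for example_id, samples in samples_per_example.items()
    if mode_share(samples, mode) >= threshold
]
print("dataset mode:", mode)
print("examples dominated by the mode:", flagged)
```

Reviewing the flagged examples against human labels shows whether the evaluator is defaulting to its favorite score on inputs that actually differ in quality.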
Core Entities
Models
- gemma-7b
- gemma-7b-it
- Mistral-7B-v0.1
- Mistral-7B-Instruct-v0.1
- Meta-Llama-3-8B
- Meta-Llama-3-8B-Instruct
- Qwen2-7B
- Qwen2-7B-Instruct
Metrics
- kurtosis (score distribution sharpness)
- Pearson correlation coefficient r (model vs human scores)
- mode ratio (example-level bias indicator)
Datasets
- WMT QE 2020 (MTQE DA sentence-level)
- WMT QE 2021 (dev/train for calibration)
- TMUGFM (English GECQE)
- autoJQE (FLUTEC and TECJL Japanese GECQE)
- MLSP2024 (LCP)
Benchmarks
- MTQE (machine translation quality estimation)
- GECQE (grammatical error correction quality estimation)
- LCP (lexical complexity prediction)

