Overview
The method shows clear experimental gains backed by ablations. It requires extra training components (proxy tasks, KL terms) and modest infrastructure, so it fits best in evaluation pipelines where multilingual fairness matters.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Automated multilingual evaluation can systematically over-score machine-translated outputs, harming product QA and model selection. DIBJUDGE cuts that bias while improving accuracy, lowering the risk of deploying models tuned to translation artifacts.
Summary TLDR
The paper identifies a systematic "translationese bias" where LLM judges prefer machine-translated text over human-authored references—especially in low-resource languages. It introduces DIBJUDGE, a fine-tuning recipe that (1) splits representations into a compressed, judgment-critical channel and a dedicated bias channel, (2) trains proxy tasks to capture translationese signals, and (3) penalizes cross-covariance between channels. Across multilingual reward benchmarks and a translationese test suite, DIBJUDGE raises accuracy and sharply reduces bias without hurting English performance.
Problem Statement
LLM-based evaluators often prefer machine-translated (translationese) text over human-authored references. This bias grows in low-resource languages and undermines multilingual evaluation. The problem: how to remove spurious signals (alignment to English and high model predictability) while keeping judgment utility.
Main Contribution
Characterize "translationese bias" in LLM judges and link it to two spurious factors: latent alignment to English and cross-lingual predictability.
Propose DIBJUDGE: a disentangled variational information-bottleneck fine-tuning method that (a) compresses a robust channel, (b) routes spurious signals into a bias channel, and (c) penalizes channel dependence via a cross-covariance term.
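The cross-covariance term in (c) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `cross_covariance_penalty`, the squared-Frobenius form of the penalty, the latent sizes, and the synthetic latents are all assumptions made for illustration.

```python
import numpy as np

def cross_covariance_penalty(z_robust, z_bias):
    """Squared Frobenius norm of the batch cross-covariance between two channels.

    z_robust: (batch, d_r) latents from the robust channel (hypothetical shape).
    z_bias:   (batch, d_b) latents from the bias channel (hypothetical shape).
    """
    # Center each channel over the batch dimension.
    zr = z_robust - z_robust.mean(axis=0, keepdims=True)
    zb = z_bias - z_bias.mean(axis=0, keepdims=True)
    n = z_robust.shape[0]
    # (d_r, d_b) matrix of covariances between every pair of latent dimensions.
    cov = zr.T @ zb / (n - 1)
    return float(np.sum(cov ** 2))

# Synthetic latents, for illustration only.
rng = np.random.default_rng(0)
z_r = rng.normal(size=(256, 8))
z_b_indep = rng.normal(size=(256, 4))                      # unrelated to z_r
z_b_leaky = z_r[:, :4] + 0.1 * rng.normal(size=(256, 4))   # copies part of z_r

penalty_indep = cross_covariance_penalty(z_r, z_b_indep)
penalty_leaky = cross_covariance_penalty(z_r, z_b_leaky)
print(penalty_indep, penalty_leaky)
```

In this sketch the penalty stays near zero when the two channels are linearly unrelated and grows when the bias channel duplicates information from the robust channel, which is the dependence such a term is meant to discourage.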
Key Findings
DIBJUDGE improves multilingual reward-model accuracy and sets a new open-weight SOTA on m-RewardBench.
DIBJUDGE substantially reduces translationese bias across datasets and resource tiers.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 91.37 ± 0.22 | Qwen3-8B 86.12 ± 0.52 | +5.25 pts | m-RewardBench | Table 1: DIBJUDGE-Qwen3-8B vs Qwen3-8B | Table 1 |
| Accuracy | 91.01 ± 0.20 | Qwen3-8B 88.81 ± 0.48 | +2.20 pts | RewardBench (English) | Table 1: DIBJUDGE-Qwen3-8B | Table 1 |
What To Try In 7 Days
Run pairwise judge checks on your multilingual eval sets and compute Bias Severity (S_bias) between human and back-translated candidates.
Measure CAD and SSR per language to see if judge preferences correlate with English alignment or predictability.
Prototype a LoRA-based fine-tune that adds a small bias encoder, a robust encoder with a KL bottleneck, and a cross-covariance penalty on a subset of languages.
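As a starting point for the first check above, the sketch below tallies how often a judge picks the machine-translated candidate over the human-authored one, per language. The exact S_bias definition is not reproduced in this summary, so the plain preference rate here is only a stand-in; the record format, the `mt_preference_rate` name, and the toy records are hypothetical.

```python
from collections import defaultdict

def mt_preference_rate(records):
    """Fraction of pairwise comparisons, per language, won by the machine-translated candidate.

    records: iterable of (language, winner) pairs, where winner is "mt" or "human"
    (a hypothetical format, not one defined by the paper).
    """
    counts = defaultdict(lambda: [0, 0])  # language -> [mt_wins, total]
    for lang, winner in records:
        counts[lang][0] += int(winner == "mt")
        counts[lang][1] += 1
    return {lang: wins / total for lang, (wins, total) in counts.items()}

# Made-up judge decisions, for illustration only.
records = [
    ("sw", "mt"), ("sw", "mt"), ("sw", "human"), ("sw", "mt"),
    ("de", "human"), ("de", "mt"),
]
rates = mt_preference_rate(records)
print(rates)
```

A rate well above 0.5 for a language would suggest the judge leans toward translated text there; comparing rates across resource tiers is one simple way to see whether the lean grows in low-resource languages.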
Risks & Boundaries
Limitations
Requires additional encoder heads and proxy tasks; increases fine-tuning complexity and engineering cost.
Assumes Gaussian-like latent statistics for the cross-covariance surrogate; bias removal may degrade if that assumption fails.
When Not To Use
When you only evaluate monolingual English systems with no translation artifacts.
On extremely small models where added projection heads and stochastic bottlenecks dominate capacity.
Failure Modes
Over-compression (large β) can remove task-relevant semantics and harm accuracy.
Proxy tasks may miss other spurious biases, letting new shortcuts persist.
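The over-compression failure mode can be made concrete with a small sketch of how β weights a KL bottleneck term. It assumes the common variational information-bottleneck setup (a diagonal-Gaussian encoder and a standard-normal prior), which this summary does not confirm for DIBJUDGE; the names `gaussian_kl` and `total_loss` and all numbers are illustrative.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch."""
    kl_per_example = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(kl_per_example.mean())

def total_loss(task_loss, mu, log_var, beta):
    """Task loss plus a beta-weighted compression term (assumed form)."""
    return task_loss + beta * gaussian_kl(mu, log_var)

# Illustrative encoder outputs: batch of 2, latent size 4.
mu = np.ones((2, 4))
log_var = np.zeros((2, 4))

kl = gaussian_kl(mu, log_var)
loss_small_beta = total_loss(0.5, mu, log_var, beta=0.1)
loss_large_beta = total_loss(0.5, mu, log_var, beta=1.0)
print(kl, loss_small_beta, loss_large_beta)
```

With a large β the compression term dominates the objective, so the optimizer is pushed to collapse the latent toward the prior even when that discards task-relevant information. Sweeping β on a held-out set is one way to find where accuracy starts to fall.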

