Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Automated multilingual evaluation can systematically over-score machine-translated outputs, which distorts product QA and model selection. DIBJUDGE reduces that bias while improving accuracy, lowering the risk of deploying models tuned to translation artifacts.
Summary TLDR
The paper identifies a systematic "translationese bias" where LLM judges prefer machine-translated text over human-authored references—especially in low-resource languages. It introduces DIBJUDGE, a fine-tuning recipe that (1) splits representations into a compressed, judgment-critical channel and a dedicated bias channel, (2) trains proxy tasks to capture translationese signals, and (3) penalizes cross-covariance between channels. Across multilingual reward benchmarks and a translationese test suite, DIBJUDGE raises accuracy and sharply reduces bias without hurting English performance.
Problem Statement
LLM-based evaluators often prefer machine-translated (translationese) text over human-authored references. This bias grows in low-resource languages and undermines multilingual evaluation. The problem: how to remove spurious signals (alignment to English and high model predictability) while keeping judgment utility.
Main Contribution
Characterize "translationese bias" in LLM judges and link it to two spurious factors: latent alignment to English and cross-lingual predictability.
Propose DIBJUDGE: a disentangled variational information-bottleneck fine-tuning method that (a) compresses a robust channel, (b) routes spurious signals into a bias channel, and (c) penalizes channel dependence via a cross-covariance term.
Introduce two proxy tasks—cross-lingual alignment contrastive learning and log-probability bin classification—to explicitly capture translationese signals during training.
Show consistent empirical gains: improved multilingual reward-model accuracy and large reductions in measured translationese bias on multiple benchmarks.
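The loss structure described in the second contribution can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes a diagonal-Gaussian posterior for the robust channel, a squared-Frobenius-norm cross-covariance penalty, and hypothetical weights `beta` and `lam`. All function names are invented for the example.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single sample."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def cross_covariance_penalty(z_robust, z_bias):
    """Squared Frobenius norm of the batch cross-covariance between two channels."""
    n = len(z_robust)
    d_r, d_b = len(z_robust[0]), len(z_bias[0])
    mean_r = [sum(row[i] for row in z_robust) / n for i in range(d_r)]
    mean_b = [sum(row[j] for row in z_bias) / n for j in range(d_b)]
    total = 0.0
    for i in range(d_r):
        for j in range(d_b):
            cov = sum(
                (z_robust[k][i] - mean_r[i]) * (z_bias[k][j] - mean_b[j])
                for k in range(n)
            ) / n
            total += cov * cov
    return total

def dib_loss(task_loss, proxy_loss, mu, log_var, z_robust, z_bias,
             beta=0.01, lam=1.0):
    """Combine the terms; beta and lam are assumed weights, not values from the paper."""
    kl = sum(kl_to_standard_normal(m, lv) for m, lv in zip(mu, log_var)) / len(mu)
    penalty = cross_covariance_penalty(z_robust, z_bias)
    return task_loss + proxy_loss + beta * kl + lam * penalty
```

The penalty is zero when the two channels are linearly uncorrelated across the batch and grows as they co-vary, which is the dependence the method is described as discouraging. In practice these terms would be written with a tensor library so that gradients flow through them.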
Key Findings
DIBJUDGE improves multilingual reward-model accuracy and sets a new open-weight state of the art on M-RewardBench.
DIBJUDGE substantially reduces translationese bias across datasets and resource tiers.
Latent disentanglement is verifiable: the bias channel encodes translationese while the robust channel is invariant.
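A linear-probe check of this kind can be illustrated with a small, self-contained sketch. The data here is synthetic and purely hypothetical: the "bias" latents are built to carry a label-dependent shift and the "robust" latents are pure noise, so the example shows only the mechanics of probing, not the paper's measurements. The probe is a plain logistic regression trained by gradient descent.

```python
import math
import random

def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = b + sum(wi * xi for wi, xi in zip(w, x))
            logit = max(-30.0, min(30.0, logit))  # clamp for numerical safety
            err = 1.0 / (1.0 + math.exp(-logit)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(features, labels, w, b):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)

def demo_probe(seed=0, n=400):
    """Train on the first half, evaluate on the second half, for each channel."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]  # 1 = translationese, 0 = human-written
    z_bias = [[(2.0 if y else -2.0) + rng.gauss(0, 0.5), rng.gauss(0, 1)]
              for y in labels]
    z_robust = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in labels]
    half = n // 2
    results = {}
    for name, z in (("bias", z_bias), ("robust", z_robust)):
        w, b = train_linear_probe(z[:half], labels[:half])
        results[name] = probe_accuracy(z[half:], labels[half:], w, b)
    return results
```

On real latents, high probe accuracy on the bias channel together with near-chance accuracy on the robust channel would be consistent with the disentanglement claim. A probe near chance shows only that the signal is not linearly decodable.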
Results
Accuracy
Average translationese bias reduction
Latent disentanglement (linear probe)
Who Should Care
What To Try In 7 Days
Run pairwise judge checks on your multilingual eval sets and compute Bias Severity (S_bias) by comparing human-written and back-translated candidates.
Measure CAD and SSR per language to see if judge preferences correlate with English alignment or predictability.
On a subset of languages, prototype a LoRA-based fine-tune that adds a small bias encoder, a robust encoder with a KL bottleneck, and a cross-covariance penalty between the two.
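The first of these steps could start from something like the sketch below. The exact definition of S_bias is not given in this summary, so the code uses a simpler stand-in: the per-language fraction of pairs in which the judge scores the back-translated candidate above the human-written one. The `judge` callable and the record fields are hypothetical interfaces chosen for illustration.

```python
from collections import defaultdict

def translationese_preference_rate(pairs, judge):
    """Per-language fraction of pairs where the judge prefers the translated text.

    pairs: iterable of dicts with keys 'lang', 'prompt', 'human', 'translated'
    judge: callable (prompt, response) -> float score (hypothetical interface)
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for pair in pairs:
        human_score = judge(pair["prompt"], pair["human"])
        mt_score = judge(pair["prompt"], pair["translated"])
        totals[pair["lang"]] += 1
        if mt_score > human_score:
            wins[pair["lang"]] += 1
    return {lang: wins[lang] / totals[lang] for lang in totals}

# Toy usage with a stand-in judge that simply rewards longer responses.
toy_pairs = [
    {"lang": "sw", "prompt": "q", "human": "a", "translated": "aaa"},
    {"lang": "de", "prompt": "q", "human": "bbbb", "translated": "b"},
]
rates = translationese_preference_rate(toy_pairs, lambda prompt, resp: float(len(resp)))
```

Rates that climb as language resources shrink would match the pattern the paper reports, though this stand-in is not the paper's metric and should not be compared to its numbers.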
Agent Features
Tool Use
- LoRA
Optimization Features
Infra Optimization
- 8 × NVIDIA H20 GPUs, single-node training
Model Optimization
- LoRA
System Optimization
- DeepSpeed ZeRO Stage 3 with CPU offload
- FlashAttention-2 used for training speed
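For orientation, a DeepSpeed configuration matching this description might look like the sketch below, written as a Python dict that can be dumped to JSON. The paper's actual configuration is not shown in this summary. Offloading both optimizer state and parameters, the bf16 setting, and the batch-size values are all assumptions.

```python
import json

# Hypothetical ZeRO Stage 3 configuration with CPU offload.
ds_config = {
    "bf16": {"enabled": True},  # assumed precision; not stated in the source
    "zero_optimization": {
        "stage": 3,
        # Which states are offloaded is an assumption; the source says only "CPU offload".
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "gradient_accumulation_steps": 8,     # placeholder value
}

config_json = json.dumps(ds_config, indent=2)
```

With Hugging Face Transformers, FlashAttention-2 is typically requested at model load time via `attn_implementation="flash_attention_2"`. Whether the authors used that route is not stated.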
Training Optimization
- Variational Information Bottleneck (KL regularizer)
- Cross-covariance penalty for disentanglement
- Proxy tasks: contrastive CLA and log-probability bin classification
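To make the log-probability bin classification task concrete, the sketch below shows one plausible way to turn per-token log-probabilities into class labels. The binning scheme (equal-frequency bins over mean token log-probability) and the default of four bins are assumptions, since the summary does not specify them. The contrastive alignment task is not covered here.

```python
def equal_frequency_edges(values, num_bins):
    """Return num_bins - 1 cut points that split values into roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // num_bins] for k in range(1, num_bins)]

def assign_bin(value, edges):
    """Index of the bin containing value, from 0 to len(edges)."""
    return sum(value >= edge for edge in edges)

def logprob_bin_labels(token_logprobs_per_seq, num_bins=4):
    """Map each sequence to a bin label based on its mean token log-probability."""
    means = [sum(lp) / len(lp) for lp in token_logprobs_per_seq]
    edges = equal_frequency_edges(means, num_bins)
    return [assign_bin(m, edges) for m in means]

# Toy usage: four sequences with mean log-probabilities -1, -4, -2 and -3.
labels = logprob_bin_labels([[-1.0, -1.0], [-4.0], [-2.0, -2.0], [-3.0]])
```

Higher bin indices correspond to more predictable text under the scoring model. A bias-channel head trained to predict these labels would then be pushed to encode predictability, which is the signal the paper associates with translationese.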
Reproducibility
Code Urls
- Anonymous code repository mentioned in the paper (no public URL provided)
Data Urls
- M-RewardBench (public benchmark)
- MM-Eval (public benchmark)
- RewardBench (public benchmark)
- BELEBELE, AYA, XL-SUM (public datasets referenced)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires additional encoder heads and proxy tasks; increases fine-tuning complexity and engineering cost.
- Assumes Gaussian-like latent statistics for cross-covariance surrogate; may degrade if assumption fails.
- Relies on back-translation to construct translationese negatives; quality and style of translator influence proxy task signals.
When Not To Use
- When you only evaluate monolingual English systems with no translation artifacts.
- On extremely small models where added projection heads and stochastic bottlenecks dominate capacity.
- If you cannot afford the extra training complexity or lack labeled multilingual reward pairs.
Failure Modes
- Over-compression (large β) can remove task-relevant semantics and harm accuracy.
- Proxy tasks may miss other spurious biases, letting new shortcuts persist.
- Cross-covariance penalty may underperform if latent distributions deviate strongly from Gaussian.
Core Entities
Models
- Qwen3-8B
- Qwen3-4B
- GPT-4o
- Gemini-2.5-Flash
- Nemotron-Multi-49B
- mR3-Qwen3-8B
- DIBJUDGE-Qwen3-8B
- DIBJUDGE-Qwen3-4B
Metrics
- Accuracy
- Bias Severity (S_bias)
- Cross-lingual Alignment Discrepancy (CAD)
- Sequence Surprisal Ratio (SSR)
- Language Alignment Score (LAS)
- Cross-lingual Sequence Surprisal (CSS)
Datasets
- M-RewardBench
- MM-Eval
- RewardBench
- BELEBELE
- AYA
- XL-SUM
- Skywork-Reward-Preference-80K
Benchmarks
- M-RewardBench (avg 23 langs)
- RewardBench (English)
- MM-Eval (avg 18 langs)
- Translationese bias suite (BELEBELE, AYA, XL-SUM)
Context Entities
Models
- Qwen2.5
- Qwen2.5-7B
- M-PROMETHEUS
- Think-as-Locals
- Gemma-3
- Llama-3
Metrics
- BLEU (for back-translation quality)
- Spearman rank correlation (for length bias)
Datasets
- XL-SUM
- BELEBELE (parallel reading comprehension)
Benchmarks
- RewardBench family (including M-RewardBench, MM-Eval)

