DIBJUDGE: fine-tune judges to separate true quality signals from translation artifacts

March 11, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

Why It Matters For Business

Automated multilingual evaluation can systematically over-score machine-translated outputs, which distorts product QA and model selection. DIBJUDGE reduces that bias while improving judge accuracy, lowering the risk of deploying models tuned to translation artifacts.

Summary TLDR

The paper identifies a systematic "translationese bias": LLM judges prefer machine-translated text over human-authored references, especially in low-resource languages. It introduces DIBJUDGE, a fine-tuning recipe that (1) splits representations into a compressed, judgment-critical channel and a dedicated bias channel, (2) trains on proxy tasks that capture translationese signals, and (3) penalizes cross-covariance between the two channels. Across multilingual reward benchmarks and a translationese test suite, DIBJUDGE raises accuracy and sharply reduces bias without hurting English performance.

Problem Statement

LLM-based evaluators often prefer machine-translated (translationese) text over human-authored references. This bias grows in low-resource languages and undermines multilingual evaluation. The problem: how to remove spurious signals (alignment to English and high model predictability) while keeping judgment utility.

Main Contribution

Characterize 'translationese bias' in LLM judges and link it to two spurious factors: latent alignment to English and cross-lingual predictability.

Propose DIBJUDGE: a disentangled variational information-bottleneck fine-tuning method that (a) compresses a robust channel, (b) routes spurious signals into a bias channel, and (c) penalizes channel dependence via a cross-covariance term.

Introduce two proxy tasks—cross-lingual alignment contrastive learning and log-probability bin classification—to explicitly capture translationese signals during training.

Show consistent empirical gains: improved multilingual reward-model accuracy and large reductions in measured translationese bias on multiple benchmarks.
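
The two regularizers named in the method description, the KL bottleneck on the robust channel and the cross-covariance penalty between channels, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the squared-Frobenius form of the penalty, the standard-normal prior, the function names, and the weights `beta` and `lam` are all hypothetical, and the proxy-task losses are left out.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def cross_covariance_penalty(z_robust, z_bias):
    """Squared Frobenius norm of the batch cross-covariance between channels.

    z_robust: list of N vectors (robust channel).
    z_bias:   list of N vectors (bias channel).
    The exact surrogate used in the paper may differ; this form is an assumption.
    """
    n = len(z_robust)
    d_r, d_b = len(z_robust[0]), len(z_bias[0])
    mean_r = [sum(z[i] for z in z_robust) / n for i in range(d_r)]
    mean_b = [sum(z[j] for z in z_bias) / n for j in range(d_b)]
    penalty = 0.0
    for i in range(d_r):
        for j in range(d_b):
            cov_ij = sum(
                (zr[i] - mean_r[i]) * (zb[j] - mean_b[j])
                for zr, zb in zip(z_robust, z_bias)
            ) / n
            penalty += cov_ij ** 2
    return penalty

def total_loss(task_loss, mu, log_var, z_robust, z_bias, beta=0.01, lam=1.0):
    """Hypothetical combination: task loss + beta * mean KL + lam * penalty."""
    kl = sum(kl_to_standard_normal(m, lv) for m, lv in zip(mu, log_var)) / len(mu)
    return task_loss + beta * kl + lam * cross_covariance_penalty(z_robust, z_bias)
```

The penalty is zero when the bias channel carries no linear information about the robust channel within the batch, and grows as the two co-vary. A larger `beta` compresses the robust channel harder.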

Key Findings

DIBJUDGE improves multilingual reward-model accuracy and sets a new open-weight SOTA on m-RewardBench.

Numbers: m-RewardBench: DIBJUDGE-Qwen3-8B 91.37 ±0.22 vs Qwen3-8B 86.12 ±0.52 (Table 1)

DIBJUDGE substantially reduces translationese bias across datasets and resource tiers.

Numbers: Average bias reductions reported: 80% (BELEBELE), 56% (AYA), 75% (XL-SUM) relative to vanilla SFT (Figure 4)

Latent disentanglement is verifiable: the bias channel encodes translationese while the robust channel is invariant.

Numbers: Linear probes: bias rep accuracy 96.1% vs robust rep 50.3% (Table 21)
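
A linear probe of the kind cited above fits a linear classifier on frozen representations and reports held-out accuracy. Accuracy near 50% on a balanced binary label suggests the label is not linearly recoverable. The sketch below uses synthetic vectors as a stand-in for the two channels; the data generator, dimensions, and hyperparameters are assumptions for illustration and do not reproduce the paper's setup or numbers.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression on frozen features with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, features, labels):
    """Share of examples where the probe's sign matches the label."""
    hits = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
        for x, y in zip(features, labels)
    )
    return hits / len(labels)

def make_split(rng, n, leaks_label):
    """Synthetic stand-in for latent vectors; label 1 = translationese."""
    feats, labels = [], []
    for k in range(n):
        y = k % 2
        shift = (2.0 if y else -2.0) if leaks_label else 0.0
        feats.append([rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)])
        labels.append(y)
    return feats, labels

rng = random.Random(0)
accuracies = {}
for name, leaks in [("bias", True), ("robust", False)]:
    x_train, y_train = make_split(rng, 400, leaks)
    x_test, y_test = make_split(rng, 200, leaks)
    w, b = train_linear_probe(x_train, y_train)
    accuracies[name] = probe_accuracy(w, b, x_test, y_test)
print(accuracies)
```

In a real check, the synthetic splits would be replaced by representations extracted from each channel, with human versus translated text as the label.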

Results

Accuracy (m-RewardBench)

Value: 91.37 ± 0.22

Baseline: Qwen3-8B 86.12 ± 0.52

Accuracy

Value: 91.01 ± 0.20

Baseline: Qwen3-8B 88.81 ± 0.48

Accuracy

Value: 87.53 ± 0.28

Baseline: Qwen3-8B 82.20 ± 0.60

Average translationese bias reduction

Value: BELEBELE 80%; AYA 56%; XL-SUM 75%

Baseline: Vanilla SFT (reference)

Latent disentanglement (linear probe)

Value: Bias rep probe 96.1% / Robust rep probe 50.3%

Baseline: Baseline SFT embedding probe 82.4%

Who Should Care

What To Try In 7 Days

Run pairwise judge checks on your multilingual eval sets and compute Bias Severity (S_bias) between human and back-translated candidates.

Measure CAD and SSR per language to see if judge preferences correlate with English alignment or predictability.

Prototype a LoRA-based fine-tune that adds a small bias encoder, a robust encoder with a KL bottleneck, and a cross-covariance penalty on a subset of languages.
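
The first step above, a pairwise judge check, can be prototyped in a few lines. The sketch below computes a per-language rate at which a judge prefers the back-translated candidate over the human one. This rate is only a hypothetical stand-in for S_bias, whose exact definition is not given here. The `judge` callable, the record keys, and the toy judge are assumptions for illustration.

```python
from collections import defaultdict

def translationese_preference_rate(pairs, judge):
    """Per-language share of pairs where the judge picks the translated text.

    pairs: iterable of dicts with keys 'lang', 'prompt', 'human', 'translated'.
    judge: hypothetical callable (prompt, answer_a, answer_b) -> 'A' or 'B'.
    Each pair is scored in both orders so position bias cancels out.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for p in pairs:
        first = judge(p["prompt"], p["human"], p["translated"])   # translated is B
        second = judge(p["prompt"], p["translated"], p["human"])  # translated is A
        wins[p["lang"]] += 0.5 * (first == "B") + 0.5 * (second == "A")
        totals[p["lang"]] += 1
    return {lang: wins[lang] / totals[lang] for lang in totals}

# Toy judge for demonstration only: always prefers the longer answer.
def toy_judge(prompt, answer_a, answer_b):
    return "A" if len(answer_a) >= len(answer_b) else "B"

pairs = [
    {"lang": "sw", "prompt": "q1", "human": "short", "translated": "a longer answer"},
    {"lang": "sw", "prompt": "q2", "human": "a much longer human answer", "translated": "brief"},
    {"lang": "de", "prompt": "q3", "human": "kurz", "translated": "deutlich laenger"},
]
rates = translationese_preference_rate(pairs, toy_judge)
print(rates)  # {'sw': 0.5, 'de': 1.0}
```

A rate well above 0.5 for a language suggests the judge favors translated text there. Swapping `toy_judge` for a call to the real judge model is the only change needed to run this on an actual eval set.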

Agent Features

Tool Use

  • LoRA

Optimization Features

Infra Optimization

  • 8 × NVIDIA H20 GPUs, single-node training

Model Optimization

  • LoRA

System Optimization

  • DeepSpeed ZeRO Stage 3 with CPU offload
  • FlashAttention-2 used for training speed
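
For readers reproducing the system setup, a DeepSpeed configuration for ZeRO Stage 3 with CPU offload might look like the sketch below, written as a Python dict and serialized to JSON. Only the ZeRO stage and the use of CPU offload come from the summary above. The precision and batch settings are placeholder assumptions, and whether the authors offload optimizer state, parameters, or both is not stated.

```python
import json

# Sketch of a DeepSpeed config for ZeRO Stage 3 with CPU offload.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Offloading both optimizer state and parameters is an assumption.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    # Assumptions, not from the source: precision and batch settings.
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

# Serialize so the config can be written to a file for the training launcher.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

FlashAttention-2 is typically enabled on the model side rather than in this file, so it does not appear in the config.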

Training Optimization

  • Variational Information Bottleneck (KL regularizer)
  • Cross-covariance penalty for disentanglement
  • Proxy tasks: contrastive CLA and log-probability bin classification
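
The two proxy tasks listed above can be illustrated with generic building blocks. The sketch below shows quantile binning of per-sequence mean log-probabilities into class labels, and an InfoNCE-style contrastive loss over paired embeddings. The bin count, the quantile scheme, the temperature, the pairing of embeddings, and the function names are assumptions; the paper's exact formulations may differ.

```python
import bisect
import math

def logprob_bin_labels(mean_logprobs, num_bins=4):
    """Turn per-sequence mean log-probabilities into quantile-bin class labels."""
    ranked = sorted(mean_logprobs)
    # Interior quantile edges; bin count and quantile binning are assumptions.
    edges = [ranked[len(ranked) * k // num_bins] for k in range(1, num_bins)]
    return [bisect.bisect_right(edges, lp) for lp in mean_logprobs]

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: anchors[i] should match positives[i], not positives[j]."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(anchors)

labels = logprob_bin_labels([-3.2, -0.4, -1.1, -2.5, -0.9, -1.8, -0.2, -2.9])
print(labels)  # [0, 3, 2, 1, 2, 1, 3, 0]

embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(embeddings, embeddings)
misaligned_loss = info_nce(embeddings, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy run the aligned pairing yields a loss close to zero and the swapped pairing a much larger one, which is the signal a contrastive alignment head would be trained on.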

Reproducibility

Code Urls

  • anonymous code repo mentioned in paper (no public URL provided)

Data Urls

  • M-RewardBench (public benchmark)
  • MM-Eval (public benchmark)
  • RewardBench (public benchmark)
  • BELEBELE, AYA, XL-SUM (public datasets referenced)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires additional encoder heads and proxy tasks; increases fine-tuning complexity and engineering cost.
  • Assumes Gaussian-like latent statistics for cross-covariance surrogate; may degrade if assumption fails.
  • Relies on back-translation to construct translationese negatives; quality and style of translator influence proxy task signals.

When Not To Use

  • When you only evaluate monolingual English systems with no translation artifacts.
  • On extremely small models where added projection heads and stochastic bottlenecks dominate capacity.
  • If you cannot afford the extra training complexity or lack labeled multilingual reward pairs.

Failure Modes

  • Over-compression (large β) can remove task-relevant semantics and harm accuracy.
  • Proxy tasks may miss other spurious biases, letting new shortcuts persist.
  • Cross-covariance penalty may underperform if latent distributions deviate strongly from Gaussian.

Core Entities

Models

  • Qwen3-8B
  • Qwen3-4B
  • GPT-4o
  • Gemini-2.5-Flash
  • Nemotron-Multi-49B
  • mR3-Qwen3-8B
  • DIBJUDGE-Qwen3-8B
  • DIBJUDGE-Qwen3-4B

Metrics

  • Accuracy
  • Bias Severity (S_bias)
  • Cross-lingual Alignment Discrepancy (CAD)
  • Sequence Surprisal Ratio (SSR)
  • Language Alignment Score (LAS)
  • Cross-lingual Sequence Surprisal (CSS)

Datasets

  • M-RewardBench
  • MM-Eval
  • RewardBench
  • BELEBELE
  • AYA
  • XL-SUM
  • Skywork-Reward-Preference-80K

Benchmarks

  • m-RewardBench (avg 23 langs)
  • RewardBench (English)
  • MM-Eval (avg 18 langs)
  • Translationese bias suite (BELEBELE, AYA, XL-SUM)

Context Entities

Models

  • Qwen2.5
  • Qwen2.5-7B
  • M-PROMETHEUS
  • Think-as-Locals
  • Gemma-3
  • Llama-3

Metrics

  • BLEU (for back-translation quality)
  • Spearman rank correlation (for length bias)

Datasets

  • XL-Sum
  • BELEBELE (parallel reading comprehension)

Benchmarks

  • RewardBench family (including M-RewardBench, MM-Eval)