DIBJUDGE: fine-tune judges to separate true quality signals from translation artifacts

March 11, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method shows clear experimental gains, backed by ablations. It requires extra training components (proxy tasks, KL terms) and modest infrastructure, so apply it to evaluation pipelines where multilingual fairness matters.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

Why It Matters For Business

Automated multilingual evaluation can systematically over-score machine-translated outputs, which harms product QA and model selection. DIBJUDGE cuts that bias while improving accuracy, lowering the risk of deploying models tuned to translation artifacts.

Summary TLDR

The paper identifies a systematic "translationese bias" where LLM judges prefer machine-translated text over human-authored references—especially in low-resource languages. It introduces DIBJUDGE, a fine-tuning recipe that (1) splits representations into a compressed, judgment-critical channel and a dedicated bias channel, (2) trains proxy tasks to capture translationese signals, and (3) penalizes cross-covariance between channels. Across multilingual reward benchmarks and a translationese test suite, DIBJUDGE raises accuracy and sharply reduces bias without hurting English performance.

Problem Statement

LLM-based evaluators often prefer machine-translated (translationese) text over human-authored references. This bias grows in low-resource languages and undermines multilingual evaluation. The problem: how to remove spurious signals (alignment to English and high model predictability) while keeping judgment utility.

Main Contribution

Characterize 'translationese bias' in LLM judges and link it to two spurious factors: latent alignment to English and cross-lingual predictability.

Propose DIBJUDGE: a disentangled variational information-bottleneck fine-tuning method that (a) compresses a robust channel, (b) routes spurious signals into a bias channel, and (c) penalizes channel dependence via a cross-covariance term.
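
As a rough illustration of the regularizers named in (a) and (c), the sketch below computes a KL term for a diagonal-Gaussian latent against a standard normal prior, and a squared Frobenius norm of the batch cross-covariance between the robust and bias channels. The function names, the standard normal prior, the Frobenius-norm form of the penalty, and the weights `beta`, `lam`, and `gamma` (with their placeholder defaults) are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # one row per example in the batch


def kl_to_standard_normal(mu: Vector, logvar: Vector) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a single example."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def cross_covariance_penalty(z_robust: Matrix, z_bias: Matrix) -> float:
    """Squared Frobenius norm of the batch cross-covariance matrix.

    Assumed surrogate for channel dependence: it is zero when every
    robust dimension is linearly uncorrelated with every bias dimension.
    """
    n = len(z_robust)
    d_r, d_b = len(z_robust[0]), len(z_bias[0])
    mean_r = [sum(row[i] for row in z_robust) / n for i in range(d_r)]
    mean_b = [sum(row[j] for row in z_bias) / n for j in range(d_b)]
    penalty = 0.0
    for i in range(d_r):
        for j in range(d_b):
            cov = sum(
                (z_robust[k][i] - mean_r[i]) * (z_bias[k][j] - mean_b[j])
                for k in range(n)
            ) / n
            penalty += cov * cov
    return penalty


def total_loss(
    task_loss: float,
    kl: float,
    cross_cov: float,
    proxy_loss: float,
    beta: float = 0.01,   # hypothetical bottleneck weight
    lam: float = 0.1,     # hypothetical cross-covariance weight
    gamma: float = 1.0,   # hypothetical proxy-task weight
) -> float:
    """Weighted sum of the judgment loss and the three auxiliary terms."""
    return task_loss + beta * kl + lam * cross_cov + gamma * proxy_loss
```

In a real fine-tune these terms would be computed on tensors inside the training loop; plain lists are used here only to keep the arithmetic visible.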

Key Findings

DIBJUDGE improves multilingual reward-model accuracy and sets a new open-weight SOTA on m-RewardBench.

Numbers: on m-RewardBench, DIBJUDGE-Qwen3-8B scores 91.37 ± 0.22 vs 86.12 ± 0.52 for Qwen3-8B (Table 1)

Practical Use: If you fine-tune judges with the DIB objective, expect measurable accuracy gains on multilingual reward tasks versus the same base model fine-tuned normally.

Evidence Ref: Table 1

DIBJUDGE substantially reduces translationese bias across datasets and resource tiers.

Numbers: average bias reductions relative to vanilla SFT are 80% (BELEBELE), 56% (AYA), and 75% (XL-SUM) (Figure 4)

Practical Use: Applying DIBJUDGE will lower the rate at which your judge wrongly favors machine-translated text, which is especially useful when evaluating low-resource languages.

Evidence Ref: Figure 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 91.37 ± 0.22 | Qwen3-8B 86.12 ± 0.52 | +5.25 pts | m-RewardBench | Table 1: DIBJUDGE-Qwen3-8B vs Qwen3-8B | Table 1
Accuracy | 91.01 ± 0.20 | Qwen3-8B 88.81 ± 0.48 | +2.20 pts | RewardBench (English) | Table 1: DIBJUDGE-Qwen3-8B | Table 1

What To Try In 7 Days

Run pairwise judge checks on your multilingual eval sets and compute Bias Severity (S_bias) between human and back-translated candidates.

Measure CAD and SSR per language to see if judge preferences correlate with English alignment or predictability.

Prototype a LoRA-based fine-tune that adds a small bias encoder, a robust encoder with a KL bottleneck, and a cross-covariance penalty on a subset of languages.

Agent Features

Tool Use
LoRA

Optimization Features

Infra Optimization
8 × NVIDIA H20 GPUs, single-node training
Model Optimization
LoRA
System Optimization
DeepSpeed ZeRO Stage 3 with CPU offload; FlashAttention-2 used for training speed
Training Optimization
Variational Information Bottleneck (KL regularizer); Cross-covariance penalty for disentanglement; Proxy tasks: contrastive CLA and log-probability bin classification
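
One way labels for a log-probability bin classification proxy task might be built is sketched below. The source names the task but not its construction, so everything here is an assumption: sequences are scored by mean per-token log-probability, bin edges are equal-frequency quantiles, and the helper names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability of one sequence."""
    if not token_logprobs:
        raise ValueError("sequence has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def quantile_edges(scores: Sequence[float], n_bins: int) -> List[float]:
    """Interior edges so each bin holds roughly the same number of scores."""
    if n_bins < 2 or len(scores) < n_bins:
        raise ValueError("need n_bins >= 2 and at least n_bins scores")
    ordered = sorted(scores)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def bin_label(score: float, edges: Sequence[float]) -> int:
    """Class index in [0, len(edges)]; a score equal to an edge goes up."""
    return bisect_right(edges, score)


# Toy usage with made-up log-probabilities for four sequences.
scores = [mean_logprob(s) for s in ([-4.0, -4.0], [-3.0], [-2.5, -1.5], [-1.0])]
edges = quantile_edges(scores, n_bins=2)
labels = [bin_label(s, edges) for s in scores]
```

Quantile edges keep the classes balanced, which tends to make a small classification head easier to train; fixed-width bins would be an equally plausible choice.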

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

anonymous code repo mentioned in paper (no public URL provided)

Data URLs

M-RewardBench (public benchmark); MM-Eval (public benchmark); RewardBench (public benchmark); BELEBELE, AYA, XL-SUM (public datasets referenced)

Risks & Boundaries

Limitations

Requires additional encoder heads and proxy tasks; increases fine-tuning complexity and engineering cost.

Assumes Gaussian-like latent statistics for the cross-covariance surrogate, and may degrade if that assumption does not hold.

When Not To Use

When you only evaluate monolingual English systems with no translation artifacts.

On extremely small models where added projection heads and stochastic bottlenecks dominate capacity.

Failure Modes

Over-compression (large β) can remove task-relevant semantics and harm accuracy.

Proxy tasks may miss other spurious biases, letting new shortcuts persist.

Core Entities

Models

Qwen3-8B; Qwen3-4B; GPT-4o; Gemini-2.5-Flash; Nemotron-Multi-49B; mR3-Qwen3-8B; DIBJUDGE-Qwen3-8B; DIBJUDGE-Qwen3-4B

Metrics

Accuracy; Bias Severity (S_bias); Cross-lingual Alignment Discrepancy (CAD); Sequence Surprisal Ratio (SSR); Language Alignment Score (LAS); Cross-lingual Sequence Surprisal (CSS)

Datasets

M-RewardBench; MM-Eval; RewardBench; BELEBELE; AYA; XL-SUM; Skywork-Reward-Preference-80K

Benchmarks

m-RewardBench (avg 23 langs); RewardBench (English); MM-Eval (avg 18 langs); Translationese bias suite (BELEBELE, AYA, XL-SUM)

Context Entities

Models

Qwen2.5; Qwen2.5-7B; M-PROMETHEUS; Think-as-Locals; Gemma-3; Llama-3

Metrics

BLEU (for back-translation quality); Spearman rank correlation (for length bias)

Datasets

XL-Sum; BELEBELE (parallel reading comprehension)

Benchmarks

RewardBench family (including M-RewardBench, MM-Eval)