DIBJUDGE: fine-tune judges to separate true quality signals from translation artifacts

Overview

Decision Snapshot: Ready For Pilot

The paper reports clear experimental gains backed by ablations. The method needs extra training components (proxy tasks, KL terms) and modest infrastructure, so apply it to evaluation pipelines where multilingual fairness matters.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated multilingual evaluation can systematically over-score machine-translated outputs, which harms product QA and model selection. DIBJUDGE cuts that bias while improving accuracy, lowering the risk of deploying models tuned to translation artifacts.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The paper identifies a systematic "translationese bias" where LLM judges prefer machine-translated text over human-authored references—especially in low-resource languages. It introduces DIBJUDGE, a fine-tuning recipe that (1) splits representations into a compressed, judgment-critical channel and a dedicated bias channel, (2) trains proxy tasks to capture translationese signals, and (3) penalizes cross-covariance between channels. Across multilingual reward benchmarks and a translationese test suite, DIBJUDGE raises accuracy and sharply reduces bias without hurting English performance.

Problem Statement

LLM-based evaluators often prefer machine-translated (translationese) text over human-authored references. This bias grows in low-resource languages and undermines multilingual evaluation. The problem: how to remove spurious signals (alignment to English and high model predictability) while keeping judgment utility.

Main Contribution

Characterize 'translationese bias' in LLM judges and link it to two spurious factors: latent alignment to English and cross-lingual predictability.

Propose DIBJUDGE: a disentangled variational information-bottleneck fine-tuning method that (a) compresses a robust channel, (b) routes spurious signals into a bias channel, and (c) penalizes channel dependence via a cross-covariance term.
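The two regularizers named above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a diagonal-Gaussian robust channel with a standard-normal prior, a squared-Frobenius-norm cross-covariance penalty, and a simple additive combination. The function names, the weights `beta` and `lam`, and the toy tensor shapes are all hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Batch mean of KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    kl_per_example = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    return float(np.mean(kl_per_example))

def cross_covariance_penalty(z_robust, z_bias):
    """Squared Frobenius norm of the batch cross-covariance between channels."""
    zr = z_robust - z_robust.mean(axis=0, keepdims=True)
    zb = z_bias - z_bias.mean(axis=0, keepdims=True)
    cov = zr.T @ zb / (zr.shape[0] - 1)  # shape: (d_robust, d_bias)
    return float(np.sum(cov**2))

def dib_style_loss(task_loss, proxy_loss, mu, logvar, z_robust, z_bias,
                   beta=1e-3, lam=1e-2):
    """Assumed additive form: task + proxy + beta * KL + lam * cross-covariance."""
    kl = kl_to_standard_normal(mu, logvar)
    cc = cross_covariance_penalty(z_robust, z_bias)
    return task_loss + proxy_loss + beta * kl + lam * cc

# Toy batch: 64 examples, 16-dim robust channel, 8-dim bias channel (made-up sizes).
rng = np.random.default_rng(0)
mu = 0.1 * rng.normal(size=(64, 16))
logvar = np.full((64, 16), -1.0)
eps = rng.normal(size=mu.shape)
z_robust = mu + np.exp(0.5 * logvar) * eps  # reparameterized sample
z_bias = rng.normal(size=(64, 8))

# task_loss and proxy_loss are placeholder scalars standing in for real losses.
total = dib_style_loss(0.69, 0.40, mu, logvar, z_robust, z_bias)
print(f"total loss on toy batch: {total:.4f}")
```

In a real fine-tune these terms would be computed on encoder-head outputs inside an autodiff framework; the NumPy version only shows what each term measures.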

Key Findings

DIBJUDGE improves multilingual reward-model accuracy and sets a new open-weight SOTA on m-RewardBench.

Numbers: m-RewardBench: DIBJUDGE-Qwen3-8B 91.37 ±0.22 vs Qwen3-8B 86.12 ±0.52 (Table 1)

Practical Use: If you fine-tune judges with the DIB objective, expect measurable accuracy gains on multilingual reward tasks versus the same base model fine-tuned normally.

Evidence Ref: Table 1

DIBJUDGE substantially reduces translationese bias across datasets and resource tiers.

Numbers: Average bias reductions reported: 80% (BELEBELE), 56% (AYA), 75% (XL-SUM) relative to vanilla SFT (Figure 4)

Practical Use: Applying DIBJUDGE will lower the rate at which your judge wrongly favors machine-translated text, which is especially useful when evaluating low-resource languages.

Evidence Ref: Figure 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	91.37 ± 0.22	Qwen3-8B 86.12 ± 0.52	+5.25 pts	m-RewardBench	Table 1: DIBJUDGE-Qwen3-8B vs Qwen3-8B	Table 1
Accuracy	91.01 ± 0.20	Qwen3-8B 88.81 ± 0.48	+2.20 pts	RewardBench (English)	Table 1: DIBJUDGE-Qwen3-8B	Table 1
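The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below recomputes each delta and, as a rough sanity check, the gap between the lower edge of the DIBJUDGE interval and the upper edge of the baseline interval. The summary does not say what the ± values denote (standard deviation, confidence interval, or something else), so the gap is only an informal indicator, and the helper name `delta_and_gap` is hypothetical.

```python
# (mean, plus_minus) pairs copied from the Results table.
results = {
    "m-RewardBench": {"dibjudge": (91.37, 0.22), "baseline": (86.12, 0.52)},
    "RewardBench (English)": {"dibjudge": (91.01, 0.20), "baseline": (88.81, 0.48)},
}

def delta_and_gap(dib, base):
    """Return (mean delta, gap between the two +/- intervals), rounded to 2 places."""
    delta = dib[0] - base[0]
    gap = (dib[0] - dib[1]) - (base[0] + base[1])
    return round(delta, 2), round(gap, 2)

summary = {name: delta_and_gap(r["dibjudge"], r["baseline"])
           for name, r in results.items()}

for name, (delta, gap) in summary.items():
    print(f"{name}: delta = +{delta:.2f} pts, interval gap = {gap:.2f} pts")
```

A positive gap means the two reported ranges do not overlap, which is consistent with the table's deltas but is not a substitute for a proper significance test.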

What To Try In 7 Days

Run pairwise judge checks on your multilingual eval sets and compute Bias Severity (S_bias) between human and back-translated candidates.

Measure CAD and SSR per language to see if judge preferences correlate with English alignment or predictability.

Prototype a LoRA-based fine-tune that adds a small bias encoder, a robust encoder with a KL bottleneck, and a cross-covariance penalty on a subset of languages.
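The first two steps above are measurement tasks and can be prototyped without any training. The sketch below is a stand-in under stated assumptions: the summary does not give the formulas for S_bias, CAD, or SSR, so `translationese_preference_rate` uses a simple proxy (the share of human-vs-translated pairs where the judge picks the translated text), and `spearman` correlates that rate with any per-language score you supply. All language codes and numbers are made-up toy values.

```python
from collections import defaultdict

def translationese_preference_rate(judgments):
    """judgments: iterable of (language, winner), winner in {"human", "translated"}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for lang, winner in judgments:
        totals[lang] += 1
        if winner == "translated":
            wins[lang] += 1
    return {lang: wins[lang] / totals[lang] for lang in totals}

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Toy pairwise judge outcomes (hypothetical data).
judgments = [
    ("sw", "translated"), ("sw", "translated"), ("sw", "human"),
    ("hi", "translated"), ("hi", "human"), ("hi", "translated"), ("hi", "human"),
    ("de", "human"), ("de", "translated"), ("de", "human"), ("de", "human"),
]
rates = translationese_preference_rate(judgments)

# Hypothetical per-language scores standing in for a metric such as CAD or SSR.
metric = {"sw": 0.9, "hi": 0.6, "de": 0.2}
langs = sorted(rates)
rho = spearman([metric[l] for l in langs], [rates[l] for l in langs])
print(rates, rho)
```

With real data, a strongly positive correlation would suggest the judge's preference for translated text tracks the chosen metric, which is the signal the second step is looking for.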

Agent Features

Tool Use

LoRA

Optimization Features

Infra Optimization

8 × NVIDIA H20 GPUs, single-node training

Model Optimization

LoRA

System Optimization

DeepSpeed ZeRO Stage 3 with CPU offload

FlashAttention-2 used for training speed

Training Optimization

Variational Information Bottleneck (KL regularizer)

Cross-covariance penalty for disentanglement

Proxy tasks: contrastive CLA and log-probability bin classification
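For the log-probability bin classification proxy task, one plausible way to build targets is to bucket a per-sequence log-probability into discrete classes. The sketch below assumes equal-frequency (quantile) bins and four classes; the summary specifies neither the binning scheme nor the bin count, and the helper names and toy values are hypothetical.

```python
import numpy as np

def fit_bin_edges(logprobs, n_bins=4):
    """Interior quantile edges so each bin holds roughly the same number of examples."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(logprobs, quantiles)

def assign_bins(logprobs, edges):
    """Map each log-probability to a class index in [0, len(edges)]."""
    return np.digitize(logprobs, edges)

# Toy average token log-probabilities for eight sequences (made-up values).
logprobs = np.array([-4.0, -3.0, -2.0, -1.0, -0.5, -0.25, -0.1, -0.05])
edges = fit_bin_edges(logprobs)
labels = assign_bins(logprobs, edges)
print(edges, labels)
```

The resulting integer labels could then serve as targets for a small classification head on the bias channel; fit the edges on training data only so that evaluation examples are binned consistently.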

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

anonymous code repo mentioned in paper (no public URL provided)

Data URLs

M-RewardBench (public benchmark)

MM-Eval (public benchmark)

RewardBench (public benchmark)

BELEBELE, AYA, XL-SUM (public datasets referenced)

Risks & Boundaries

Limitations

Requires additional encoder heads and proxy tasks; increases fine-tuning complexity and engineering cost.

Assumes Gaussian-like latent statistics for cross-covariance surrogate; may degrade if assumption fails.
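The Gaussian assumption matters because zero cross-covariance implies independence only for jointly Gaussian variables. The toy example below, which is an illustration and not taken from the paper, builds a "bias" variable that is a deterministic function of the "robust" variable yet has near-zero cross-covariance with it, so a covariance-only penalty would not detect the leak. Variable names are hypothetical.

```python
import numpy as np

def cross_cov(a, b):
    """Sample cross-covariance between two 1-D arrays."""
    return float(np.mean((a - a.mean()) * (b - b.mean())))

rng = np.random.default_rng(0)
z_robust = rng.normal(size=100_000)
z_bias = z_robust**2 - 1.0  # fully determined by z_robust, but not linearly

linear = cross_cov(z_robust, z_bias)
nonlinear = float(np.corrcoef(z_robust**2, z_bias)[0, 1])

print(f"cross-covariance: {linear:.4f}")
print(f"correlation after squaring z_robust: {nonlinear:.4f}")
```

The first number stays close to zero while the second is essentially one, which is the kind of nonlinear dependence a second-order surrogate can miss when the latent statistics are far from Gaussian.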

When Not To Use

When you only evaluate monolingual English systems with no translation artifacts.

On extremely small models where added projection heads and stochastic bottlenecks dominate capacity.

Failure Modes

Over-compression (large β) can remove task-relevant semantics and harm accuracy.

Proxy tasks may miss other spurious biases, letting new shortcuts persist.

Core Entities

Models

Qwen3-8B, Qwen3-4B, GPT-4o, Gemini-2.5-Flash, Nemotron-Multi-49B, mR3-Qwen3-8B, DIBJUDGE-Qwen3-8B, DIBJUDGE-Qwen3-4B

Metrics

Accuracy, Bias Severity (S_bias), Cross-lingual Alignment Discrepancy (CAD), Sequence Surprisal Ratio (SSR), Language Alignment Score (LAS), Cross-lingual Sequence Surprisal (CSS)

Datasets

M-RewardBench, MM-Eval, RewardBench, BELEBELE, AYA, XL-SUM, Skywork-RewardPreference-80K

Benchmarks

m-RewardBench (avg 23 langs); RewardBench (English); MM-Eval (avg 18 langs); Translationese bias suite (BELEBELE, AYA, XL-SUM)

Context Entities

Models

Qwen2.5, Qwen2.5-7B, M-PROMETHEUS, Think-as-Locals, Gemma-3, Llama-3

Metrics

BLEU (for back-translation quality), Spearman rank correlation (for length bias)

Datasets

XL-Sum, BELEBELE (parallel reading comprehension)

Benchmarks

RewardBench family (including M-RewardBench, MM-Eval)

