xTower: an LLM that explains translation errors and suggests fixes

June 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Human evaluation and automatic metrics both support the reported improvements, but the gains depend on span quality and on the quality of the original translation.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Marcos Treviso, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, José Pombal, Tania Vaz, Helena Wu, Beatriz Silva, Daan van Stigt, André F. T. Martins

Links

Abstract / PDF / Code / Data

Why It Matters For Business

xTower turns span-level error tags into human-readable explanations and targeted corrections, improving automated editing accuracy and saving post-editing time when integrated into MT QA pipelines.

Summary TLDR

xTower is a 13B multilingual LLM finetuned to produce free-text explanations for marked translation error spans and to use those explanations to generate corrected translations. Human raters find explanations mostly related (≈4.3/6 for human spans) and helpful for understanding errors (≈4.5/6). Prompting xTower with error spans and explanations raises automatic quality metrics (COMET) by ~1–3 points and fixes ~80–84% of highlighted spans; a hybrid rule that only applies xTower to low-quality originals further improves results.

Problem Statement

MT systems still produce meaningful errors. Existing automatic metrics highlight bad spans but usually don't explain them in human language or use those explanations to fix translations. Practitioners need a single tool that explains span-level errors and helps produce corrected translations without always requiring a reference.

Main Contribution

A distilled, finetuned 13B multilingual LLM (xTower) that generates free-text explanations for annotated error spans and outputs corrected translations.

Large-scale distillation dataset using GPT-4 over WMT MQM-annotated samples (33k samples, 63k human spans) and combined MT data for finetuning.
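
To make the input side concrete, here is a minimal sketch of how annotated error spans might be marked inline and packed into a prompt for an explanation-and-correction model. The `ErrorSpan` type, the `mark_spans` and `build_prompt` helpers, the `<errorN>` tag format, and the prompt wording are all illustrative assumptions; this summary does not specify xTower's actual prompt template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    start: int      # character offset where the error begins (inclusive)
    end: int        # character offset where the error ends (exclusive)
    severity: str   # MQM-style label such as "minor" or "major"


def mark_spans(translation: str, spans: list[ErrorSpan]) -> str:
    """Wrap each non-overlapping span in numbered <errorN> tags (hypothetical format)."""
    pieces, cursor = [], 0
    for i, span in enumerate(sorted(spans, key=lambda s: s.start), start=1):
        # Reject empty, out-of-range, or overlapping spans instead of guessing.
        if span.start < cursor or span.start >= span.end or span.end > len(translation):
            raise ValueError(f"invalid or overlapping span: {span}")
        pieces.append(translation[cursor:span.start])
        pieces.append(f'<error{i} severity="{span.severity}">')
        pieces.append(translation[span.start:span.end])
        pieces.append(f"</error{i}>")
        cursor = span.end
    pieces.append(translation[cursor:])
    return "".join(pieces)


def build_prompt(source: str, translation: str, spans: list[ErrorSpan]) -> str:
    """Assemble an illustrative prompt; the wording is an assumption, not the paper's."""
    return (
        f"Source: {source}\n"
        f"Translation: {mark_spans(translation, spans)}\n"
        "Explain each marked error, then provide a corrected translation."
    )
```

The same helper works whether the spans come from human MQM annotation or from an automatic detector such as XCOMET; only the source of the `ErrorSpan` list changes.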

Key Findings

Explanations are rated more related when spans are human-annotated than when predicted by an automatic detector.

Numbers: Relatedness (6-point): human spans ≈ 4.3, XCOMET spans ≈ 3.2

Practical Use: Prefer human or higher-quality span detectors when you need precise explanations; expect lower explanation quality if spans are noisy.

Evidence Ref: Table 2; §4.2

Annotators find explanations helpful for understanding errors but less decisive for writing corrected text.

Numbers: Helpfulness Q1 (error understanding) ≈ 4.5, Q2 (guidance) ≈ 3.3–3.9 (6-point)

Practical Use: Use xTower explanations to speed up diagnosis and triage. Do not expect fully automatic post-editing from explanations alone; human editors are still required for the final fix.

Evidence Ref: Table 3; §4.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
COMET (EN→DE, referenceless, XCOMET spans) | 81.3 (xTower) vs 78.4 (original) | Original MT | +2.9 | WMT23 MQM test (EN-DE) | Table 5, referenceless predicted spans | Table 5
COMET (HE→EN, referenceless, XCOMET spans) | 78.5 (xTower) vs 77.5 (original) | Original MT | +1.0 | WMT23 MQM test (HE-EN) | Table 5, referenceless predicted spans | Table 5

What To Try In 7 Days

Run XCOMET to mark spans and prompt xTower for explanations on a sample of poor-quality MT outputs.

Measure COMET/COMETKIWI before and after xTower corrections to estimate uplift and ROI.

Implement the hybrid rule: only call xTower when COMETKIWI < tuned threshold to cut costs.
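
The gating and measurement steps above can be sketched as follows. This is a minimal illustration, assuming you already have a quality-estimation scorer (for example a COMETKIWI wrapper) and a correction function (for example a call to xTower); both are passed in as plain callables. The names `hybrid_correct` and `mean_uplift`, the callable signatures, and the threshold are hypothetical, and the threshold has to be tuned on held-out data for your own score scale.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical signatures: (source, translation) -> score, and
# (source, translation) -> corrected translation.
Scorer = Callable[[str, str], float]
Corrector = Callable[[str, str], str]


def hybrid_correct(
    source: str,
    translation: str,
    qe_score: Scorer,
    correct: Corrector,
    threshold: float,
) -> tuple[str, bool]:
    """Call the corrector only when the original scores below the threshold.

    Returns the final translation and a flag saying whether it was rewritten.
    """
    if qe_score(source, translation) < threshold:
        return correct(source, translation), True
    # Leave already-strong translations untouched to avoid over-editing.
    return translation, False


def mean_uplift(
    sources: Sequence[str],
    originals: Sequence[str],
    finals: Sequence[str],
    qe_score: Scorer,
) -> float:
    """Average score after correction minus average score before."""
    before = mean(qe_score(s, t) for s, t in zip(sources, originals))
    after = mean(qe_score(s, t) for s, t in zip(sources, finals))
    return after - before
```

Tracking the fraction of segments where the flag is `True` alongside the uplift gives a rough cost-versus-quality picture, since only flagged segments incur a model call.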

Optimization Features

Training Optimization
Distillation from GPT-4

Mixed-prompt finetuning (zero-/few-shot)

Risks & Boundaries

Limitations

xTower depends on an external span detector (XCOMET); noisy spans reduce explanation quality and downstream gains.

Evaluation focuses on a few language pairs with MQM data; results may not generalize to all languages or domains.

When Not To Use

When you lack reliable span annotations or a robust span detector.

For high-quality original translations where edits risk degrading text (COMET > ~80).

Failure Modes

Over-editing good translations and lowering quality for already-strong outputs.

Producing plausible but incorrect explanations that mislead post-editors.

Core Entities

Models

xTower 13B, TOWERBASE 13B, TOWERINSTRUCT 13B, Mixtral 8x7B, GPT-3.5 Turbo, GPT-4, XCOMET-XL

Metrics

COMET, BLEURT, COMETKIWI, BLEU, chrF

Datasets

WMT 2022 MQM, WMT 2023 MQM, TOWERBLOCKS (Unbabel dataset)

Benchmarks

WMT Metrics shared task