xTower: an LLM that explains translation errors and suggests fixes

Overview

Decision Snapshot: Needs Validation

Human evaluation and automatic metrics both back improvements, but gains depend on span quality and original translation quality.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Marcos Treviso, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, José Pombal, Tania Vaz, Helena Wu, Beatriz Silva, Daan van Stigt, André F. T. Martins

Links

Abstract / PDF / Code / Data

Why It Matters For Business

xTower turns span-level error tags into human-readable explanations and targeted corrections, improving automated editing accuracy and saving post-editing time when integrated into MT QA pipelines.

Who Should Care

Product Manager, ML Engineer, Founder, CEO

Summary TLDR

xTower is a 13B multilingual LLM finetuned to produce free-text explanations for marked translation error spans and to use those explanations to generate corrected translations. Human raters find explanations mostly related (≈4.3/6 for human spans) and helpful for understanding errors (≈4.5/6). Prompting xTower with error spans and explanations raises automatic quality metrics (COMET) by ~1–3 points and fixes ~80–84% of highlighted spans; a hybrid rule that only applies xTower to low-quality originals further improves results.
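To make "prompting with marked error spans" concrete, here is a minimal sketch of how spans might be wrapped in inline tags before the text is sent to the model. The `ErrorSpan` class, the `<errorN severity="...">` tag shape, and the prompt wording are illustrative assumptions, not the format confirmed by the paper or the model card; check the model card for the exact template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    """Hypothetical container for one marked error span."""
    start: int     # character offset where the span begins (inclusive)
    end: int       # character offset where the span ends (exclusive)
    severity: str  # e.g. "minor" or "major"


def mark_error_spans(translation: str, spans: list[ErrorSpan]) -> str:
    """Wrap each non-overlapping span in a numbered tag (assumed tag format)."""
    pieces = []
    cursor = 0
    for index, span in enumerate(sorted(spans, key=lambda s: s.start), start=1):
        out_of_range = span.end > len(translation) or span.start >= span.end
        if span.start < cursor or out_of_range:
            raise ValueError(f"invalid or overlapping span: {span}")
        pieces.append(translation[cursor:span.start])
        pieces.append(
            f'<error{index} severity="{span.severity}">'
            f"{translation[span.start:span.end]}</error{index}>"
        )
        cursor = span.end
    pieces.append(translation[cursor:])
    return "".join(pieces)


def build_prompt(source: str, translation: str, spans: list[ErrorSpan]) -> str:
    """Assemble an illustrative prompt; the wording is an assumption."""
    marked = mark_error_spans(translation, spans)
    return (
        f"Source: {source}\n"
        f"Translation: {marked}\n"
        "Explain each marked error, then provide a corrected translation."
    )
```

The spans can come from human MQM annotation or from a detector such as XCOMET; the sketch only assumes they arrive as character offsets.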

Problem Statement

MT systems still produce meaningful errors. Existing automatic metrics highlight bad spans but usually don't explain them in human language or use those explanations to fix translations. Practitioners need a single tool that explains span-level errors and helps produce corrected translations without always requiring a reference.

Main Contribution

A distilled, finetuned 13B multilingual LLM (xTower) that generates free-text explanations for annotated error spans and outputs corrected translations.

Large-scale distillation dataset using GPT-4 over WMT MQM-annotated samples (33k samples, 63k human spans) and combined MT data for finetuning.

Key Findings

Explanations are rated more related when spans are human-annotated than when predicted by an automatic detector.

Numbers: Relatedness (6-point): human spans ≈ 4.3, XCOMET spans ≈ 3.2

Practical Use: Prefer human or higher-quality span detectors when you need precise explanations; expect lower explanation quality if spans are noisy.

Evidence Ref: Table 2; §4.2

Annotators find explanations helpful for understanding errors but less decisive for writing corrected text.

Numbers: Helpfulness Q1 (error understanding) ≈ 4.5, Q2 (guidance) ≈ 3.3–3.9 (6-point)

Practical Use: Use xTower explanations to speed diagnosis and triage. Do not expect fully automatic post-editing from explanations alone; human editors are still required for the final fix.

Evidence Ref: Table 3; §4.3

Results

Metric	xTower	Original MT (baseline)	Delta	Split / Dataset	Evidence Ref
COMET (EN→DE, referenceless, XCOMET spans)	81.3	78.4	+2.9	WMT23 MQM test (EN-DE)	Table 5 (referenceless, predicted spans)
COMET (HE→EN, referenceless, XCOMET spans)	78.5	77.5	+1.0	WMT23 MQM test (HE-EN)	Table 5 (referenceless, predicted spans)
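The deltas in the table are plain differences between the xTower score and the original MT score. The short sketch below recomputes them from the reported values; the `comet_delta` helper and the `rows` layout are illustrative names, not part of any released code.

```python
# COMET scores copied from the results table: (language pair, xTower, original MT).
rows = [
    ("EN-DE", 81.3, 78.4),
    ("HE-EN", 78.5, 77.5),
]


def comet_delta(corrected: float, original: float) -> float:
    """Return the COMET gain, rounded to one decimal as reported in the table."""
    return round(corrected - original, 1)


deltas = {pair: comet_delta(corrected, original) for pair, corrected, original in rows}

for pair, delta in deltas.items():
    print(f"{pair}: {delta:+.1f} COMET")
```

The same helper can be reused when measuring uplift on your own data, as long as both scores come from the same metric checkpoint.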

What To Try In 7 Days

Run XCOMET to mark spans and prompt xTower for explanations on a sample of poor-quality MT outputs.

Measure COMET/COMETKIWI before and after xTower corrections to estimate uplift and ROI.

Implement the hybrid rule: call xTower only when the COMETKIWI score falls below a tuned threshold, to cut costs.
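The hybrid rule from the last step can be sketched as a small gate around the correction call. Everything here is an assumption for illustration: `quality_fn` stands in for a reference-free scorer such as COMETKIWI, `correct_fn` stands in for a call to xTower, and `tune_threshold` is one simple way to pick the cutoff on a development set, not necessarily the procedure used in the paper.

```python
from typing import Callable, Sequence

# (quality-estimation score, quality of original, quality after correction)
DevExample = tuple[float, float, float]


def hybrid_correct(
    source: str,
    translation: str,
    quality_fn: Callable[[str, str], float],  # hypothetical QE scorer
    correct_fn: Callable[[str, str], str],    # hypothetical xTower wrapper
    threshold: float,
) -> tuple[str, bool]:
    """Correct only low-scoring translations; return (text, was_corrected)."""
    if quality_fn(source, translation) < threshold:
        return correct_fn(source, translation), True
    # Leave strong translations untouched to avoid over-editing and save a call.
    return translation, False


def tune_threshold(dev: Sequence[DevExample], candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the best mean final quality on dev data."""
    if not dev or not candidates:
        raise ValueError("dev and candidates must be non-empty")

    def mean_quality(threshold: float) -> float:
        total = sum(
            corrected if qe < threshold else original
            for qe, original, corrected in dev
        )
        return total / len(dev)

    return max(candidates, key=mean_quality)
```

Passing the scorer and the corrector as callables keeps the gate independent of any particular library, so either component can be swapped without touching the rule itself.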

Optimization Features

Training Optimization

distillation from GPT-4

mixed prompt finetuning (zero-/few-shot)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

http://huggingface.co/sardinelab/xTower13B

Data URLs

https://huggingface.co/datasets/Unbabel/

https://www.statmt.org/wmt22/ (WMT data)

https://www.statmt.org/wmt23/ (WMT data)

https://huggingface.co/Unbabel/XCOMET-XL

Risks & Boundaries

Limitations

xTower depends on an external span detector (XCOMET); noisy spans reduce explanation quality and downstream gains.

Evaluation focuses on a few language pairs with MQM data; results may not generalize to all languages or domains.

When Not To Use

When you lack reliable span annotations or a robust span detector.

For high-quality original translations where edits risk degrading text (COMET > ~80).

Failure Modes

Over-editing good translations and lowering quality for already-strong outputs.

Producing plausible but incorrect explanations that mislead post-editors.

Core Entities

Models

xTower 13B, TOWERBASE 13B, TOWERINSTRUCT 13B, Mixtral 8x7B, GPT-3.5 Turbo, GPT-4, XCOMET-XL

Metrics

COMET, BLEURT, COMETKIWI, BLEU, chrF

Datasets

WMT 2022 MQM, WMT 2023 MQM, TOWERBLOCKS (Unbabel dataset)

Benchmarks

WMT Metrics shared task
