Overview
Human evaluation and automatic metrics both support the reported improvements, but the gains depend on the quality of the error spans and of the original translation.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
xTower turns span-level error tags into human-readable explanations and targeted corrections, improving automated editing accuracy and saving post-editing time when integrated into MT QA pipelines.
Summary TLDR
xTower is a 13B multilingual LLM finetuned to produce free-text explanations for marked translation error spans and to use those explanations to generate corrected translations. Human raters find explanations mostly related (≈4.3/6 for human spans) and helpful for understanding errors (≈4.5/6). Prompting xTower with error spans and explanations raises automatic quality metrics (COMET) by ~1–3 points and fixes ~80–84% of highlighted spans; a hybrid rule that only applies xTower to low-quality originals further improves results.
Problem Statement
MT systems still produce meaningful errors. Existing automatic metrics highlight bad spans but usually don't explain them in human language or use those explanations to fix translations. Practitioners need a single tool that explains span-level errors and helps produce corrected translations without always requiring a reference.
Main Contribution
A distilled, finetuned 13B multilingual LLM (xTower) that generates free-text explanations for annotated error spans and outputs corrected translations.
Large-scale distillation dataset using GPT-4 over WMT MQM-annotated samples (33k samples, 63k human spans) and combined MT data for finetuning.
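To make "annotated error spans" concrete, here is a minimal sketch of how spans could be marked inline in a translation before it is passed to a model. The `<error>` tag, the `severity` attribute, and the `mark_error_spans` helper are hypothetical and chosen for illustration only; the actual input format used by xTower is not specified in this summary.

```python
from typing import List, Tuple


def mark_error_spans(translation: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, severity) character span in an <error> tag.

    The tag format is an assumption for illustration, not xTower's real format.
    Spans use Python slice semantics (end is exclusive) and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, severity in sorted(spans):
        if start < cursor or start >= end or end > len(translation):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        pieces.append(translation[cursor:start])  # untouched text before the span
        pieces.append(f'<error severity="{severity}">{translation[start:end]}</error>')
        cursor = end
    pieces.append(translation[cursor:])  # untouched text after the last span
    return "".join(pieces)


# Toy example: characters 4..8 ("Hund") are flagged as a major error.
marked = mark_error_spans("Der Hund sitzt auf der Matte.", [(4, 8, "major")])
print(marked)
```

The same helper works whether the spans come from human MQM annotation or from an automatic detector such as XCOMET, as long as they are converted to character offsets first.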
Key Findings
Explanations are rated more related when spans are human-annotated than when predicted by an automatic detector.
Annotators find explanations helpful for understanding errors but less decisive for writing corrected text.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| COMET (EN→DE, referenceless, XCOMET spans) | 81.3 (xTower) | 78.4 (original MT) | +2.9 | WMT23 MQM test (EN-DE) | Referenceless, predicted spans | Table 5 |
| COMET (HE→EN, referenceless, XCOMET spans) | 78.5 (xTower) | 77.5 (original MT) | +1.0 | WMT23 MQM test (HE-EN) | Referenceless, predicted spans | Table 5 |
What To Try In 7 Days
Run XCOMET to mark spans and prompt xTower for explanations on a sample of poor-quality MT outputs.
Measure COMET/COMETKIWI before and after xTower corrections to estimate uplift and ROI.
Implement the hybrid rule: call xTower only when the COMETKIWI score of the original translation falls below a tuned threshold, to cut costs.
Risks & Boundaries
Limitations
xTower depends on an external span detector (XCOMET); noisy spans reduce explanation quality and downstream gains.
Evaluation focuses on a few language pairs with MQM data; results may not generalize to all languages or domains.
When Not To Use
When you lack reliable span annotations or a robust span detector.
For high-quality original translations where edits risk degrading text (COMET > ~80).
Failure Modes
Over-editing good translations and lowering quality for already-strong outputs.
Producing plausible but incorrect explanations that mislead post-editors.

