Overview
The paper reports numeric gains across multiple benchmarks, includes ablations that isolate the benefit of parallel data, and releases models and datasets. Its findings are therefore well supported for translation tasks, but limited by language coverage and by the absence of grammatical error correction (GEC) data.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 30%
Why It Matters For Business
You can run an open 13B model that matches or beats other open models on translation and, in some settings, outperforms closed models on named entity recognition (NER) and post-editing. This reduces vendor lock-in and inference cost while enabling customization.
Summary TLDR
The authors adapt LLaMA-2 into a family of open multilingual models (TOWERBASE and TOWERINSTRUCT, 7B/13B) for translation workflows. They continue pretraining LLaMA-2 on 20B tokens that mix monolingual and parallel sentences, then instruction-finetune the result on a curated dataset (TOWERBLOCKS). TOWERINSTRUCT-13B matches or exceeds other open models on translation and often approaches GPT-4 quality on standard benchmarks; it also performs strongly on automatic post-editing and multilingual NER. The paper releases the models, the specialization dataset, and an evaluation framework.
Problem Statement
Open LLMs are often English-centric and lag behind closed models on multiple translation-related tasks. The paper asks: can we adapt an open base model to handle many translation workflow tasks at once and match closed LLM quality?
Main Contribution
A two-stage recipe: continued pretraining on a multilingual mix (monolingual + parallel) then instruction finetuning for translation tasks.
TOWERBASE (continued-pretrained LLaMA-2) and TOWERINSTRUCT (instruction-finetuned) in 7B and 13B sizes.
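The data side of the first stage (mixing monolingual and parallel text for continued pretraining) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `parallel_fraction` parameter, and the way a sentence pair is serialized into one training document are all assumptions, since the summary does not state the mixing ratio or the template.

```python
import random
from typing import Iterator, List, Tuple

# (source sentence, target sentence, source language, target language)
ParallelPair = Tuple[str, str, str, str]


def format_parallel(src: str, tgt: str, src_lang: str, tgt_lang: str) -> str:
    # Hypothetical template: the summary does not say how pairs are serialized.
    return f"{src_lang}: {src}\n{tgt_lang}: {tgt}"


def mix_corpus(
    monolingual: List[str],
    parallel: List[ParallelPair],
    parallel_fraction: float,
    seed: int = 0,
) -> Iterator[str]:
    """Yield training documents, drawing a parallel pair with probability
    `parallel_fraction` and a monolingual document otherwise. Stops as soon
    as either pool is exhausted."""
    rng = random.Random(seed)
    mono = list(monolingual)
    para = list(parallel)
    rng.shuffle(mono)
    rng.shuffle(para)
    while mono and para:
        if rng.random() < parallel_fraction:
            yield format_parallel(*para.pop())
        else:
            yield mono.pop()


if __name__ == "__main__":
    docs = mix_corpus(
        ["Ein Satz.", "Une phrase.", "A sentence."],
        [("Hello", "Hallo", "en", "de")],
        parallel_fraction=0.5,  # assumed value, for illustration only
    )
    for doc in docs:
        print(doc, end="\n---\n")
```

Setting `parallel_fraction` to 0.0 gives a monolingual-only mix, which is the kind of comparison the ablation on parallel data describes.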
Key Findings
TOWERINSTRUCT-13B is the strongest open model evaluated for translation and is close to GPT-4 on standard benchmarks, for example within 0.25 COMET-22 points on FLORES-200 (en→xx).
Adding parallel sentences during continued pretraining boosts translation quality more than monolingual-only pretraining.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Translation quality (COMET-22) on FLORES-200 (en → xx) | TOWERINSTRUCT-13B: 88.88; GPT-4: 89.13 | GPT-4 | -0.25 | FLORES-200 (aggregated en→xx) | Table 1: aggregated COMET-22 scores | Table 1 |
| Automatic post-editing (COMET-22) en → xx | TOWERINSTRUCT-13B: 83.31; Baseline (no edits): 76.80 | Baseline (no edits) | +6.51 | WMT23 APE (aggregated) | Table 3: APE aggregated results | Table 3 |
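The Delta column is the system score minus the baseline score. A small sketch that recomputes it from the values in the table (the dictionary layout and the `delta` helper are illustrative choices, not part of the paper):

```python
# Scores copied from the Results table; both rows use COMET-22.
results = [
    {"task": "translation, FLORES-200 (en→xx)", "system": 88.88, "baseline": 89.13},
    {"task": "automatic post-editing, WMT23 APE (en→xx)", "system": 83.31, "baseline": 76.80},
]


def delta(system: float, baseline: float) -> float:
    """Score difference, rounded to two decimals like the table."""
    return round(system - baseline, 2)


deltas = {row["task"]: delta(row["system"], row["baseline"]) for row in results}

if __name__ == "__main__":
    for task, value in deltas.items():
        print(f"{task}: {value:+.2f}")
```

A negative delta means the system scores below its baseline (GPT-4 in the first row); a positive delta means it improves on the baseline (the unedited translation in the second row).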
What To Try In 7 Days
Evaluate TOWERINSTRUCT-13B on your translation pipeline for en↔xx pairs.
If you build translation models, add cleaned parallel data to continued pretraining and measure the improvement with COMET-22.
Replace or augment manual post-editing steps with TOWERINSTRUCT, test NER-based anonymization, and measure the time saved.
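The evaluation steps above amount to scoring two systems on the same test set and comparing the averages. A minimal sketch of such a harness, assuming you already have the outputs of both systems as lists of strings: `compare_systems` and the `Scorer` signature are hypothetical names, and `exact_match_scorer` is only a toy stand-in so the example runs without downloading a model. A real run would plug in a COMET-22 scorer instead.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# A scorer maps (sources, hypotheses, references) to one score per segment.
Scorer = Callable[[Sequence[str], Sequence[str], Sequence[str]], List[float]]


def compare_systems(
    sources: Sequence[str],
    references: Sequence[str],
    candidate: Sequence[str],
    baseline: Sequence[str],
    scorer: Scorer,
) -> Dict[str, float]:
    """Average segment scores for both systems and report the difference."""
    if not (len(sources) == len(references) == len(candidate) == len(baseline)):
        raise ValueError("all inputs must have the same number of segments")
    cand = mean(scorer(sources, candidate, references))
    base = mean(scorer(sources, baseline, references))
    return {"candidate": cand, "baseline": base, "delta": cand - base}


def exact_match_scorer(
    sources: Sequence[str],
    hypotheses: Sequence[str],
    references: Sequence[str],
) -> List[float]:
    # Toy stand-in for a learned metric: 1.0 on exact match, else 0.0.
    return [1.0 if h.strip() == r.strip() else 0.0 for h, r in zip(hypotheses, references)]


if __name__ == "__main__":
    report = compare_systems(
        sources=["Hello", "Thank you"],
        references=["Hallo", "Danke"],
        candidate=["Hallo", "Danke"],      # e.g. outputs of the model under test
        baseline=["Hallo", "Dankeschön"],  # e.g. outputs of your current system
        scorer=exact_match_scorer,
    )
    print(report)
```

Keeping the scorer as a parameter lets the same harness cover translation and post-editing runs, with only the candidate and baseline outputs changing.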
Risks & Boundaries
Limitations
Training covers 10 languages only; performance outside them is untested.
GEC data is absent from TOWERBLOCKS, so grammatical error correction quality remains average.
When Not To Use
When you need best-available GEC quality without additional tuning.
When working on languages not included in the 10-language pretraining mix.
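These two exclusions can be encoded as a simple routing check in front of the model. The sketch below is hypothetical: `should_use_tower`, the task label `"gec"`, and the placeholder language set are assumptions, and the real list of 10 covered languages must be taken from the paper because it is not given here.

```python
from typing import FrozenSet


def should_use_tower(
    task: str,
    src_lang: str,
    tgt_lang: str,
    covered: FrozenSet[str],
) -> bool:
    """Return False for the cases listed above: GEC requests and
    language pairs outside the continued-pretraining mix."""
    if task == "gec":
        return False
    return src_lang in covered and tgt_lang in covered


if __name__ == "__main__":
    # Placeholder set for illustration only; fill in the paper's 10 languages.
    covered_languages = frozenset({"en", "de"})
    print(should_use_tower("translation", "en", "de", covered_languages))
    print(should_use_tower("translation", "en", "sw", covered_languages))
    print(should_use_tower("gec", "en", "en", covered_languages))
```

Requests that fail the check can be sent to a fallback system or flagged for additional tuning.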
Failure Modes
Underperformance on languages or domains not covered by the continued pretraining corpus.
Conservative editing may miss necessary corrections compared to more aggressive editors.

