TOWER: open LLaMA-2-based multilingual models tuned for translation workflows and competitive with closed LLMs

February 27, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports numeric gains across multiple benchmarks, includes ablations that isolate the benefit of parallel data, and releases models and datasets, so the findings are well supported for translation tasks. They are limited by language coverage and by the absence of GEC data.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 30%

Authors

Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins

Why It Matters For Business

You can run an open 13B model that matches or beats other open models for translation and outperforms closed models on NER and post-editing in some settings, reducing vendor lock-in and inference cost while enabling customization.

Summary TLDR

The authors adapt LLaMA-2 into a family of open multilingual models (TOWERBASE and TOWERINSTRUCT, each in 7B and 13B sizes) for translation workflows. They continue pretraining LLaMA-2 on 20B tokens that mix monolingual and parallel data, then instruction-finetune on a curated dataset (TOWERBLOCKS). The 13B TOWERINSTRUCT matches or exceeds other open models on translation and often approaches GPT-4 quality on standard benchmarks; it also performs strongly on automatic post-editing and multilingual NER. The paper releases the models, the specialization dataset, and an evaluation framework.

Problem Statement

Open LLMs are often English-centric and lag behind closed models on multiple translation-related tasks. The paper asks: can we adapt an open base model to handle many translation workflow tasks at once and match closed LLM quality?

Main Contribution

A two-stage recipe: continued pretraining on a multilingual mix (monolingual + parallel) then instruction finetuning for translation tasks.

TOWERBASE (continued-pretrained LLaMA-2) and TOWERINSTRUCT (instruction-finetuned) in 7B and 13B sizes.
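The data side of the first stage can be sketched as a sampler that interleaves monolingual documents with formatted sentence pairs. This is a minimal illustration, not the authors' pipeline: `mix_streams`, `format_parallel`, the `parallel_fraction` parameter, and the pair template are all hypothetical names and choices introduced here, and the actual mixing ratio is not stated in this summary.

```python
import random
from typing import Iterator, List


def format_parallel(src_lang: str, src: str, tgt_lang: str, tgt: str) -> str:
    """Turn a sentence pair into one training string.

    The template is a hypothetical choice made for this sketch.
    """
    return f"{src_lang}: {src}\n{tgt_lang}: {tgt}"


def mix_streams(
    monolingual: List[str],
    parallel: List[str],
    parallel_fraction: float,
    n_samples: int,
    seed: int = 0,
) -> Iterator[str]:
    """Yield n_samples documents for continued pretraining.

    Each document comes from the parallel pool with probability
    parallel_fraction and from the monolingual pool otherwise.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    for _ in range(n_samples):
        pool = parallel if rng.random() < parallel_fraction else monolingual
        yield rng.choice(pool)


if __name__ == "__main__":
    mono = ["Um texto monolingue.", "Ein einsprachiger Text."]
    pairs = [format_parallel("English", "Hello", "Portuguese", "Olá")]
    for doc in mix_streams(mono, pairs, parallel_fraction=0.3, n_samples=5):
        print(repr(doc))
```

Sampling per document keeps the sketch short; a production pipeline would more likely budget by token count per language and data type.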

Key Findings

TOWERINSTRUCT-13B is the best open model for translation and is close to GPT-4 on standard benchmarks.

Numbers: FLORES-200 COMET-22: TOWERINSTRUCT-13B 88.88 vs GPT-4 89.13

Practical Use: If you need strong open-source translation, try TOWERINSTRUCT-13B first; it often matches closed LLM quality on evaluated benchmarks.

Evidence Ref: Table 1

Adding parallel sentences during continued pretraining boosts translation quality more than monolingual-only pretraining.

Numbers: Mixing monolingual and parallel data yields about +1 COMET-22 point, with 85% of the gains reached by 5B tokens

Practical Use: When extending a base model for translation, include high-quality parallel data during continued pretraining rather than only monolingual text.

Evidence Ref: Figure 8 and Section 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Translation quality (COMET-22) on FLORES-200 (en → xx) | TOWERINSTRUCT-13B: 88.88; GPT-4: 89.13 | GPT-4 | -0.25 | FLORES-200 (aggregated en → xx) | Table 1: aggregated COMET-22 scores | Table 1
Automatic post-editing (COMET-22) en → xx | TOWERINSTRUCT-13B: 83.31; Baseline (no edits): 76.80 | Baseline (no edits) | +6.51 | WMT23 APE (aggregated) | Table 3: APE aggregated results | Table 3
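The deltas in the Results above are plain differences between the system score and the baseline score. The short sketch below recomputes them from the four reported COMET-22 values; the `Result` class and its field names are illustrative and not part of the paper or its evaluation framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    task: str
    system: float    # TOWERINSTRUCT-13B score as reported above
    baseline: float  # comparison score as reported above

    @property
    def delta(self) -> float:
        # Rounded to two decimals to match the precision of the reported scores.
        return round(self.system - self.baseline, 2)


RESULTS = [
    Result("FLORES-200 en->xx translation (baseline: GPT-4)", 88.88, 89.13),
    Result("WMT23 APE en->xx (baseline: no edits)", 83.31, 76.80),
]

for r in RESULTS:
    print(f"{r.task}: {r.delta:+.2f} COMET-22")
```

A negative delta means the system trails the baseline (here by a quarter of a point against GPT-4); a positive delta means it improves on it.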

What To Try In 7 Days

Evaluate TOWERINSTRUCT-13B on your translation pipeline for en↔xx pairs.

If you build translation models, add cleaned parallel data to continued pretraining; test improvements on COMET-22.

Replace or augment manual post-editing steps with TOWERINSTRUCT for automatic post-editing, and test it for NER-based anonymization; measure the time saved in both cases.
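A pilot along these lines needs a small harness that scores the current system and the candidate on the same test set. The sketch below is one way to set that up: `compare_systems`, the `Metric` and `Translator` type aliases, and the `exact_match` stand-in metric are hypothetical names introduced here. In a real trial the metric would be a learned score such as COMET-22 and the translators would call your existing pipeline and TOWERINSTRUCT.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# (source, hypothesis, reference) -> score, higher is better.
Metric = Callable[[str, str, str], float]
Translator = Callable[[str], str]


def compare_systems(
    sources: Sequence[str],
    references: Sequence[str],
    current: Translator,
    candidate: Translator,
    metric: Metric,
) -> Dict[str, float]:
    """Score two systems on the same segments and report the mean difference."""
    if not sources or len(sources) != len(references):
        raise ValueError("sources and references must be non-empty and aligned")
    cur = [metric(s, current(s), r) for s, r in zip(sources, references)]
    new = [metric(s, candidate(s), r) for s, r in zip(sources, references)]
    return {
        "current_mean": mean(cur),
        "candidate_mean": mean(new),
        "delta": mean(new) - mean(cur),
        # Number of segments where the candidate scores strictly higher.
        "candidate_wins": sum(n > c for n, c in zip(new, cur)),
    }


def exact_match(source: str, hypothesis: str, reference: str) -> float:
    """Toy stand-in metric; replace with a real quality metric in practice."""
    return 1.0 if hypothesis.strip() == reference.strip() else 0.0


if __name__ == "__main__":
    srcs = ["hello", "thank you"]
    refs = ["olá", "obrigado"]
    old_system = {"hello": "olá", "thank you": "obrigada"}.get
    new_system = {"hello": "olá", "thank you": "obrigado"}.get
    print(compare_systems(srcs, refs, old_system, new_system, exact_match))
```

Reporting per-segment wins next to the mean delta helps show whether an average gain comes from many small improvements or from a few large ones.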

Optimization Features

Infra Optimization
DeepSpeed for model parallelism
System Optimization
bfloat16 mixed precision and packing during finetuning
Training Optimization
continued pretraining on domain-relevant multilingual mixture
instruction finetuning (supervised) with mixed zero-/few-shot templates
Inference Optimization
MBR decoding with COMET-22 improves translation quality over greedy
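Minimum Bayes risk (MBR) decoding samples several candidate translations and keeps the one that agrees most with the others under a utility metric, instead of taking the single greedy output. The sketch below shows the selection step only. It assumes the candidates have already been generated, and it swaps in a toy character n-gram F1 (`char_ngram_f1`, a hypothetical helper) for COMET-22 so that it runs without any model.

```python
from collections import Counter
from typing import Callable, Sequence


def char_ngram_f1(hyp: str, ref: str, n: int = 3) -> float:
    """Toy utility: F1 over character n-grams (a stand-in, not COMET-22)."""
    def grams(s: str) -> Counter:
        return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 0)))

    h, r = grams(hyp), grams(ref)
    if not h or not r:
        # Strings shorter than n characters: fall back to exact match.
        return 1.0 if hyp == ref else 0.0
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = char_ngram_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(i: int) -> float:
        # Every other candidate serves as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a dog ran away",
    ]
    print(mbr_select(samples))
```

Selection costs one utility call per ordered pair of candidates, so the number of calls grows quadratically with the sample count; with a neural metric this is the main inference overhead relative to greedy decoding.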

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Training covers 10 languages only; performance outside them is untested.

GEC data is absent from TOWERBLOCKS, so grammatical error correction (GEC) performance remains average.

When Not To Use

When you need best-available GEC quality without additional tuning.

When working on languages not included in the 10-language pretraining mix.

Failure Modes

Underperformance on languages or domains not covered by the continued pretraining corpus.

Conservative editing may miss necessary corrections compared to more aggressive editors.
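One way to watch for overly conservative editing is to measure how much of each machine-translated draft the post-editor actually changes. The sketch below computes a word-level edit rate as the Levenshtein distance between draft and edited output, divided by the draft length. This is an illustrative definition chosen for this sketch and may differ from the Edit Rate (ER) used in the paper; `word_edit_distance` and `edit_rate` are hypothetical helper names.

```python
from typing import List


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over tokens (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete wa
                cur[j - 1] + 1,            # insert wb
                prev[j - 1] + (wa != wb),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def edit_rate(draft: str, edited: str) -> float:
    """Word edits per draft word; 0.0 means the draft was left untouched."""
    d, e = draft.split(), edited.split()
    # An empty draft is treated as length 1 to avoid division by zero.
    return word_edit_distance(d, e) / max(len(d), 1)


if __name__ == "__main__":
    draft = "the contract are valid until december"
    edited = "the contract is valid until December"
    print(f"edit rate: {edit_rate(draft, edited):.2f}")
```

An edit rate near zero on segments that human reviewers still flag as wrong is a signal that the editor is leaving necessary corrections unmade.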

Core Entities

Models

TOWERINSTRUCT-13B, TOWERINSTRUCT-7B, TOWERBASE-13B, TOWERBASE-7B, LLaMA-2 (backbone), NLLB-54B, ALMA-R

Metrics

COMET-22, XCOMET, COMETKIWI-22, BLEURT, CHRF, Edit Rate (ER), Sequence F1, ERRANT

Datasets

TOWERBLOCKS, Continued pretraining corpus (20B tokens mix), FLORES-200, WMT23, TICO-19, MultiCoNER, OPUS

Benchmarks

TOWEREVAL, FLORES-200, WMT23, TICO-19