TOWER: open LLaMA-2 based multilingual models tuned for translation workflows and competitive with closed LLMs

Overview

Decision Snapshot: Ready For Pilot

The paper reports multi-benchmark numeric gains, ablations isolating parallel-data benefits, and releases models/datasets, so findings are well supported for translation tasks but limited by language coverage and missing GEC data.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 30%

Authors

Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins


Why It Matters For Business

You can run an open 13B model that matches or beats other open models for translation and outperforms closed models on NER and post-editing in some settings, reducing vendor lock-in and inference cost while enabling customization.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, CTO, Product Manager

Summary TLDR

The authors adapt LLaMA-2 into a family of open multilingual models (TOWERBASE and TOWERINSTRUCT, 7B/13B) for translation workflows. They continue-pretrain LLaMA-2 on 20B tokens mixing monolingual and parallel sentences, then instruction-finetune on a curated dataset (TOWERBLOCKS). The 13B TOWERINSTRUCT matches or exceeds other open models on translation and often approaches GPT-4 quality on standard benchmarks; it also shines at automatic post-editing and multilingual NER. The paper releases models, the specialization dataset, and an evaluation framework.

Problem Statement

Open LLMs are often English-centric and lag behind closed models on multiple translation-related tasks. The paper asks: can we adapt an open base model to handle many translation workflow tasks at once and match closed LLM quality?

Main Contribution

A two-stage recipe: continued pretraining on a multilingual mix (monolingual + parallel) then instruction finetuning for translation tasks.

TOWERBASE (continued-pretrained LLaMA-2) and TOWERINSTRUCT (instruction-finetuned) in 7B and 13B sizes.

Key Findings

TOWERINSTRUCT-13B is the best open model for translation and is close to GPT-4 on standard benchmarks.

Numbers: FLORES-200 COMET-22: TOWERINSTRUCT-13B 88.88 vs GPT-4 89.13

Practical Use: If you need strong open-source translation, try TOWERINSTRUCT-13B first; it often matches closed LLM quality on evaluated benchmarks.

Evidence Ref: Table 1

Adding parallel sentences during continued pretraining boosts translation quality more than monolingual-only pretraining.

Numbers: Mixing monolingual and parallel data yields roughly +1 COMET-22 point, with about 85% of the gains reached by 5B tokens

Practical Use: When extending a base model for translation, include high-quality parallel data during continued pretraining rather than only monolingual text.

Evidence Ref: Figure 8 and Section 4
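The mixing idea behind this finding can be sketched in a few lines. This is a minimal illustration, not the authors' data pipeline: the function name `mix_corpora`, the `parallel_fraction` parameter, and the tab-joined format for sentence pairs are all assumptions, since the summary states only that monolingual and parallel data are combined.

```python
import random
from typing import Iterator, List, Tuple


def mix_corpora(
    monolingual: List[str],
    parallel: List[Tuple[str, str]],
    parallel_fraction: float,
    n_samples: int,
    seed: int = 0,
) -> Iterator[str]:
    """Yield training texts, drawing from parallel data with probability parallel_fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    rng = random.Random(seed)
    for _ in range(n_samples):
        if parallel and rng.random() < parallel_fraction:
            src, tgt = rng.choice(parallel)
            # Hypothetical formatting: source and target joined on one line.
            yield f"{src}\t{tgt}"
        else:
            yield rng.choice(monolingual)
```

Sweeping `parallel_fraction` over a few values (including 0.0 for a monolingual-only control) and scoring each run with COMET-22 would mirror the comparison described in the finding.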

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Translation quality (COMET-22) on FLORES-200 (en → xx)	TOWERINSTRUCT-13B: 88.88; GPT-4: 89.13	GPT-4	-0.25	FLORES-200 (aggregated en→xx)	Table 1: aggregated COMET-22 scores	Table 1
Automatic post-editing (COMET-22) en → xx	TOWERINSTRUCT-13B: 83.31; Baseline (no edits): 76.80	Baseline (no edits)	+6.51	WMT23 APE (aggregated)	Table 3: APE aggregated results	Table 3
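The Delta column is simply the reported value minus the baseline. A short check of that arithmetic, using only the numbers in the table (the helper name `delta` is an illustrative choice):

```python
def delta(value: float, baseline: float) -> float:
    """Return value minus baseline, rounded to two decimals like the table."""
    return round(value - baseline, 2)


rows = {
    "FLORES-200 en->xx, TOWERINSTRUCT-13B vs GPT-4": (88.88, 89.13),
    "WMT23 APE en->xx, TOWERINSTRUCT-13B vs no edits": (83.31, 76.80),
}
for name, (value, baseline) in rows.items():
    print(f"{name}: {delta(value, baseline):+.2f}")
```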

What To Try In 7 Days

Evaluate TOWERINSTRUCT-13B on your translation pipeline for en↔xx pairs.

If you build translation models, add cleaned parallel data to continued pretraining; test improvements on COMET-22.

Replace or augment post-editing steps with TOWERINSTRUCT for automatic post-editing, and test it for NER-based anonymization; measure the time saved.

Optimization Features

Infra Optimization

DeepSpeed for model parallelism

System Optimization

bfloat16 mixed precision and packing during finetuning
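Packing means concatenating several short tokenized examples into one training sequence so that less of each batch is padding. The summary does not say how packing is implemented, so the greedy strategy below is an assumption, and `pack_sequences` is a hypothetical helper. Real implementations typically also insert separator tokens and adjust attention masks, which this sketch omits.

```python
from typing import List


def pack_sequences(examples: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily concatenate tokenized examples into blocks of at most max_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        tokens = tokens[:max_len]  # truncate over-long examples (a simplification)
        if current and len(current) + len(tokens) > max_len:
            # The next example does not fit: close the current block.
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed
```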

Training Optimization

continued pretraining on domain-relevant multilingual mixture

instruction finetuning (supervised) with mixed zero-/few-shot templates
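To make the zero-/few-shot distinction concrete, here is a small prompt builder. The template wording is hypothetical and is not taken from TOWERBLOCKS; the only point is that the same instruction can be rendered with or without in-context examples.

```python
from typing import List, Optional, Tuple


def build_prompt(
    source: str,
    src_lang: str,
    tgt_lang: str,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Build a zero-shot prompt, or a few-shot one when examples are given."""
    # Hypothetical instruction wording, chosen for illustration only.
    lines = [f"Translate the following text from {src_lang} into {tgt_lang}."]
    for ex_src, ex_tgt in examples or []:
        lines.append(f"{src_lang}: {ex_src}")
        lines.append(f"{tgt_lang}: {ex_tgt}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)
```

Calling `build_prompt("Hello", "English", "Portuguese")` gives the zero-shot form; passing `examples=[("Thanks", "Obrigado")]` adds one demonstration pair before the query.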

Inference Optimization

MBR decoding with COMET-22 improves translation quality over greedy
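Minimum Bayes risk (MBR) decoding samples several candidate translations and returns the one that agrees most with the rest, instead of taking the single greedy output. The sketch below shows the selection step only. It uses a simple token-overlap F1 as a stand-in utility because running COMET-22 requires a model download; `token_f1` and `mbr_decode` are illustrative names, and in practice the utility would be replaced by a COMET-22 scorer.

```python
from typing import Callable, List


def token_f1(hypothesis: str, reference: str) -> float:
    """Stand-in utility: F1 over unique whitespace tokens (NOT COMET-22)."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    candidates: List[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest average utility against the other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # The other candidates act as pseudo-references.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]
```

The cost grows quadratically with the number of candidates, since every candidate is scored against every other one, so the candidate pool size is the main knob to tune.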

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Training covers 10 languages only; performance outside them is untested.

GEC (grammatical error correction) data is absent from TOWERBLOCKS, so performance on grammatical error correction remains average.

When Not To Use

When you need best-available GEC quality without additional tuning.

When working on languages not included in the 10-language pretraining mix.

Failure Modes

Underperformance on languages or domains not covered by the continued pretraining corpus.

Conservative editing may miss necessary corrections compared to more aggressive editors.

Core Entities

Models

TOWERINSTRUCT-13B, TOWERINSTRUCT-7B, TOWERBASE-13B, TOWERBASE-7B, LLaMA-2 (backbone), NLLB-54B, ALMA-R

Metrics

COMET-22, XCOMET, COMETKIWI-22, BLEURT, CHRF, Edit Rate (ER), Sequence F1, ERRANT

Datasets

TOWERBLOCKS, Continued pretraining corpus (20B tokens mix), FLORES-200, WMT23, TICO-19, MultiCoNER, OPUS

Benchmarks

TOWEREVAL, FLORES-200, WMT23, TICO-19

