TOWER: open LLaMA-2 based multilingual models tuned for translation workflows and competitive with closed LLMs

February 27, 2024 (7 min read)

Overview

Production Readiness

0.7

Novelty Score

0.3

Cost Impact Score

0.6

Citation Count

6

Authors

Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, André F. T. Martins

Why It Matters For Business

You can run an open 13B model that matches or beats other open models on translation and, in some settings, outperforms closed models on NER and automatic post-editing. That reduces vendor lock-in and inference cost while allowing customization.

Summary TLDR

The authors adapt LLaMA-2 into a family of open multilingual models (TOWERBASE and TOWERINSTRUCT, 7B/13B) for translation workflows. They continue pretraining LLaMA-2 on 20B tokens that mix monolingual text and parallel sentences, then instruction-finetune on a curated dataset (TOWERBLOCKS). The 13B TOWERINSTRUCT matches or exceeds other open models on translation and often approaches GPT-4 quality on standard benchmarks; it also performs strongly on automatic post-editing and multilingual NER. The paper releases the models, the specialization dataset, and an evaluation framework.

Problem Statement

Open LLMs are often English-centric and lag behind closed models on multiple translation-related tasks. The paper asks: can we adapt an open base model to handle many translation workflow tasks at once and match closed LLM quality?

Main Contribution

A two-stage recipe: continued pretraining on a multilingual mix (monolingual + parallel) then instruction finetuning for translation tasks.

TOWERBASE (continued-pretrained LLaMA-2) and TOWERINSTRUCT (instruction-finetuned) in 7B and 13B sizes.

TOWERBLOCKS: a curated, high-quality instruction dataset for translation-related tasks.

TOWEREVAL: an evaluation framework and benchmark suite for translation workflows, and released model outputs for reproducibility.

Ablations showing parallel data in continued pretraining and dataset composition materially affect translation and downstream task performance.
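The first stage of the recipe feeds monolingual and parallel text into one continued-pretraining stream. A minimal sketch of such a mixer follows. The `parallel_fraction` parameter, the function names, and the plain-text layout for parallel pairs are all hypothetical; this summary does not state the ratio or formatting the authors used.

```python
import random
from typing import Iterator, List, Tuple

# (source language, source text, target language, target text)
ParallelPair = Tuple[str, str, str, str]


def format_parallel(src_lang: str, src: str, tgt_lang: str, tgt: str) -> str:
    # Hypothetical layout for a parallel pair; not the authors' template.
    return f"{src_lang}: {src}\n{tgt_lang}: {tgt}"


def mix_corpus(
    monolingual: List[str],
    parallel: List[ParallelPair],
    parallel_fraction: float,
    n_docs: int,
    seed: int = 0,
) -> Iterator[str]:
    """Yield n_docs training documents, drawing a parallel pair with
    probability parallel_fraction and a monolingual document otherwise."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if not monolingual:
        raise ValueError("monolingual corpus must not be empty")
    rng = random.Random(seed)
    for _ in range(n_docs):
        if parallel and rng.random() < parallel_fraction:
            yield format_parallel(*rng.choice(parallel))
        else:
            yield rng.choice(monolingual)
```

Sweeping `parallel_fraction` (including 0.0 for a monolingual-only run) is one way to set up the kind of ablation the paper describes, under the same compute budget per setting.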

Key Findings

TOWERINSTRUCT-13B is the strongest open model evaluated for translation and is close to GPT-4 on standard benchmarks.

Numbers: FLORES-200 COMET-22: TOWERINSTRUCT-13B 88.88 vs GPT-4 89.13

Adding parallel sentences during continued pretraining boosts translation quality more than monolingual-only pretraining.

Numbers: Mixing monolingual and parallel data yields about +1 COMET-22 point, with 85% of the gains reached by 5B tokens

TOWERINSTRUCT strongly improves post-editing and named-entity recognition compared to open baselines.

Numbers: APE en→xx COMET-22: TOWERINSTRUCT-13B 83.31 vs baseline (no edits) 76.80; NER F1: TOWERINSTRUCT-13B 74.70 vs GPT-4 59.88

TOWERINSTRUCT edits less often than GPT-4 but still improves quality.

Numbers: Edit rate: GPT-4 ≈90% of instances edited vs TOWERINSTRUCT ≈30%

Grammatical error correction (GEC) performance is average and does not improve over baselines.

Numbers: No model significantly outperforms others on GEC across languages in this study
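The edit-rate finding above can be reproduced on your own post-editing data with a few lines. This is a minimal sketch that assumes edit rate means the share of instances whose post-edited output differs from the input translation, which matches the "instances edited" wording above; the function name and the whitespace normalization are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def edit_rate(originals: Sequence[str], post_edited: Sequence[str]) -> float:
    """Fraction of instances whose post-edited text differs from the original.

    Leading and trailing whitespace is ignored so that trivial formatting
    differences do not count as edits (an assumption of this sketch).
    """
    if len(originals) != len(post_edited):
        raise ValueError("inputs must have the same length")
    if not originals:
        return 0.0
    changed = sum(o.strip() != p.strip() for o, p in zip(originals, post_edited))
    return changed / len(originals)
```

For example, `edit_rate(["a", "b", "c", "d"], ["a", "x", "c", "d"])` gives 0.25, since one of four instances was changed. Pairing this number with a quality metric shows whether a model edits sparingly but usefully, or edits often with little gain.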

Results

Translation quality (COMET-22) on FLORES-200 (en → xx)

Value: TOWERINSTRUCT-13B: 88.88; GPT-4: 89.13

Baseline: GPT-4

Automatic post-editing (COMET-22) en → xx

Value: TOWERINSTRUCT-13B: 83.31; Baseline (no edits): 76.80

Baseline: No edits

Named entity recognition (Sequence F1, multilingual)

Value: TOWERINSTRUCT-13B: 74.70; GPT-4: 59.88

Baseline: GPT-4

Who Should Care

What To Try In 7 Days

Evaluate TOWERINSTRUCT-13B on your translation pipeline for en↔xx pairs.

If you build translation models, add cleaned parallel data to continued pretraining; test improvements on COMET-22.

Replace or augment manual post-editing steps with TOWERINSTRUCT for automatic post-editing, and trial it for NER-based anonymization; measure the time saved.
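A small harness that scores every candidate system on the same test set keeps such a trial comparable. The sketch below is illustrative only: `compare_systems`, `Translate`, and `Score` are hypothetical names, and the scoring callable is a placeholder where a real metric such as COMET-22 would be plugged in.

```python
from statistics import mean
from typing import Callable, Dict, List

Translate = Callable[[str], str]
# (source, hypothesis, reference) -> score; stand-in for a metric like COMET-22
Score = Callable[[str, str, str], float]


def compare_systems(
    sources: List[str],
    references: List[str],
    systems: Dict[str, Translate],
    score: Score,
) -> Dict[str, float]:
    """Translate every source with every system and return the mean score
    per system, so all systems are judged on identical inputs."""
    if len(sources) != len(references):
        raise ValueError("sources and references must have the same length")
    if not sources:
        raise ValueError("test set must not be empty")
    results: Dict[str, float] = {}
    for name, translate in systems.items():
        hypotheses = [translate(s) for s in sources]
        results[name] = mean(
            score(s, h, r) for s, h, r in zip(sources, hypotheses, references)
        )
    return results
```

In practice each entry in `systems` would wrap a call to your current engine or to a TOWERINSTRUCT deployment, and `score` would wrap the metric you already trust for your domain.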

Optimization Features

Infra Optimization

  • DeepSpeed for model parallelism

System Optimization

  • bfloat16 mixed precision and packing during finetuning
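Packing reduces padding waste by concatenating several short training examples into one fixed-length sequence. The sketch below shows one common form of it, assuming examples are already tokenized and separated by an end-of-sequence token; the function name, the `block_size` and `eos_id` parameters, and the choice to drop the trailing remainder are assumptions, since this summary does not say which variant the authors used.

```python
from typing import Iterable, List


def pack_sequences(
    examples: Iterable[List[int]], block_size: int, eos_id: int
) -> List[List[int]]:
    """Concatenate tokenized examples, each followed by eos_id, and cut the
    resulting stream into blocks of exactly block_size tokens.

    Tokens left over at the end that do not fill a block are dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for ids in examples:
        buffer.extend(ids)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks
```

Note that this simple variant lets an example straddle two blocks and does not mask attention between neighbouring examples; stricter variants handle both.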

Training Optimization

  • continued pretraining on domain-relevant multilingual mixture
  • instruction finetuning (supervised) with mixed zero-/few-shot templates
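Mixing zero-shot and few-shot templates means that some training prompts include worked examples and others do not. A minimal sketch of that idea follows; the prompt wording, the function names, and the `few_shot_prob` and `max_shots` parameters are hypothetical and are not the TOWERBLOCKS templates.

```python
import random
from typing import List, Optional, Tuple

Example = Tuple[str, str]  # (source text, target text)


def build_prompt(
    src_lang: str,
    tgt_lang: str,
    source: str,
    shots: Optional[List[Example]] = None,
) -> str:
    """Build a translation prompt; with shots it is few-shot, without zero-shot."""
    lines = [f"Translate the following text from {src_lang} into {tgt_lang}."]
    for shot_src, shot_tgt in shots or []:
        lines.append(f"{src_lang}: {shot_src}")
        lines.append(f"{tgt_lang}: {shot_tgt}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


def sample_training_prompt(
    src_lang: str,
    tgt_lang: str,
    source: str,
    pool: List[Example],
    few_shot_prob: float,
    max_shots: int,
    rng: random.Random,
) -> str:
    """With probability few_shot_prob, prepend 1..max_shots examples drawn
    from pool; otherwise return the zero-shot prompt."""
    if pool and max_shots >= 1 and rng.random() < few_shot_prob:
        k = rng.randint(1, min(max_shots, len(pool)))
        return build_prompt(src_lang, tgt_lang, source, rng.sample(pool, k))
    return build_prompt(src_lang, tgt_lang, source)
```

Training on both forms is intended to keep the model usable with or without in-context examples at inference time.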

Inference Optimization

  • Minimum Bayes risk (MBR) decoding with COMET-22 as the utility metric improves translation quality over greedy decoding
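MBR decoding samples several candidate translations and keeps the one that agrees most with the others under a utility metric. The sketch below illustrates the selection step only. The `token_f1` utility is a toy stand-in chosen so the example runs without external models; the paper uses COMET-22, which also takes the source sentence as input. All function names here are hypothetical.

```python
from typing import Callable, List


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility: token-overlap F1. A stand-in for COMET-22, not a substitute."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum(min(h.count(t), r.count(t)) for t in set(h))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest average utility when every
    other candidate is treated as a pseudo-reference."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best
```

The cost grows quadratically with the number of candidates, because every candidate is scored against every other one, so the quality gain over greedy decoding comes at a higher inference cost.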

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training covers 10 languages only; performance outside them is untested.
  • GEC data absent from TOWERBLOCKS, so grammatical correction remains average.
  • Performance drops relative to GPT-4 on longer sentences and some metrics.
  • Ablations use limited compute budgets; larger continued pretraining may change trade-offs.

When Not To Use

  • When you need best-available GEC quality without additional tuning.
  • When working on languages not included in the 10-language pretraining mix.
  • When you require the absolute top translation quality on very long-context documents.

Failure Modes

  • Underperformance on languages or domains not covered by the continued pretraining corpus.
  • Conservative editing may miss necessary corrections compared to more aggressive editors.
  • Potential hallucinations in generative tasks when domain-specific data is missing.

Core Entities

Models

  • TOWERINSTRUCT-13B
  • TOWERINSTRUCT-7B
  • TOWERBASE-13B
  • TOWERBASE-7B
  • LLaMA-2 (backbone)
  • NLLB-54B
  • ALMA-R

Metrics

  • COMET-22
  • XCOMET
  • COMETKIWI-22
  • BLEURT
  • CHRF
  • Edit Rate (ER)
  • Sequence F1
  • ERRANT

Datasets

  • TOWERBLOCKS
  • Continued pretraining corpus (20B-token mix)
  • FLORES-200
  • WMT23
  • TICO-19
  • MultiCoNER
  • OPUS

Benchmarks

  • TOWEREVAL
  • FLORES-200
  • WMT23
  • TICO-19