Smaller, faster NLLB-based models for 15 African language pairs, with released data and code

February 10, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yasmin Moslem, Aman Kassahun Wassie, Amanuel Gizachew Abebe

Why It Matters For Business

AfriNLLB delivers translation models that run inference 20–57% faster at similar quality, which lowers GPU and server costs and makes deployment in resource-constrained environments more practical.

Summary TLDR

AfriNLLB compresses NLLB-200 600M via iterative layer pruning and float16 quantization, then recovers quality with multi-stage fine-tuning and knowledge distillation. The authors curate and filter parallel data for 15 language pairs (mostly African), train pruned models (notably a 548M version), and show average translation quality comparable to the baseline while delivering 20–57% faster inference. They release models, code, and training data to enable practical deployment in resource-constrained settings.

Problem Statement

African languages lack compact, deployable translation models and consolidated parallel datasets. Large multilingual models exist but are heavy to run, and African parallel data is scattered across sources, so collecting and cleaning it is time-consuming. AfriNLLB aims to provide accurate, efficient translation models for African languages and to publish the data and code behind them.

Main Contribution

Curated and filtered parallel corpora for 15 language pairs focused on African languages (final training set ~1.6M samples).

Applied iterative layer pruning to NLLB-200 600M to build smaller models (e.g., 548M) and restored quality via fine-tuning and knowledge distillation from NLLB-200 3.3B.

Demonstrated inference speedups (≈23% for the pruned model; ≈57% for pruned + FP16) while keeping translation quality comparable on FLORES-200.

Released models in Transformers and CTranslate2 formats, plus code and training data to support reuse.

Key Findings

Iterative pruning produced a 548M model that runs faster than baseline.

Numbers: Throughput +23% (pruned) and +57% (pruned + FP16) vs NLLB-200 600M

Average translation quality was comparable or slightly improved after fine-tuning and distillation.

Numbers: Avg BLEU 26.21 → 27.05 (+3.2%) on evaluated directions

Training data and distillation volumes.

Numbers: Final training set ~1.6M samples; distillation data 568k segments

Pruning strategy matters: iterative pruning beats middle-layer removal.

Numbers: Iterative pruning yields better chrF++ and a better speed-quality tradeoff across ablations

Results

Average BLEU (all evaluated directions)

Value: Baseline 26.21 → AfriNLLB 27.05

Baseline: NLLB-200 600M

Throughput (tokens/sec)

Value: Baseline 1469.96 → Pruned 1807.61 → Pruned+FP16 3513.32

Baseline: NLLB-200 600M
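A speedup percentage can mean either "more tokens per second" or "less time for the same tokens", and the two give different numbers. The short sketch below converts the throughput values listed above both ways. The helper names are illustrative, not from the paper. For the pruned model the throughput gain works out to about 23%, in line with the reported figure. For pruned + FP16 the two definitions diverge widely (roughly 2.4x throughput versus roughly 58% less time), so it is worth checking which definition a quoted percentage uses before comparing it with your own measurements.

```python
# Throughput figures (tokens/sec) as listed in the Results section.
throughput = {
    "baseline": 1469.96,
    "pruned": 1807.61,
    "pruned_fp16": 3513.32,
}


def relative_gain(new: float, old: float) -> float:
    """Relative throughput increase, e.g. 0.25 means 25% more tokens/sec."""
    return new / old - 1.0


def time_reduction(new: float, old: float) -> float:
    """Fraction of decoding time saved for a fixed number of tokens."""
    return 1.0 - old / new


base = throughput["baseline"]
for name in ("pruned", "pruned_fp16"):
    tps = throughput[name]
    print(
        f"{name}: throughput gain {relative_gain(tps, base):.1%}, "
        f"time saved {time_reduction(tps, base):.1%}"
    )
```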

Model size after pruning

Value: 548M (12 encoder / 8 decoder layers; 4 decoder layers removed)

Baseline: 600M
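As a rough sanity check on these sizes, the parameters in a single Transformer layer can be counted by hand. The sketch below assumes a hidden size of 1024 and a feed-forward size of 4096, which are the values in the public configuration of the NLLB-200 distilled 600M checkpoint and are not stated in this summary. Under that assumption, four decoder layers hold about 67M parameters and four encoder layers about 50M. The encoder estimate lines up with the gap between the 12-8 (548M) and 8-8 (498M) variants listed under Core Entities. The "600M" label is a nominal, rounded name, so the decoder estimate need not equal 600 minus 548 exactly.

```python
def attention_params(d_model: int) -> int:
    """Query, key, value and output projections, each with a bias vector."""
    return 4 * (d_model * d_model + d_model)


def ffn_params(d_model: int, d_ffn: int) -> int:
    """Two linear layers (up and down projection), each with a bias vector."""
    return (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    """Scale and shift vectors."""
    return 2 * d_model


def encoder_layer_params(d_model: int, d_ffn: int) -> int:
    # self-attention + feed-forward, with one layer norm each
    return attention_params(d_model) + ffn_params(d_model, d_ffn) + 2 * layer_norm_params(d_model)


def decoder_layer_params(d_model: int, d_ffn: int) -> int:
    # self-attention + cross-attention + feed-forward, with one layer norm each
    return 2 * attention_params(d_model) + ffn_params(d_model, d_ffn) + 3 * layer_norm_params(d_model)


D_MODEL, D_FFN = 1024, 4096  # assumed dimensions, see the note above

removed_decoder = 4 * decoder_layer_params(D_MODEL, D_FFN)
removed_encoder = 4 * encoder_layer_params(D_MODEL, D_FFN)
print(f"4 decoder layers: {removed_decoder / 1e6:.1f}M parameters")
print(f"4 encoder layers: {removed_encoder / 1e6:.1f}M parameters")
```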

Training data used (after filtering and sampling)

Value: Final: ~1.6M samples (3.2M bidirectional after reversing)
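The step from ~1.6M to 3.2M samples comes from using every sentence pair in both translation directions. A minimal sketch of that reversal is shown below. The `Pair` type, the language codes, and the example sentence are illustrative placeholders rather than material from the released dataset.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    src_lang: str
    tgt_lang: str
    src_text: str
    tgt_text: str


def make_bidirectional(pairs: list[Pair]) -> list[Pair]:
    """Return the original pairs followed by their reversed counterparts."""
    reversed_pairs = [
        Pair(p.tgt_lang, p.src_lang, p.tgt_text, p.src_text) for p in pairs
    ]
    return pairs + reversed_pairs


# Illustrative example: one pair becomes two training samples.
data = [Pair("eng_Latn", "swh_Latn", "Thank you", "Asante")]
both = make_bidirectional(data)
print(len(both))
```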

What To Try In 7 Days

Download the AfriNLLB CTranslate2 model and test inference latency on your GPU to measure real speedups.

Fine-tune the 548M Transformers checkpoint on a small in-domain sample to check quality recovery for your domain.

Use the authors' filtering pipeline (language ID + semantic + QE) to quickly clean parallel data for an African language of interest.
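The three filtering stages named above (language ID, semantic similarity, quality estimation) can be chained as successive threshold checks. The sketch below is not the authors' pipeline: the scorer callables, their signatures, and all threshold values are assumptions chosen for illustration, and in practice each scorer would wrap a real model. The semantic scorer is optional so that a language without embedding support can skip that stage.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional

Scorer = Callable[[str, str], float]  # (source, target) -> score in [0, 1]


@dataclass(frozen=True)
class Thresholds:
    # All values are placeholders, not taken from the paper.
    lang_id: float = 0.9
    semantic: float = 0.7
    qe: float = 0.5


def filter_pairs(
    pairs: Iterable[tuple[str, str]],
    lang_id_score: Scorer,
    qe_score: Scorer,
    semantic_score: Optional[Scorer] = None,
    thresholds: Thresholds = Thresholds(),
) -> Iterator[tuple[str, str]]:
    """Yield only the pairs that pass every enabled filtering stage."""
    for src, tgt in pairs:
        if lang_id_score(src, tgt) < thresholds.lang_id:
            continue  # wrong or uncertain language on either side
        if semantic_score is not None and semantic_score(src, tgt) < thresholds.semantic:
            continue  # source and target do not appear to mean the same thing
        if qe_score(src, tgt) < thresholds.qe:
            continue  # estimated translation quality too low
        yield src, tgt


# Toy scorers standing in for real models.
toy_pairs = [("good pair", "bon couple"), ("noise", "")]
kept = list(
    filter_pairs(
        toy_pairs,
        lang_id_score=lambda s, t: 1.0,
        qe_score=lambda s, t: 1.0 if t else 0.0,
    )
)
print(kept)
```

Cheaper checks run first so that expensive model-based scoring is applied only to pairs that survive the earlier stages.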

Optimization Features

Token Efficiency

  • Higher throughput (tokens/sec) after pruning and FP16 conversion

Infra Optimization

  • Training and evaluation reported on a single A40 48GB GPU

Model Optimization

  • Iterative layer pruning (greedy layer importance evaluation)
  • Decoder-focused pruning (the main 12-8 model keeps all 12 encoder layers)
  • FP16 quantization for faster inference
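Greedy iterative layer pruning can be sketched as a loop that, at each step, tries removing every remaining layer and drops the one whose removal costs the least on a development metric. The code below illustrates that general idea only and may differ from the authors' exact procedure. The `evaluate` callable is hypothetical: in a real setup it would build a model from the kept layers and return a score such as chrF++ on a dev set.

```python
from typing import Callable, Sequence


def iterative_prune(
    layers: Sequence[int],
    n_remove: int,
    evaluate: Callable[[tuple[int, ...]], float],
) -> tuple[int, ...]:
    """Greedily remove `n_remove` layers, one per iteration.

    At each iteration every remaining layer is a removal candidate; the
    configuration with the highest score after removal is kept.
    """
    kept = tuple(layers)
    for _ in range(n_remove):
        candidates = [tuple(l for l in kept if l != drop) for drop in kept]
        kept = max(candidates, key=evaluate)
    return kept


# Toy scorer: each layer contributes a fixed amount to the score.
importance = {0: 5.0, 1: 1.0, 2: 4.0, 3: 0.5}
result = iterative_prune(
    layers=[0, 1, 2, 3],
    n_remove=2,
    evaluate=lambda kept: sum(importance[l] for l in kept),
)
print(result)  # (0, 2) with this toy scorer
```

Unlike removing a fixed block of middle layers, this loop re-evaluates importance after every removal, at the cost of one evaluation per remaining layer per step.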

System Optimization

  • Released both Transformers (trainable) and CTranslate2 (inference) formats

Training Optimization

  • Multi-stage fine-tuning (pre-prune then post-prune)
  • Sequence-level knowledge distillation from NLLB-200 3.3B
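In sequence-level knowledge distillation, the teacher model translates source sentences and the student is then trained on those teacher outputs as if they were reference translations. The sketch below shows only the data-construction step. The `teacher_translate` callable is a hypothetical stand-in for decoding with the larger teacher model, and how the distilled data is combined with the authentic data is not covered here.

```python
from typing import Callable, Iterable


def build_distillation_set(
    sources: Iterable[str],
    teacher_translate: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Pair each source sentence with the teacher model's translation."""
    return [(src, teacher_translate(src)) for src in sources]


# Toy teacher that upper-cases its input, standing in for a real model.
distilled = build_distillation_set(["first sentence", "second sentence"], str.upper)
print(distilled)
```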

Inference Optimization

  • CTranslate2 runtime for batched, efficient decoding
  • Beam size 3 and large token batches (1024 tokens) in evaluation
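A minimal sketch of batched decoding with CTranslate2, using the evaluation settings listed above (beam size 3, token batches of 1024), might look like the following. The model directory and tokenizer name are caller-supplied placeholders, since the released repository names are not given in this summary. The sketch also assumes the pruned model keeps the standard NLLB tokenizer and language-code tokens, which should be verified against the release. Imports are placed inside the function so the helper below it can be used without the libraries installed.

```python
def translate(texts, model_dir, tokenizer_name, src_lang, tgt_lang, device="cuda"):
    """Translate a list of strings with a CTranslate2-converted NLLB-style model."""
    import ctranslate2
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, src_lang=src_lang)
    translator = ctranslate2.Translator(model_dir, device=device, compute_type="float16")

    # CTranslate2 expects token strings, not token ids.
    source = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    results = translator.translate_batch(
        source,
        target_prefix=[[tgt_lang]] * len(source),  # force the target language
        beam_size=3,
        max_batch_size=1024,
        batch_type="tokens",
    )

    outputs = []
    for result in results:
        tokens = result.hypotheses[0][1:]  # drop the leading target-language token
        outputs.append(tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))
    return outputs


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput for a timed decoding run."""
    return n_tokens / seconds


# Example call (paths and names are placeholders):
# translate(["Hello world!"], "path/to/ct2-model", "facebook/nllb-200-distilled-600M",
#           src_lang="eng_Latn", tgt_lang="swh_Latn")
```

Timing a call to `translate` on your own hardware and passing the generated token count to `tokens_per_second` gives a figure comparable in kind to the throughput numbers reported in the Results section.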

Reproducibility

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Covers 15 language pairs only; many African languages remain unsupported.
  • Evaluation uses FLORES-200; domain mismatch may affect real-world behavior.
  • Encoder pruning was not fully explored and may degrade quality if used aggressively.
  • Quantization tested only as FP16; no lower-bit quantization or CPU-only benchmarks reported.

When Not To Use

  • When highest possible translation quality is required for languages not in the 15 supported pairs.
  • On devices that require sub-FP16 quantization or very low-memory footprints (edge/CPU only) without further testing.
  • For domains not represented in the curated training sources without domain adaptation.

Failure Modes

  • Pruning can reduce quality for some language directions, especially when encoder layers are removed.
  • Semantic filtering depends on available embedding models; Lingala lacked semantic filtering support.
  • Distillation and fine-tuning may still leave brittle outputs for low-resource directions with sparse data.

Core Entities

Models

  • NLLB-200 600M (baseline)
  • NLLB-200 3.3B (teacher)
  • AfriNLLB 548M (pruned + FT)
  • AfriNLLB variants: 12-8 (548M), 12-6, 12-4, 8-8 (498M)

Metrics

  • BLEU
  • chrF++
  • COMET
  • AfriCOMET
  • Throughput (tokens/s)
  • Inference time (s)

Datasets

  • Combined curated parallel data (~1.6M samples)
  • Distilled data (568k segments)
  • OPUS sources
  • Hugging Face datasets
  • FLORES-200 (dev/devtest for validation and test)

Benchmarks

  • FLORES-200