Smaller, faster NLLB-based models for 15 African language pairs, with released data and code

February 10, 2026 | 8 min

Overview

Decision Snapshot: Ready For Pilot

Models, data, and code are released. Evaluation across 15 language pairs shows consistent speed gains, with quality recovered after fine-tuning, but it is limited to FLORES-200 and GPU-based inference.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yasmin Moslem, Aman Kassahun Wassie, Amanuel Gizachew Abebe

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AfriNLLB delivers translation models that are 20–57% faster at inference while keeping similar quality, making deployment in constrained environments (limited GPU or server cost) more affordable and easier.

Summary TLDR

AfriNLLB compresses NLLB-200 600M via iterative layer pruning and float16 quantization, then recovers quality with multi-stage fine-tuning and knowledge distillation. The authors curate and filter parallel data for 15 language pairs (mostly African), train pruned models (notably a 548M version), and show average translation quality comparable to the baseline while delivering 20–57% faster inference. They release models, code, and training data to enable practical deployment in resource-constrained settings.
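
The distillation step mentioned above can be sketched roughly as follows. This is a minimal illustration of sequence-level knowledge distillation in general, not the authors' code: a teacher model translates the source side, and the student is later fine-tuned on the resulting (source, teacher translation) pairs. The function name `build_distilled_pairs`, the `teacher_translate` callable (standing in for NLLB-200 3.3B), and the batch size are all assumptions made for this example.

```python
from typing import Callable, Iterable, List, Tuple


def build_distilled_pairs(
    sources: Iterable[str],
    teacher_translate: Callable[[List[str]], List[str]],
    batch_size: int = 32,  # hypothetical value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Pair each non-empty source sentence with the teacher's translation."""
    cleaned = [s.strip() for s in sources if s.strip()]
    pairs: List[Tuple[str, str]] = []
    for start in range(0, len(cleaned), batch_size):
        batch = cleaned[start:start + batch_size]
        hypotheses = teacher_translate(batch)
        if len(hypotheses) != len(batch):
            raise ValueError("teacher must return one translation per source")
        pairs.extend(zip(batch, hypotheses))
    return pairs


def toy_teacher(batch: List[str]) -> List[str]:
    """Toy stand-in for a real teacher model: upper-cases the input."""
    return [sentence.upper() for sentence in batch]


distilled = build_distilled_pairs(
    ["hello world", "   ", "good morning"], toy_teacher, batch_size=2
)
print(distilled)
```

In a real setup, `teacher_translate` would wrap the larger model's batch translation call, and the student would then be fine-tuned on `distilled`, possibly mixed with authentic parallel data.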

Problem Statement

African languages lack compact, deployable translation models and consolidated parallel datasets. Large multilingual models exist but are heavy to run; collecting and cleaning African parallel data is scattered and time-consuming. AfriNLLB aims to make accurate, efficient translation models for African languages and publish the data and code.

Main Contribution

Curated and filtered parallel corpora for 15 language pairs focused on African languages (final training set ~1.6M samples).

Applied iterative layer pruning to NLLB-200 600M to build smaller models (e.g., 548M) and restored quality via fine-tuning and knowledge distillation from NLLB-200 3.3B.
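
To make the pruning idea concrete, here is a minimal sketch of greedy iterative layer pruning, assuming a generic `evaluate` callable that scores a model built from a given set of layers (in practice this might be a translation metric on a development set). The function name, the toy importance table, and the scoring function are hypothetical; the authors' exact procedure may differ.

```python
from typing import Callable, Dict, List, Sequence


def greedy_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[Sequence[int]], float],
    num_to_remove: int,
) -> List[int]:
    """Iteratively drop the layer whose removal costs the least quality."""
    kept = list(layers)
    for _ in range(num_to_remove):
        if len(kept) <= 1:
            break  # never prune the model down to zero layers
        # Score every candidate that has exactly one more layer removed.
        candidates = [
            (evaluate([l for l in kept if l != layer]), layer) for layer in kept
        ]
        # The candidate with the highest score lost the least important layer.
        _, least_important = max(candidates)
        kept.remove(least_important)
    return kept


# Toy per-layer importance values, invented purely for illustration.
TOY_IMPORTANCE: Dict[int, float] = {0: 5.0, 1: 0.5, 2: 3.0, 3: 0.2, 4: 4.0, 5: 1.0}


def toy_evaluate(kept_layers: Sequence[int]) -> float:
    """Pretend quality score: the sum of the importance of retained layers."""
    return sum(TOY_IMPORTANCE[l] for l in kept_layers)


remaining = greedy_layer_pruning(range(6), toy_evaluate, num_to_remove=2)
print(remaining)
```

Each round costs one evaluation per remaining layer, so the procedure stays affordable only when the evaluation set is small.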

Key Findings

Iterative pruning produced a 548M model that runs faster than baseline.

Numbers: Throughput +23% (pruned) and +57% (pruned + FP16) vs NLLB-600M

Practical Use: Use the 548M AfriNLLB model to cut inference time substantially on GPU without large quality loss.

Evidence Ref: Table 4; Table 7

Average translation quality was comparable or slightly improved after fine-tuning and distillation.

Numbers: Avg BLEU 26.21 → 27.05 (+3.2%) on evaluated directions

Practical Use: Prune then fine-tune with distillation to keep or improve quality while saving compute.

Evidence Ref: Table 7 (Average row)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average BLEU (all evaluated directions) | Baseline 26.21 → AfriNLLB 27.05 | NLLB-200 600M | +3.2% | FLORES-200 devtest / averaged directions | Table 7 (Average row) | Table 7
Throughput (tokens/sec) | Baseline 1469.96 → Pruned 1807.61 → Pruned+FP16 3513.32 | NLLB-200 600M | +23% (pruned), +57% (pruned+FP16) | xx→en average (Table 4) | Table 4; Table 5 | Table 4
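
As a quick check on how such deltas are derived, the snippet below recomputes the relative change for the average BLEU and for the pruned model's throughput from the raw values reported above. The helper name `relative_change` is just for this example.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


bleu_delta = relative_change(26.21, 27.05)
pruned_throughput_delta = relative_change(1469.96, 1807.61)

print(f"Average BLEU: {bleu_delta:+.1f}%")                  # +3.2%
print(f"Pruned throughput: {pruned_throughput_delta:+.1f}%")  # +23.0%
```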

What To Try In 7 Days

Download the AfriNLLB CTranslate2 model and test inference latency on your GPU to measure real speedups.

Fine-tune the 548M Transformers checkpoint on a small in-domain sample to check quality recovery for your domain.

Use the authors' filtering pipeline (language ID + semantic + QE) to quickly clean parallel data for an African language of interest.
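
The three-stage filter in the last item (language ID, semantic similarity, quality estimation) could be wired together along these lines. This is a sketch under stated assumptions, not the authors' pipeline: the scorer callables, the thresholds, and the toy data are all hypothetical, and real runs would plug in actual language-ID, embedding, and QE models. The semantic scorer is optional here so that a language without embedding support can skip that stage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]


@dataclass
class FilterConfig:
    src_lang: str
    tgt_lang: str
    min_similarity: float = 0.7  # hypothetical threshold
    min_qe: float = 0.5          # hypothetical threshold


def filter_pairs(
    pairs: List[Pair],
    config: FilterConfig,
    detect_lang: Callable[[str], str],
    qe_score: Callable[[str, str], float],
    similarity: Optional[Callable[[str, str], float]] = None,
) -> List[Pair]:
    """Keep pairs that pass language ID, then similarity (if any), then QE."""
    kept: List[Pair] = []
    for src, tgt in pairs:
        if detect_lang(src) != config.src_lang or detect_lang(tgt) != config.tgt_lang:
            continue  # wrong language on either side
        if similarity is not None and similarity(src, tgt) < config.min_similarity:
            continue  # source and target do not look semantically aligned
        if qe_score(src, tgt) < config.min_qe:
            continue  # estimated translation quality too low
        kept.append((src, tgt))
    return kept


# Toy stand-ins for real models; the values are invented for illustration.
TOY_LANG: Dict[str, str] = {
    "Good morning": "eng_Latn", "Thank you": "eng_Latn", "Yes": "eng_Latn",
    "Habari ya asubuhi": "swh_Latn", "Asante": "swh_Latn",
}
TOY_QE: Dict[Pair, float] = {
    ("Good morning", "Habari ya asubuhi"): 0.9,
    ("Yes", "Asante"): 0.2,  # mistranslation, so a low score
}

candidates = [
    ("Good morning", "Habari ya asubuhi"),
    ("Thank you", "Thank you"),  # target is still in the source language
    ("Yes", "Asante"),
]
kept = filter_pairs(
    candidates,
    FilterConfig(src_lang="eng_Latn", tgt_lang="swh_Latn"),
    detect_lang=lambda text: TOY_LANG[text],
    qe_score=lambda s, t: TOY_QE.get((s, t), 0.0),
)
print(kept)
```

Running the cheap language-ID check first avoids spending model-based scoring on pairs that would be discarded anyway.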

Optimization Features

Token Efficiency
Higher throughput (tokens/sec) after pruning and FP16 quantization
Infra Optimization
Training and evaluation reported on a single A40 48GB GPU
Model Optimization
Iterative layer pruning (greedy layer importance evaluation); decoder-focused pruning (keeps encoder layers intact where tested); FP16 quantization for faster inference
System Optimization
Released both Transformers (trainable) and CTranslate2 (inference) formats
Training Optimization
Multi-stage fine-tuning (pre-prune then post-prune); sequence-level knowledge distillation from NLLB-200 3.3B
Inference Optimization
CTranslate2 runtime for batched, efficient decoding; beam size 3 and large token batches (1024 tokens) in evaluation
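
A throughput figure in tokens/sec can be measured with a small harness like the one below. It is a sketch, not the authors' benchmark script: `measure_throughput` and the toy translator are hypothetical names, and counting generated (output) tokens is an assumption about how throughput is defined. In a real test, `translate_batch` could wrap a CTranslate2 `Translator.translate_batch` call configured with `beam_size=3`, `batch_type="tokens"`, and `max_batch_size=1024` to mirror the settings listed above.

```python
import time
from typing import Callable, Iterable, List

TokenBatch = List[List[str]]


def measure_throughput(
    translate_batch: Callable[[TokenBatch], TokenBatch],
    batches: Iterable[TokenBatch],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second over all batches."""
    total_tokens = 0
    start = clock()
    for batch in batches:
        outputs = translate_batch(batch)
        total_tokens += sum(len(tokens) for tokens in outputs)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def make_fake_clock(times: List[float]) -> Callable[[], float]:
    """Deterministic clock for the demo; real runs use time.perf_counter."""
    ticks = iter(times)
    return lambda: next(ticks)


def echo_translator(batch: TokenBatch) -> TokenBatch:
    """Toy translator that returns its input tokens unchanged."""
    return batch


toy_batches = [[["a", "b"], ["c"]], [["d", "e", "f"]]]
tokens_per_second = measure_throughput(
    echo_translator, toy_batches, clock=make_fake_clock([0.0, 2.0])
)
print(tokens_per_second)  # 6 tokens over 2 simulated seconds -> 3.0
```

For GPU measurements, a warm-up pass before timing and several repeated runs usually give steadier numbers.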

Risks & Boundaries

Limitations

Covers 15 language pairs only; many African languages remain unsupported.

Evaluation uses FLORES-200; domain mismatch may affect real-world behavior.

When Not To Use

When highest possible translation quality is required for languages not in the 15 supported pairs.

On devices that require sub-FP16 quantization or very low-memory footprints (edge/CPU only) without further testing.
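
For a rough sense of the memory footprint involved, the weight storage of a 548M-parameter model can be estimated from the bytes used per parameter. This is back-of-the-envelope arithmetic only: it covers weights, not activations, decoding caches, or runtime overhead, and the `int8` row is included for comparison rather than as a configuration evaluated in the paper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


AFRINLLB_PARAMS = 548_000_000  # parameter count reported for the pruned model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(AFRINLLB_PARAMS, dtype):.2f} GB")
```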

Failure Modes

Pruning can reduce quality for some language directions, especially when encoder layers are removed.

Semantic filtering depends on available embedding models; Lingala lacked semantic filtering support.

Core Entities

Models

NLLB-200 600M (baseline); NLLB-200 3.3B (teacher); AfriNLLB 548M (pruned + FT); AfriNLLB variants: 12-8 (548M), 12-6, 12-4, 8-8 (498M)

Metrics

BLEU; chrF++; COMET; AfriCOMET; Throughput (tokens/s); Inference time (s)

Datasets

Combined curated parallel data (~1.6M samples); Distilled data (568k segments); OPUS sources; Hugging Face datasets; FLORES-200 (dev/devtest for validation and test)

Benchmarks

FLORES-200