Smaller, faster NLLB-based models for 15 African language pairs, with released data and code

Overview

Decision Snapshot: Ready For Pilot

Models, data, and code are released. Evaluation across 15 language pairs shows consistent speed gains, with quality recovered after fine-tuning, but it is limited to FLORES-200 and GPU-based inference.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Yasmin Moslem, Aman Kassahun Wassie, Amanuel Gizachew Abebe

Why It Matters For Business

AfriNLLB delivers translation models that are 20–57% faster at inference while keeping similar quality, which makes deployment in constrained environments (limited GPU capacity or server budget) cheaper and simpler.

Who Should Care

ML Engineer, Product Manager, Founder, Data Scientist

Summary TLDR

AfriNLLB compresses NLLB-200 600M via iterative layer pruning and float16 quantization, then recovers quality with multi-stage fine-tuning and knowledge distillation. The authors curate and filter parallel data for 15 language pairs (mostly African), train pruned models (notably a 548M version), and show average translation quality comparable to the baseline while delivering 20–57% faster inference. They release models, code, and training data to enable practical deployment in resource-constrained settings.

Problem Statement

African languages lack compact, deployable translation models and consolidated parallel datasets. Large multilingual models exist but are heavy to run, and the available African parallel data is scattered across sources, so collecting and cleaning it is time-consuming. AfriNLLB aims to provide accurate, efficient translation models for African languages and to publish the data and code.

Main Contribution

Curated and filtered parallel corpora for 15 language pairs focused on African languages (final training set ~1.6M samples).

Applied iterative layer pruning to NLLB-200 600M to build smaller models (e.g., 548M) and restored quality via fine-tuning and knowledge distillation from NLLB-200 3.3B.

Key Findings

Iterative pruning produced a 548M model that runs faster than baseline.

Numbers: Throughput +23% (pruned) and +57% (pruned + FP16) vs NLLB-200 600M

Practical Use: Use the 548M AfriNLLB model to cut inference time substantially on GPU without large quality loss.

Evidence Ref: Table 4; Table 7

Average translation quality was comparable or slightly improved after fine-tuning and distillation.

Numbers: Avg BLEU 26.21 → 27.05 (+3.2%) on evaluated directions

Practical Use: Prune, then fine-tune with distillation, to keep or improve quality while saving compute.

Evidence Ref: Table 7 (Average row)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average BLEU (all evaluated directions)	Baseline 26.21 → AfriNLLB 27.05	NLLB-200 600M	+3.2%	FLORES-200 devtest / averaged directions	Table 7 (Average row)	Table 7
Throughput (tokens/sec)	Baseline 1469.96 → Pruned 1807.61 → Pruned+FP16 3513.32	NLLB-200 600M	+23% (pruned), +57% (pruned+FP16)	xx→en average (Table 4)	Table 4; Table 5	Table 4
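The deltas in the table can be rechecked from the raw values. The sketch below recomputes them; the helper name relative_change is illustrative and does not come from the paper. Computed this way, the pruned+FP16 throughput gain comes out well above +57%, while the implied reduction in time per token is close to that figure. One possible reading is that the reported 57% refers to inference time rather than throughput, but that is an assumption the source does not confirm.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Values copied from the Results table.
bleu_baseline, bleu_afrinllb = 26.21, 27.05
tps_baseline, tps_pruned, tps_pruned_fp16 = 1469.96, 1807.61, 3513.32

bleu_delta = relative_change(bleu_baseline, bleu_afrinllb)
pruned_delta = relative_change(tps_baseline, tps_pruned)
fp16_delta = relative_change(tps_baseline, tps_pruned_fp16)

# Time per token is the inverse of throughput, so the share of time saved
# is 1 - (baseline throughput / new throughput).
fp16_time_saved = (1.0 - tps_baseline / tps_pruned_fp16) * 100.0

print(f"BLEU change:                 {bleu_delta:+.1f}%")
print(f"Throughput, pruned:          {pruned_delta:+.1f}%")
print(f"Throughput, pruned + FP16:   {fp16_delta:+.1f}%")
print(f"Time per token saved (FP16): {fp16_time_saved:.1f}%")
```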

What To Try In 7 Days

Download the AfriNLLB CTranslate2 model and test inference latency on your GPU to measure real speedups.

Fine-tune the 548M Transformers checkpoint on a small in-domain sample to check quality recovery for your domain.

Use the authors' filtering pipeline (language ID + semantic + QE) to quickly clean parallel data for an African language of interest.
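The three filtering stages named above (language ID, semantic similarity, quality estimation) can be sketched as a chain of checks. This is a minimal illustration, not the authors' pipeline: the function names, the scorer callables, and both thresholds are hypothetical, and the toy scorers stand in for real models. The semantic stage is optional here because the summary notes that Lingala lacked semantic filtering support.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Pair:
    source: str
    target: str


def filter_pairs(
    pairs: Iterable[Pair],
    src_lang: str,
    tgt_lang: str,
    detect_lang: Callable[[str], str],
    qe_score: Callable[[str, str], float],
    semantic_score: Optional[Callable[[str, str], float]] = None,
    semantic_min: float = 0.7,  # hypothetical threshold
    qe_min: float = 0.5,        # hypothetical threshold
) -> List[Pair]:
    """Keep pairs that pass language ID, then semantic, then QE checks."""
    kept = []
    for pair in pairs:
        # Stage 1: both sides must be in the expected languages.
        if detect_lang(pair.source) != src_lang or detect_lang(pair.target) != tgt_lang:
            continue
        # Stage 2: skipped when no embedding model covers the language.
        if semantic_score is not None and semantic_score(pair.source, pair.target) < semantic_min:
            continue
        # Stage 3: quality estimation score.
        if qe_score(pair.source, pair.target) < qe_min:
            continue
        kept.append(pair)
    return kept


# Toy stand-ins so the sketch runs without any models.
ENGLISH = {"good morning", "thank you"}


def toy_detect(text: str) -> str:
    return "en" if text in ENGLISH else "xx"


def toy_semantic(src: str, tgt: str) -> float:
    # Length ratio as a crude proxy for similarity.
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


def toy_qe(src: str, tgt: str) -> float:
    return 0.9


pairs = [
    Pair("xx-sentence-1", "good morning"),  # passes every stage
    Pair("good morning", "thank you"),      # source is in the wrong language
    Pair("x", "thank you"),                 # fails the toy semantic check
]
kept_full = filter_pairs(pairs, "xx", "en", toy_detect, toy_qe, toy_semantic)
kept_no_semantic = filter_pairs(pairs, "xx", "en", toy_detect, toy_qe)
print(len(kept_full), len(kept_no_semantic))
```

Running the cheapest check first (language ID) keeps the more expensive model-based scorers off pairs that would be discarded anyway.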

Optimization Features

Token Efficiency

Higher throughput (tokens/sec) after pruning and FP16 quantization

Infra Optimization

Training and evaluation pipelines reported on a single A40 48GB GPU

Model Optimization

Iterative layer pruning (greedy layer importance evaluation)

Decoder-focused pruning (keeps encoder layers intact where tested)

FP16 quantization for faster inference
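Greedy iterative layer pruning can be sketched as a loop that, at each step, tries removing every remaining layer and drops the one whose removal hurts the evaluation score least. The code below is a toy under stated assumptions: the function name, the evaluate callable, and the per-layer importance values are invented for illustration, and in a real setup evaluate would score the pruned model on validation data. It also assumes the "12-8" variant label means 12 encoder and 8 decoder layers, which the summary does not spell out.

```python
from typing import Callable, List, Sequence


def greedy_layer_pruning(
    layers: Sequence[int],
    evaluate: Callable[[List[int]], float],
    target_count: int,
) -> List[int]:
    """Remove one layer per iteration until `target_count` layers remain."""
    kept = list(layers)
    while len(kept) > target_count:
        best_score, best_idx = float("-inf"), -1
        for i in range(len(kept)):
            # Score the model as if layer kept[i] were removed.
            candidate = kept[:i] + kept[i + 1:]
            score = evaluate(candidate)
            if score > best_score:
                best_score, best_idx = score, i
        # Drop the layer whose removal left the highest score.
        del kept[best_idx]
    return kept


# Toy importance per decoder layer (made-up numbers, not measured).
importance = [0.9, 0.8, 0.1, 0.7, 0.2, 0.6, 0.95, 0.3, 0.85, 0.4, 0.75, 1.0]


def toy_evaluate(kept_layers: List[int]) -> float:
    # Stand-in for a validation metric: sum of the kept layers' importance.
    return sum(importance[i] for i in kept_layers)


decoder_layers = list(range(12))
pruned = greedy_layer_pruning(decoder_layers, toy_evaluate, target_count=8)
print(pruned)
```

Each iteration costs one evaluation per remaining layer, so going from 12 to 8 layers needs 12 + 11 + 10 + 9 evaluations in this sketch.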

System Optimization

Released both Transformers (trainable) and CTranslate2 (inference) formats

Training Optimization

Multi-stage fine-tuning (pre-prune then post-prune)

Sequence-level knowledge distillation from NLLB-200 3.3B
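In sequence-level knowledge distillation, the student is trained on the teacher's translations rather than on the original references. A minimal sketch of building such a distilled corpus follows; build_distilled_corpus and teacher_translate are hypothetical names, and the uppercase "teacher" is only a placeholder for a call to a large model such as NLLB-200 3.3B.

```python
from typing import Callable, List, Tuple


def build_distilled_corpus(
    sources: List[str],
    teacher_translate: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> List[Tuple[str, str]]:
    """Pair each source sentence with the teacher's translation of it."""
    distilled: List[Tuple[str, str]] = []
    for start in range(0, len(sources), batch_size):
        batch = sources[start:start + batch_size]
        hypotheses = teacher_translate(batch)
        if len(hypotheses) != len(batch):
            raise ValueError("teacher returned a different number of outputs")
        distilled.extend(zip(batch, hypotheses))
    return distilled


def toy_teacher(batch: List[str]) -> List[str]:
    # Placeholder for real teacher inference.
    return [text.upper() for text in batch]


sources = ["first sentence", "second sentence", "third sentence"]
distilled = build_distilled_corpus(sources, toy_teacher, batch_size=2)
print(distilled)
```

The resulting (source, teacher output) pairs would then be used as fine-tuning data for the pruned student.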

Inference Optimization

CTranslate2 runtime for batched, efficient decoding

Beam size 3 and large token batches (1024 tokens) in evaluation
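To see what token-based batching and a throughput measurement look like in practice, here is a small harness. It is a sketch, not the authors' evaluation script: token_batches and measure_throughput are hypothetical helpers, the 1024-token limit is applied to the padded batch size (longest sentence times sentence count), and the echo function stands in for a real decoder.

```python
import time
from typing import Callable, Iterator, List

Tokens = List[str]


def token_batches(sentences: List[Tokens], max_tokens: int = 1024) -> Iterator[List[Tokens]]:
    """Group sentences so longest_length * batch_size stays within max_tokens."""
    batch: List[Tokens] = []
    longest = 0
    # Sorting by length keeps padding waste low within each batch.
    for sent in sorted(sentences, key=len):
        if batch and max(longest, len(sent)) * (len(batch) + 1) > max_tokens:
            yield batch
            batch, longest = [], 0
        batch.append(sent)
        longest = max(longest, len(sent))
    if batch:
        yield batch


def measure_throughput(
    translate: Callable[[List[Tokens]], List[Tokens]],
    sentences: List[Tokens],
    max_tokens: int = 1024,
) -> float:
    """Return generated tokens per second over all batches."""
    generated = 0
    start = time.perf_counter()
    for batch in token_batches(sentences, max_tokens):
        outputs = translate(batch)
        generated += sum(len(out) for out in outputs)
    elapsed = time.perf_counter() - start
    return generated / elapsed if elapsed > 0 else float("inf")


def echo_translate(batch: List[Tokens]) -> List[Tokens]:
    # Placeholder decoder. With CTranslate2, this could wrap
    # Translator.translate_batch(batch, beam_size=3, ...) and return
    # each result's first hypothesis.
    return [list(sent) for sent in batch]


sentences = [["tok"] * n for n in (5, 300, 700, 12, 512, 512)]
batches = list(token_batches(sentences))
tps = measure_throughput(echo_translate, sentences)
print(len(batches), tps > 0)
```

CTranslate2 can also batch by tokens itself (the max_batch_size and batch_type options of translate_batch), so the manual batching here mainly serves to make the idea visible.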

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/AfriNLP/AfriNLLB https://hf.co/collections/AfriNLP/afrinllb

Data URLs

https://hf.co/collections/AfriNLP/afrinllb https://github.com/AfriNLP/AfriNLLB

Risks & Boundaries

Limitations

Covers 15 language pairs only; many African languages remain unsupported.

Evaluation uses FLORES-200; domain mismatch may affect real-world behavior.

When Not To Use

When highest possible translation quality is required for languages not in the 15 supported pairs.

On devices that require sub-FP16 quantization or very low-memory footprints (edge/CPU only) without further testing.

Failure Modes

Pruning can reduce quality for some language directions, especially when encoder layers are removed.

Semantic filtering depends on available embedding models; Lingala lacked semantic filtering support.

Core Entities

Models

NLLB-200 600M (baseline)

NLLB-200 3.3B (teacher)

AfriNLLB 548M (pruned + FT)

AfriNLLB variants: 12-8 (548M), 12-6, 12-4, 8-8 (498M)

Metrics

BLEU

chrF++

COMET

AfriCOMET

Throughput (tokens/s)

Inference time (s)

Datasets

Combined curated parallel data (~1.6M samples)

Distilled data (568k segments)

OPUS sources

Hugging Face datasets

FLORES-200 (dev/devtest for validation and test)

Benchmarks

FLORES-200
