Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.7
Citation Count: 0
Why It Matters For Business
AfriNLLB delivers translation models that are 20–57% faster at inference while keeping similar quality, making deployment in constrained environments (limited GPU or server cost) more affordable and easier.
Summary TLDR
AfriNLLB compresses NLLB-200 600M via iterative layer pruning and float16 quantization, then recovers quality with multi-stage fine-tuning and knowledge distillation. The authors curate and filter parallel data for 15 language pairs (mostly African), train pruned models (notably a 548M version), and show average translation quality comparable to the baseline while delivering 20–57% faster inference. They release models, code, and training data to enable practical deployment in resource-constrained settings.
Problem Statement
African languages lack compact, deployable translation models and consolidated parallel datasets. Large multilingual models exist but are heavy to run, and African parallel data is scattered across sources, so collecting and cleaning it is time-consuming. AfriNLLB aims to provide accurate, efficient translation models for African languages and to publish the data and code.
Main Contribution
Curated and filtered parallel corpora for 15 language pairs focused on African languages (final training set ~1.6M samples).
Applied iterative layer pruning to NLLB-200 600M to build smaller models (e.g., 548M) and restored quality via fine-tuning and knowledge distillation from NLLB-200 3.3B.
Demonstrated inference speedups (≈23% for the pruned model; ≈57% with FP16) while keeping translation quality comparable on FLORES-200.
Released models in Transformers and CTranslate2 formats, plus code and training data to support reuse.
Key Findings
Iterative pruning produced a 548M model that runs faster than the baseline.
Average translation quality was comparable or slightly improved after fine-tuning and distillation.
Training used about 1.6M curated parallel samples, and the distilled data comprised 568k segments.
Pruning strategy matters: iterative pruning beats middle-layer removal.
Results
Average BLEU (all evaluated directions)
Throughput (tokens/sec)
Model size after pruning
Training data used (after filtering and sampling)
Who Should Care
What To Try In 7 Days
Download the AfriNLLB CTranslate2 model and test inference latency on your GPU to measure real speedups.
Fine-tune the 548M Transformers checkpoint on a small in-domain sample to check quality recovery for your domain.
Use the authors' filtering pipeline (language ID + semantic + QE) to quickly clean parallel data for an African language of interest.
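The filtering idea in the last item can be pictured as a cascade of scored checks. The sketch below is a minimal illustration and not the authors' pipeline: `FilterStage`, `filter_pairs`, the thresholds, and the three toy scorers are hypothetical stand-ins for a real language-ID model, a sentence-embedding similarity model, and a quality-estimation model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


@dataclass
class FilterStage:
    """One filtering step: a scorer plus the minimum score needed to pass."""
    name: str
    score: Callable[[str, str], float]  # higher means "more likely a good pair"
    threshold: float


def filter_pairs(pairs: Iterable[Pair], stages: List[FilterStage]) -> List[Pair]:
    """Keep only the pairs that pass every stage, checked in order."""
    kept = []
    for src, tgt in pairs:
        # all() short-circuits, so later (usually costlier) stages are
        # skipped as soon as an earlier stage rejects the pair.
        if all(stage.score(src, tgt) >= stage.threshold for stage in stages):
            kept.append((src, tgt))
    return kept


# Toy scorers below are assumptions for illustration only.
def toy_language_id(src: str, tgt: str) -> float:
    # Placeholder: a real check would run a language identifier on both sides.
    return 1.0 if src.strip() and tgt.strip() else 0.0


def toy_semantic_similarity(src: str, tgt: str) -> float:
    # Placeholder: word-count ratio instead of embedding cosine similarity.
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0


def toy_quality_estimate(src: str, tgt: str) -> float:
    # Placeholder: reject targets that merely copy the source.
    return 0.0 if src.strip() == tgt.strip() else 1.0


STAGES = [
    FilterStage("language_id", toy_language_id, 1.0),
    FilterStage("semantic", toy_semantic_similarity, 0.5),
    FilterStage("qe", toy_quality_estimate, 0.5),
]
```

Because the stages are just a list, a stage can be dropped for a language that has no suitable model, for example when no embedding model is available for semantic filtering.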
Optimization Features
Token Efficiency
- Higher throughput (tokens/sec) after pruning and FP16 quantization
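To compare throughput on your own hardware, a small timing harness is usually enough. The sketch below assumes throughput means generated tokens per second and defines relative speedup as the percentage gain in throughput; the summary does not state which definition the authors used. `measure_throughput`, `speedup_percent`, and the `translate_batch` callable are hypothetical names, and any real translator can be wrapped to fit that callable.

```python
import time
from typing import Callable, Iterable, List, Sequence


def measure_throughput(
    translate_batch: Callable[[Sequence[List[str]]], List[List[str]]],
    batches: Iterable[Sequence[List[str]]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second over all batches."""
    total_tokens = 0
    start = clock()
    for batch in batches:
        outputs = translate_batch(batch)
        # Count output tokens (assumption: input tokens are not counted).
        total_tokens += sum(len(tokens) for tokens in outputs)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def speedup_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Percentage gain in throughput of the candidate over the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0
```

Run the same batches through the baseline and the compressed model, after a warm-up pass, and compare the two numbers with `speedup_percent`.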
Infra Optimization
- Training and evaluation pipelines reported on a single A40 48GB GPU
Model Optimization
- Iterative layer pruning (greedy layer importance evaluation)
- Decoder-focused pruning (keeps encoder layers intact where tested)
- FP16 quantization for faster inference
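Greedy iterative layer pruning can be sketched as a loop that, at each step, tries removing every remaining layer and drops the one whose removal hurts a quality score the least. This is an illustrative sketch under stated assumptions, not the authors' code: `iterative_prune` and the `evaluate` callback are hypothetical, and whether any fine-tuning happens between removals is not specified in this summary.

```python
from typing import Callable, List, Sequence


def iterative_prune(
    layers: Sequence[int],
    evaluate: Callable[[List[int]], float],
    target: int,
) -> List[int]:
    """Greedily remove layers until only `target` layers remain.

    `evaluate` receives the list of kept layer ids and returns a quality
    score where higher is better (for example, a metric on a dev set).
    """
    kept = list(layers)
    if not 0 < target <= len(kept):
        raise ValueError("target must be between 1 and the number of layers")
    while len(kept) > target:
        best_score = None
        best_index = None
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]  # model without layer kept[i]
            score = evaluate(candidate)
            if best_score is None or score > best_score:
                best_score, best_index = score, i
        # Drop the layer whose removal left the highest score.
        del kept[best_index]
    return kept
```

Each step costs one evaluation per remaining layer, so the procedure is more expensive than removing a fixed block of middle layers, but it lets the data decide which layers are least important.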
System Optimization
- Released both Transformers (trainable) and CTranslate2 (inference) formats
Training Optimization
- Multi-stage fine-tuning (before pruning, then again after pruning)
- Sequence-level knowledge distillation from NLLB-200 3.3B
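In sequence-level knowledge distillation, the teacher translates source sentences and the student is then trained on those (source, teacher output) pairs as if they were ordinary parallel data. The sketch below shows only the data-building step; `build_distilled_pairs` and the `teacher_translate` callable are hypothetical, and the deduplication and empty-output checks are assumptions rather than details taken from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def build_distilled_pairs(
    sources: Iterable[str],
    teacher_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Translate each unique source with the teacher and collect the pairs."""
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for src in sources:
        src = src.strip()
        if not src or src in seen:
            continue  # skip empty lines and duplicates
        seen.add(src)
        hypothesis = teacher_translate(src).strip()
        if hypothesis:  # drop sources the teacher could not translate
            pairs.append((src, hypothesis))
    return pairs
```

The resulting pairs can be passed to the same fine-tuning loop used for authentic parallel data, with the larger model's outputs acting as the targets.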
Inference Optimization
- CTranslate2 runtime for batched, efficient decoding
- Beam size 3 and large token batches (1024 tokens) in evaluation
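Batching by token budget rather than by sentence count keeps padded batch size roughly constant. The sketch below is one simple way to do it and is not CTranslate2's internal algorithm: `batch_by_tokens` is a hypothetical helper that sorts sentences by length and caps the padded size (number of sentences times the longest sentence) at `max_tokens`.

```python
from typing import List, Sequence


def batch_by_tokens(
    sentences: Sequence[Sequence[str]],
    max_tokens: int = 1024,
) -> List[List[int]]:
    """Group sentence indices so each padded batch fits in `max_tokens`.

    Returns batches of indices into `sentences`, so outputs can be put
    back in the original order after translation.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Sorting by length keeps similar lengths together and reduces padding.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        length = len(sentences[idx])
        # Lengths are ascending, so this sentence would be the longest in
        # the batch and the padded cost would be (count + 1) * length.
        if current and (len(current) + 1) * length > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)  # an oversized sentence ends up alone
    if current:
        batches.append(current)
    return batches
```

CTranslate2's `translate_batch` exposes `beam_size`, `max_batch_size`, and `batch_type` arguments for a similar purpose; how it counts tokens internally is not covered here.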
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Covers 15 language pairs only; many African languages remain unsupported.
- Evaluation uses FLORES-200; domain mismatch may affect real-world behavior.
- Encoder pruning was not fully explored and may degrade quality if used aggressively.
- Quantization tested only as FP16; no lower-bit quantization or CPU-only benchmarks reported.
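A back-of-the-envelope estimate helps judge whether FP16 alone fits a given device. The sketch below counts weight storage only, assuming 4 bytes per float32 parameter and 2 bytes per float16 parameter; activations, decoding caches, and runtime overhead are ignored, and `weight_memory_gib` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# The 548M-parameter model mentioned in this summary.
fp32_gib = weight_memory_gib(548_000_000, "float32")
fp16_gib = weight_memory_gib(548_000_000, "float16")
```

Under these assumptions the 548M model needs roughly 1 GiB for weights in float16, half of the float32 figure, which is why tighter memory budgets would call for lower-bit formats that the authors did not test.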
When Not To Use
- When highest possible translation quality is required for languages not in the 15 supported pairs.
- On devices that require sub-FP16 quantization or very low-memory footprints (edge/CPU only) without further testing.
- For domains not represented in the curated training sources without domain adaptation.
Failure Modes
- Pruning can reduce quality for some language directions, especially when encoder layers are removed.
- Semantic filtering depends on available embedding models; Lingala lacked semantic filtering support.
- Distillation and fine-tuning may still leave brittle outputs for low-resource directions with sparse data.
Core Entities
Models
- NLLB-200 600M (baseline)
- NLLB-200 3.3B (teacher)
- AfriNLLB 548M (pruned + FT)
- AfriNLLB variants: 12-8 (548M), 12-6, 12-4, 8-8 (498M)
Metrics
- BLEU
- chrF++
- COMET
- AfriCOMET
- Throughput (tokens/s)
- Inference time (s)
Datasets
- Combined curated parallel data (~1.6M samples)
- Distilled data (568k segments)
- OPUS sources
- Hugging Face datasets
- FLORES-200 (dev split for validation, devtest split for testing)
Benchmarks
- FLORES-200

