Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can substantially improve African-language quality and document translation by continuing to pretrain a strong open base model on a curated data mix instead of training from scratch.
Summary TLDR
This paper builds AfriqueLLM, a suite of open models that adapt 5 base LLMs to 20 African languages through continued pre-training (CPT) on 26B tokens. The core finding: what you train on matters more than model size. Mixing monolingual African text with code, math, and high-quality synthetic translations (the CMS mix) consistently improves accuracy and reasoning. Qwen 3 bases showed the largest relative gains after CPT (up to +78.8% relative), and CPT also improved long-context document translation (e.g., +12.4 d-chrF over an SFT baseline). Models and configs will be released on Hugging Face.
Problem Statement
Open LLMs lag on African languages because pretraining corpora lack domain coverage (math, code, curated topical content). Continued pre-training can help but often degrades reasoning or high-resource language (HRL) performance when data is imbalanced or noisy. The paper asks: which data mixes and base-model choices yield the best CPT outcomes for African languages?
Main Contribution
AfriqueLLM: CPT-adapted models for 20 African languages using a 26B-token corpus.
Systematic CPT ablation across five base models from three families (Gemma 3, Llama 3.1, Qwen 3) and multiple data mixtures.
Practical recipe (CMS = Monolingual + Code + Math + high-quality Synthetic translations) that preserves reasoning and boosts translation.
Empirical finding that base-model capability and data composition outweigh raw parameter scale for CPT gains.
Demonstrated improved long-context document translation without in-domain fine-tuning.
Key Findings
CPT data composition is the single strongest driver of gains.
Adding math and code recovers and improves reasoning degraded by raw web text.
Strong general capability in the base model matters more for CPT than prior multilingual coverage.
For larger models, high-quality synthetic translations help more than noisy parallel data.
CPT improves long-context document translation without task-specific fine-tuning.
Results
AfroBench overall (combined tasks)
AfriMGSM (math)
Document-level translation (eng→xx) d-chrF
Relative improvement from CPT (example)
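The relative figures quoted in this summary (such as "+78.8% rel.") are assumed here to follow the usual definition: the score gain divided by the base score. A minimal sketch of that arithmetic; the function name and the example scores are hypothetical and not taken from the paper:

```python
def relative_improvement(base_score: float, cpt_score: float) -> float:
    """Relative gain of cpt_score over base_score, in percent."""
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    return (cpt_score - base_score) / base_score * 100.0


# Hypothetical scores for illustration only (not results from the paper).
print(f"{relative_improvement(40.0, 50.0):+.1f}% rel.")  # prints "+25.0% rel."
```

A low base score inflates the relative number, so it is worth reading relative gains alongside the absolute scores.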
Who Should Care
What To Try In 7 Days
Run a short CPT pass on your base model using a CMS mix: monolingual African text + ~1B tokens each of code and math + filtered synthetic translations.
Cap each high-resource language at roughly 1B tokens with UniMax-like sampling so that English and French do not dominate the mix.
Use a 16k context window if you need document-level capabilities and test with d-chrF or SSA-COMET on representative docs.
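The token budgeting behind the steps above can be sketched as follows. This is a simplified reading of UniMax-style allocation, not the paper's code: split the language budget as evenly as possible, never assign a language more than `max_epochs` passes over its available text, and pass any unused share on to larger corpora. The function name, language codes, and all token counts (in billions) are hypothetical.

```python
def unimax_like_budget(
    available: dict[str, float],
    total_budget: float,
    max_epochs: float = 1.0,
) -> dict[str, float]:
    """Split total_budget across languages as evenly as possible, capping each
    language at max_epochs passes over its available tokens."""
    allocation: dict[str, float] = {}
    remaining = total_budget
    # Visit the smallest corpora first so their unused share flows to larger ones.
    languages = sorted(available, key=available.get)
    for i, lang in enumerate(languages):
        fair_share = remaining / (len(languages) - i)
        allocation[lang] = min(fair_share, available[lang] * max_epochs)
        remaining -= allocation[lang]
    return allocation


# Hypothetical corpus sizes in billions of tokens (not figures from the paper).
available_tokens = {"eng": 500.0, "fra": 100.0, "swa": 2.0, "yor": 0.5}
language_mix = unimax_like_budget(available_tokens, total_budget=4.0)

# CMS-style mix: the language allocation plus roughly 1B tokens each of code and math.
cms_mix = {**language_mix, "code": 1.0, "math": 1.0}
for source, billions in sorted(cms_mix.items()):
    print(f"{source}: {billions:.2f}B tokens")
```

With this scheme a high-resource language never receives more than its even share of the budget, which is what keeps English and French from dominating.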
Optimization Features
Token Efficiency
- UniMax sampling to rebalance languages
Infra Optimization
- H100 clusters (16 nodes/64 GPUs used in runs)
Model Optimization
- sequence packing for throughput
- 16k context window tuning
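Sequence packing, listed above, is commonly done by concatenating tokenized documents with a separator token and slicing the stream into fixed-length training sequences. The sketch below shows that greedy variant. The summary does not say which packing scheme the paper uses (for example, whether attention is masked across document boundaries), so treat the function name, the `eos_id` value, and the toy token IDs as assumptions.

```python
from typing import Iterable, Iterator


def pack_sequences(
    docs: Iterable[list[int]], max_len: int, eos_id: int
) -> Iterator[list[int]]:
    """Concatenate tokenized docs (each followed by eos_id) and yield
    fixed-length chunks of exactly max_len tokens."""
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= max_len:
            yield buffer[:max_len]
            buffer = buffer[max_len:]
    # Any trailing partial sequence is dropped in this sketch.


# Toy token IDs; a real run would use max_len=16384 for a 16k context window.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_sequences(toy_docs, max_len=4, eos_id=0))
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Packing avoids padding tokens, so every position in a batch contributes to the loss, at the cost of documents occasionally being split across sequences.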
System Optimization
- bf16 mixed precision and gradient accumulation
- dynamic gradient accumulation to match hardware
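One plausible reading of "dynamic gradient accumulation to match hardware" is that the number of accumulation steps is derived from the GPU count so the global batch size stays fixed when the cluster size changes. A minimal sketch under that assumption; the function name, batch sizes, and the 32-GPU case are hypothetical, and only the 64-GPU figure comes from this summary:

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, num_gpus: int) -> int:
    """Accumulation steps needed so that
    micro_batch_size * num_gpus * steps == global_batch_size."""
    sequences_per_step = micro_batch_size * num_gpus
    if global_batch_size % sequences_per_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of micro_batch_size * num_gpus"
        )
    return global_batch_size // sequences_per_step


# Hypothetical batch sizes; 64 GPUs matches the cluster size listed above.
print(grad_accum_steps(global_batch_size=512, micro_batch_size=2, num_gpus=64))  # 4
print(grad_accum_steps(global_batch_size=512, micro_batch_size=2, num_gpus=32))  # 8
```

Keeping the global batch size constant this way means learning-rate settings tuned on one cluster size should carry over to another.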
Training Optimization
- DeepSpeed ZeRO-1/2 for memory
- FlashAttention-3 and Liger kernel for speed
- learning-rate and scheduler ablations (cosine, warmup)
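The cosine schedule with warmup named in the ablations above usually means a linear ramp to a peak learning rate followed by cosine decay. The sketch below shows that common form. The peak rate, step counts, and function name are hypothetical, and the paper's actual settings are not given in this summary.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings for illustration only.
for step in (0, 99, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4))
```

In CPT, a short warmup is often used to avoid a large early update that could disturb what the base model already knows, though the right length is an empirical question.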
Inference Optimization
- vLLM backend for evaluation
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Covers 20 African languages; many languages remain unsupported.
- Model sizes limited to ≤14B; dynamics may change at 30B+.
- Focus on base-model CPT only; no instruction tuning was performed.
- Larger models show sensitivity to noisy parallel data and hyperparameter heuristics.
When Not To Use
- If your target language is not among the 20 covered languages (transfer to unseen languages is limited).
- When instruction-following behavior is required immediately: these are base CPT checkpoints, not instruction-tuned models.
- When you must avoid any HRL degradation and cannot afford even small drops in English/French performance.
Failure Modes
- Catastrophic forgetting on high-resource languages if HRLs are excluded or uncapped.
- Quality-sensitive: noisy parallel corpora can harm larger models (12B+).
- Limited transfer: CPT benefits mostly languages included in the mixture, not unseen languages.
Core Entities
Models
- AfriqueQwen-14B
- AfriqueQwen-8B
- AfriqueGemma-12B
- AfriqueGemma-4B
- AfriqueLlama-8B
- Qwen 3 8B
- Qwen 3 14B
- Gemma 3 4B
- Gemma 3 12B
- Llama 3.1 8B
Metrics
- SSA-COMET (MT semantic metric)
- Accuracy
- d-chrF (document chrF)
- chrF++
Datasets
- FineWeb2
- WURA
- MADLAD-400
- CornStack-Python (Code)
- FineMath (Math)
- NLLB-OPUS (Parallel)
- Synthetic GPT-4.1 translations
- OpenMathReasoning (math cot)
Benchmarks
- AfroBench
- AfroBench-Lite
- AfriMGSM
- AfriMMLU
- AfriXNLI
- Flores
- Belebele
- Injongo
- SIB-200
- AFRIDOC-MT

