Data mix (math, code, synthetic) plus the right base model beats scale for African-language CPT

January 10, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides clear ablations and multiple baselines showing consistent trends, but experiments stop at 14B and some datasets or code URLs are not fully released.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Hao Yu, Tianyi Xu, Michael A. Hedderich, Wassim Hamidouche, Syed Waqas Zamir, David Ifeoluwa Adelani

Links

Abstract / PDF

Why It Matters For Business

You can substantially improve African-language quality and document translation through continued pretraining of a strong open base model on a curated data mix, instead of training from scratch.

Who Should Care

Summary TLDR

This paper builds AfriqueLLM, a suite of open models produced by continued pretraining (CPT) of five base LLMs on 26B tokens to adapt them to 20 African languages. The core finding: what you train on matters more than model size. Mixing monolingual African text with code, math, and high-quality synthetic translations (the CMS mix) consistently improves accuracy and reasoning. Qwen 3 bases showed the largest relative gains after CPT (up to +78.8% relative), and CPT also improved long-context document translation (e.g., +12.4 d-chrF over an SFT baseline). Models and configs will be released on Hugging Face.

Problem Statement

Open LLMs lag on African languages because pretraining corpora lack domain coverage (math, code, curated topical content). Continued pre-training can help but often degrades reasoning or high-resource language (HRL) performance when data is imbalanced or noisy. The paper asks: which data mixes and base-model choices yield the best CPT outcomes for African languages?

Main Contribution

AfriqueLLM: CPT-adapted models for 20 African languages using a 26B-token corpus.

Systematic CPT ablation across five base models from three families (Gemma 3, Llama 3.1, Qwen 3) and multiple data mixtures.

Key Findings

CPT data composition is the single strongest driver of gains.

Numbers: The CMS recipe gave the best scores on multiple tasks (e.g., Flores 66.23 at 12B).

Practical Use: Prioritize adding code, math, and curated synthetic translations when adapting an LLM rather than only more monolingual web text.

Evidence Ref: Table 2; Table 3

Adding math and code recovers and improves reasoning degraded by raw web text.

Numbers: Gemma 4B: AfriMGSM rose from 9.25 to 10.68 (M→CM), with larger jumps in the AfriqueQwen variants.

Practical Use: Include ~1B tokens each of code and math in CPT to restore reasoning priors lost to noisy monolingual data.

Evidence Ref: Table 2; Table 14
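
To put the "~1B tokens each" guidance in proportion, here is a minimal sketch that turns token budgets into sampling shares. It assumes the 26B-token total quoted in the summary already includes the code and math slices, which this digest does not confirm. The function name and the lumped "monolingual_and_synthetic" bucket are illustrative, not taken from the paper.

```python
def mixture_shares(token_budgets_b):
    """Convert per-source token budgets (in billions) into sampling shares."""
    total = sum(token_budgets_b.values())
    if total <= 0:
        raise ValueError("total token budget must be positive")
    return {name: tokens / total for name, tokens in token_budgets_b.items()}


# Figures taken from the digest: 26B tokens overall, ~1B each of code and math.
TOTAL_B = 26.0
budgets = {"code": 1.0, "math": 1.0}

# ASSUMPTION: everything else (monolingual text plus synthetic translations)
# fills the remainder; the digest does not give the split between the two.
budgets["monolingual_and_synthetic"] = TOTAL_B - sum(budgets.values())

shares = mixture_shares(budgets)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under that assumption, code and math together are a small slice of the corpus, which is consistent with the finding that composition rather than volume drives the reasoning recovery.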

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
AfroBench overall (combined tasks) | AfriqueQwen-14B: 63.79 | Qwen3-14B base: 39.88 | +23.91 abs (+60.0% rel) | AfroBench-Lite (see Table 3) | Table 3: overall scores and ∆ % | Table 3
AfriMGSM (math) | AfriqueQwen-14B: 45.01 | Qwen3-14B base: 16.6 | +28.41 abs | AfriMGSM (8-shot CoT) | Table 3 AfriMGSM column | Table 3
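
The deltas in the results above can be re-derived from the reported scores. The short sketch below does that arithmetic; the helper name is illustrative, and the relative gain it prints for AfriMGSM is computed here rather than quoted from the paper.

```python
def deltas(adapted, base):
    """Return (absolute gain, relative gain in percent) for two scores."""
    abs_gain = adapted - base
    rel_gain_pct = 100.0 * abs_gain / base
    return round(abs_gain, 2), round(rel_gain_pct, 1)


# Scores as reported in the results above: (CPT-adapted model, base model).
rows = {
    "AfroBench-Lite overall": (63.79, 39.88),
    "AfriMGSM (8-shot CoT)": (45.01, 16.6),
}

for name, (adapted, base) in rows.items():
    abs_gain, rel_gain = deltas(adapted, base)
    print(f"{name}: +{abs_gain} abs, +{rel_gain}% rel")
```

Both absolute deltas and the +60.0% relative figure for the overall score match the reported values.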

What To Try In 7 Days

Run a short CPT pass on your base model using a CMS mix: monolingual African text + ~1B tokens each of code and math + filtered synthetic translations.

Cap high-resource languages at ≈1B tokens using UniMax-style sampling so that English and French do not dominate the mix.

Use a 16k context window if you need document-level capabilities and test with d-chrF or SSA-COMET on representative docs.
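
The language-capping step can be sketched as a UniMax-style budget allocation: spread the token budget as evenly as possible across languages while limiting how many times any small corpus is repeated. This is a simplified illustration of the general idea, not the paper's sampler. The corpus sizes, the 5B budget, and the 4-epoch cap are hypothetical values chosen for the example.

```python
def unimax_style_budget(corpus_tokens, total_budget, max_epochs=4.0):
    """Allocate total_budget across languages as evenly as possible,
    never repeating any language's corpus more than max_epochs times."""
    allocation = {}
    remaining_budget = float(total_budget)
    # Visit the smallest corpora first so their unused share rolls over
    # to the languages that still have data to spare.
    ordered = sorted(corpus_tokens, key=corpus_tokens.get)
    for index, lang in enumerate(ordered):
        languages_left = len(ordered) - index
        even_share = remaining_budget / languages_left
        repetition_cap = corpus_tokens[lang] * max_epochs
        allocation[lang] = min(even_share, repetition_cap)
        remaining_budget -= allocation[lang]
    return allocation


# HYPOTHETICAL corpus sizes in billions of tokens (not from the paper).
corpus_b = {"eng": 500.0, "fra": 200.0, "swa": 2.0, "hau": 1.0, "yor": 0.2}

allocation = unimax_style_budget(corpus_b, total_budget=5.0, max_epochs=4.0)
for lang, tokens in sorted(allocation.items()):
    print(f"{lang}: {tokens:.2f}B tokens")
```

With these made-up sizes, English and French end up with the same budget as mid-sized African languages (about 1B tokens each) instead of scaling with their raw corpus size, while the smallest corpus is held to its repetition cap.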

Optimization Features

Token Efficiency
UniMax sampling to rebalance languages
Infra Optimization
H100 clusters (16 nodes/64 GPUs used in runs)
Model Optimization
Sequence packing for throughput; 16k context window tuning
System Optimization
Mixed precision bf16 and gradient accumulation; dynamic gradient accumulation to match hardware
Training Optimization
DeepSpeed ZeRO-1/2 for memory; FlashAttention-3 and Liger kernel for speed; learning-rate and scheduler ablations (cosine, warmup)
Inference Optimization
vLLM backend for evaluation
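
Two of the items above lend themselves to a small illustration: choosing gradient accumulation steps so the global batch stays fixed across hardware, and packing documents into fixed-length windows. The sketch below is a plain-Python approximation, not the authors' training code. The 64-GPU count comes from the list above; the global batch of 512 sequences, the micro-batch of 2, the reading of "16k" as 16,384 tokens, and both function names are assumptions.

```python
def grad_accum_steps(global_batch_seqs, micro_batch_per_gpu, num_gpus):
    """Accumulation steps needed to reach a fixed global batch size."""
    per_step = micro_batch_per_gpu * num_gpus
    if global_batch_seqs % per_step != 0:
        raise ValueError("global batch must be a multiple of micro_batch * gpus")
    return global_batch_seqs // per_step


def pack_lengths(doc_lengths, context_len=16_384):
    """First-fit packing of document lengths into windows of context_len tokens.

    Documents longer than the window are split into window-sized chunks first.
    Returns a list of windows, each a list of chunk lengths.
    """
    windows = []
    for length in doc_lengths:
        chunks = [context_len] * (length // context_len)
        if length % context_len:
            chunks.append(length % context_len)
        for chunk in chunks:
            for window in windows:
                if sum(window) + chunk <= context_len:
                    window.append(chunk)
                    break
            else:
                # No existing window has room, so open a new one.
                windows.append([chunk])
    return windows


# 64 GPUs as listed above; batch sizes are HYPOTHETICAL.
steps = grad_accum_steps(global_batch_seqs=512, micro_batch_per_gpu=2, num_gpus=64)
print("gradient accumulation steps:", steps)

# HYPOTHETICAL document lengths in tokens.
packed = pack_lengths([10_000, 6_000, 8_000, 20_000])
print("packed windows:", packed)
```

Keeping the global batch constant while the accumulation steps vary is what lets the same recipe run on fewer GPUs, and packing reduces the padding that short documents would otherwise waste in a long window.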

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Covers 20 African languages; many languages remain unsupported.

Model sizes limited to ≤14B; dynamics may change at 30B+.

When Not To Use

If your target language is not in the 20 covered languages (limited transfer to unseen languages).

When instruction-following behavior is required immediately, since these are base CPT checkpoints, not instruction-tuned models.

Failure Modes

Catastrophic forgetting on high-resource languages if HRLs are excluded or uncapped.

Quality-sensitive: noisy parallel corpora can harm larger models (12B+).

Core Entities

Models

AfriqueQwen-14B, AfriqueQwen-8B, AfriqueGemma-12B, AfriqueGemma-4B, AfriqueLlama-8B, Qwen 3 8B, Qwen 3 14B, Gemma 3 4B, Gemma 3 12B, Llama 3.1 8B

Metrics

SSA-COMET (MT semantic metric), Accuracy, d-chrF (document chrF), chrF++

Datasets

FineWeb2, WURA, MADLAD-400, CornStack-Python (Code), FineMath (Math), NLLB-OPUS (Parallel), Synthetic GPT-4.1 translations, OpenMathReasoning (math CoT)

Benchmarks

AfroBench, AfroBench-Lite, AfriMGSM, AfriMMLU, AfriXNLI, Flores, Belebele, Injongo, SIB-200, AFRIDOC-MT