Data mix (math, code, synthetic) plus the right base model beats scale for African-language CPT

Overview

Decision Snapshot: Needs Validation

Paper provides clear ablations and multiple baselines showing consistent trends, but experiments stop at 14B and some datasets or code URLs are not fully released.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Hao Yu, Tianyi Xu, Michael A. Hedderich, Wassim Hamidouche, Syed Waqas Zamir, David Ifeoluwa Adelani

Why It Matters For Business

You can substantially improve African-language quality and document translation by applying continued pretraining to a strong open base model with a curated data mix, rather than training from scratch.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist, Engineering Lead, Founder

Summary TLDR

This paper builds AfriqueLLM, a suite of open models continued-pretrained (CPT) on 26B tokens to adapt 5 base LLMs to 20 African languages. The core finding: what you train on matters more than model size. Mixing monolingual African text with code, math, and high-quality synthetic translations (CMS) consistently improves accuracy and reasoning. Qwen 3 bases showed the largest relative gains after CPT (up to +78.8% rel.), and CPT also improved long-context document translation (e.g., +12.4 d-chrF over an SFT baseline). Models and configs will be released on Hugging Face.

Problem Statement

Open LLMs lag on African languages because pretraining corpora lack domain coverage (math, code, curated topical content). Continued pretraining (CPT) can help, but it often degrades reasoning or high-resource language (HRL) performance when the data is imbalanced or noisy. The paper asks: which data mixes and base-model choices yield the best CPT outcomes for African languages?

Main Contribution

AfriqueLLM: CPT-adapted models for 20 African languages using a 26B-token corpus.

Systematic CPT ablation across five base models from three families (Gemma 3, Llama 3.1, Qwen 3) and multiple data mixtures.

Key Findings

CPT data composition is the single strongest driver of gains.

Numbers: The CMS recipe gave the best scores on multiple tasks (e.g., Flores 66.23 at 12B).

Practical Use: Prioritize adding code, math, and curated synthetic translations when adapting an LLM rather than only more monolingual web text.

Evidence Ref: Table 2; Table 3

Adding math and code recovers and improves reasoning degraded by raw web text.

Numbers: On Gemma 4B, AfriMGSM rose from 9.25 to 10.68 (M→CM), with larger jumps in the AfriqueQwen variants.

Practical Use: Include ~1B tokens each of code and math in CPT to restore reasoning priors lost to noisy monolingual data.

Evidence Ref: Table 2; Table 14

Results

Metric	Model (score)	Baseline (score)	Delta	Split / Dataset	Evidence	Evidence Ref
AfroBench overall (combined tasks)	AfriqueQwen-14B (63.79)	Qwen3-14B base (39.88)	+23.91 abs (+60.0% rel)	AfroBench-Lite (see Table 3)	Table 3: overall scores and Δ%	Table 3
AfriMGSM (math)	AfriqueQwen-14B (45.01)	Qwen3-14B base (16.6)	+28.41 abs	AfriMGSM (8-shot CoT)	Table 3: AfriMGSM column	Table 3
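
The absolute and relative deltas reported above follow from simple arithmetic on the score and the baseline. The snippet below is a small sketch for checking such figures; the helper name `deltas` is hypothetical, and only the two scores come from the table.

```python
def deltas(score: float, baseline: float) -> tuple:
    """Return (absolute gain, relative gain in percent) over a baseline score."""
    abs_gain = score - baseline
    rel_gain = 100.0 * abs_gain / baseline
    return round(abs_gain, 2), round(rel_gain, 1)


# AfroBench overall: AfriqueQwen-14B vs. Qwen3-14B base (values from the table).
afrobench_abs, afrobench_rel = deltas(63.79, 39.88)

# AfriMGSM: only the absolute delta is reported in the table.
afrimgsm_abs, _ = deltas(45.01, 16.6)
```

Running the same helper on other rows of the paper's tables is a quick way to confirm that a reported percentage is relative rather than absolute.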

What To Try In 7 Days

Run a short CPT pass on your base model using a CMS mix: monolingual African text + ~1B tokens each of code and math + filtered synthetic translations.

Cap high-resource languages at about 1B tokens using UniMax-like sampling so that English and French do not dominate the mix.

Use a 16k context window if you need document-level capabilities and test with d-chrF or SSA-COMET on representative docs.
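
The steps above can be sketched as a token-budget calculation. This is a minimal illustration and not the paper's pipeline: the function name `build_cms_mix`, the reading of the ≈1B cap as a per-language cap, and every token count in the example are hypothetical placeholders.

```python
from typing import Dict, Set


def build_cms_mix(
    monolingual: Dict[str, float],
    high_resource: Set[str],
    hrl_cap: float = 1.0,
    code_tokens: float = 1.0,
    math_tokens: float = 1.0,
    synthetic_tokens: float = 0.0,
) -> Dict[str, float]:
    """Return each source's share of the CPT budget (inputs in billions of tokens)."""
    budget: Dict[str, float] = {}
    for lang, tokens in monolingual.items():
        # Assumption: the cap applies to each high-resource language separately.
        capped = min(tokens, hrl_cap) if lang in high_resource else tokens
        budget[f"mono:{lang}"] = capped
    budget["code"] = code_tokens
    budget["math"] = math_tokens
    budget["synthetic"] = synthetic_tokens
    total = sum(budget.values())
    return {source: tokens / total for source, tokens in budget.items()}


# Hypothetical corpus sizes, for illustration only.
example_mix = build_cms_mix(
    monolingual={"eng": 50.0, "fra": 20.0, "swa": 3.0, "hau": 2.0},
    high_resource={"eng", "fra"},
    synthetic_tokens=1.0,
)
```

The resulting shares can be used as sampling weights in whatever data loader you already have; the point of the sketch is only that the cap is applied before the shares are normalized.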

Optimization Features

Token Efficiency

UniMax sampling to rebalance languages
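
A minimal sketch of UniMax-style allocation, assuming the commonly described procedure: visit languages from smallest to largest, give each an equal share of the remaining budget, and cap that share at a fixed number of epochs over the language's data. The function name, the `max_epochs` value, and the token counts are illustrative assumptions, not the paper's settings.

```python
from typing import Dict


def unimax_allocation(
    token_counts: Dict[str, float],
    total_budget: float,
    max_epochs: float,
) -> Dict[str, float]:
    """Split total_budget across languages as evenly as possible,
    never repeating any language's data more than max_epochs times."""
    remaining = float(total_budget)
    allocation: Dict[str, float] = {}
    languages = sorted(token_counts, key=token_counts.get)  # smallest first
    for index, lang in enumerate(languages):
        fair_share = remaining / (len(languages) - index)
        epoch_cap = max_epochs * token_counts[lang]
        allocation[lang] = min(fair_share, epoch_cap)
        remaining -= allocation[lang]
    return allocation


# Hypothetical token counts (arbitrary units).
toy_allocation = unimax_allocation(
    {"low": 1.0, "mid": 10.0, "high": 100.0},
    total_budget=30.0,
    max_epochs=2.0,
)
```

In this toy case the smallest language is limited by the epoch cap, and the budget it cannot absorb is shared equally by the two larger ones, which is how the scheme keeps large languages from dominating.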

Infra Optimization

H100 clusters (16 nodes/64 GPUs used in runs)

Model Optimization

Sequence packing for throughput

16k context window tuning
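
Sequence packing can be illustrated with a short greedy routine. This sketch assumes documents are already tokenized into lists of integer IDs and that an EOS token separates documents; letting a document spill across window boundaries is one common choice and may differ from the paper's setup. The function name and the toy inputs are hypothetical.

```python
from typing import List, Tuple


def pack_sequences(
    docs: List[List[int]],
    max_len: int,
    eos_id: int,
) -> Tuple[List[List[int]], List[int]]:
    """Concatenate tokenized docs (EOS-separated) and cut fixed-length windows.

    Returns the full windows plus any leftover tokens that did not fill one.
    """
    packed: List[List[int]] = []
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return packed, buffer


# Toy example with a window of 4 tokens; a 16k window would use max_len=16384.
windows, leftover = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=4, eos_id=0)
```

Every emitted window is exactly `max_len` tokens long, so no compute is spent on padding; the leftover buffer can be carried into the next shard or dropped.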

System Optimization

Mixed precision (bf16) and gradient accumulation

Dynamic gradient accumulation to match hardware
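
One way to read "dynamic gradient accumulation to match hardware" is that the accumulation steps are derived from a fixed global batch size and whatever GPU count is available. The sketch below shows that arithmetic under this assumption; the function name, the global batch of 512 sequences, and the micro-batch of 2 are hypothetical, and only the 64-GPU figure comes from the summary.

```python
def accumulation_steps(global_batch: int, micro_batch_per_gpu: int, num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach global_batch sequences per update."""
    per_step = micro_batch_per_gpu * num_gpus
    if global_batch % per_step != 0:
        raise ValueError(
            f"global_batch={global_batch} is not divisible by "
            f"micro_batch_per_gpu * num_gpus = {per_step}"
        )
    return global_batch // per_step


# Same (hypothetical) global batch on the 64-GPU setup and on a smaller 8-GPU box.
steps_on_64_gpus = accumulation_steps(global_batch=512, micro_batch_per_gpu=2, num_gpus=64)
steps_on_8_gpus = accumulation_steps(global_batch=512, micro_batch_per_gpu=2, num_gpus=8)
```

Keeping the global batch fixed while the step count adapts means learning-rate and scheduler settings can carry over between clusters of different sizes.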

Training Optimization

DeepSpeed ZeRO-1/2 for memory

FlashAttention-3 and Liger kernel for speed

Learning-rate and scheduler ablations (cosine, warmup)
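
For orientation, a DeepSpeed configuration combining ZeRO stage 1 with bf16 might look like the fragment below. It is a sketch under stated assumptions, not the authors' configuration: the batch-size numbers are hypothetical, and FlashAttention-3 and the Liger kernel are set up elsewhere in a training stack, so they do not appear here.

```python
import json

# Hypothetical batch settings; only the 64-GPU world size comes from the summary.
WORLD_SIZE = 64
MICRO_BATCH_PER_GPU = 2
ACCUMULATION_STEPS = 4

ds_config = {
    # DeepSpeed expects train_batch_size to equal micro-batch * accumulation * world size.
    "train_batch_size": MICRO_BATCH_PER_GPU * ACCUMULATION_STEPS * WORLD_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": ACCUMULATION_STEPS,
    "bf16": {"enabled": True},
    # Stage 1 shards optimizer states; stage 2 additionally shards gradients.
    "zero_optimization": {"stage": 1},
}

ds_config_json = json.dumps(ds_config, indent=2)
```

Switching `"stage"` to 2 trades some communication for lower memory use, which matches the "ZeRO-1/2" wording in the summary.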

Inference Optimization

vLLM backend for evaluation

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Covers 20 African languages; many languages remain unsupported.

Model sizes limited to ≤14B; dynamics may change at 30B+.

When Not To Use

If your target language is not in the 20 covered languages (limited transfer to unseen languages).

When instruction-following behavior is required immediately: these are base CPT checkpoints, not instruction-tuned models.

Failure Modes

Catastrophic forgetting on high-resource languages if HRLs are excluded or uncapped.

Quality-sensitive: noisy parallel corpora can harm larger models (12B+).

Core Entities

Models

AfriqueQwen-14B, AfriqueQwen-8B, AfriqueGemma-12B, AfriqueGemma-4B, AfriqueLlama-8B, Qwen 3 8B, Qwen 3 14B, Gemma 3 4B, Gemma 3 12B, Llama 3.1 8B

Metrics

SSA-COMET (MT semantic metric), Accuracy, d-chrF (document chrF), chrF++

Datasets

FineWeb2, WURA, MADLAD-400, CornStack-Python (code), FineMath (math), NLLB-OPUS (parallel), Synthetic GPT-4.1 translations, OpenMathReasoning (math CoT)

Benchmarks

AfroBench, AfroBench-Lite, AfriMGSM, AfriMMLU, AfriXNLI, Flores, Belebele, Injongo, SIB-200, AFRIDOC-MT
