Data mix (math, code, synthetic) plus the right base model beats scale for African-language CPT

January 10, 2026 · 8 min read

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Hao Yu, Tianyi Xu, Michael A. Hedderich, Wassim Hamidouche, Syed Waqas Zamir, David Ifeoluwa Adelani

Why It Matters For Business

You can substantially improve African-language quality and document translation through continued pretraining of a strong open base model on a curated data mix, instead of training from scratch.

Summary TLDR

This paper builds AfriqueLLM, a suite of open models produced by continued pretraining (CPT) of five base LLMs on a 26B-token corpus covering 20 African languages. The core finding: what you train on matters more than model size. Mixing monolingual African text with code, math, and high-quality synthetic translations (the CMS recipe) consistently improves accuracy and reasoning. Qwen 3 bases showed the largest relative gains after CPT (up to +78.8% rel.), and CPT also improved long-context document translation (e.g., +12.4 d-chrF over an SFT baseline). Models and configs will be released on Hugging Face.

Problem Statement

Open LLMs lag on African languages because pretraining corpora lack domain coverage (math, code, curated topical content). Continued pretraining can help, but it often degrades reasoning or high-resource language (HRL) performance when the data is imbalanced or noisy. The paper asks: which data mixes and base-model choices yield the best CPT outcomes for African languages?

Main Contribution

AfriqueLLM: CPT-adapted models for 20 African languages using a 26B-token corpus.

Systematic CPT ablation across five base models from three families (Gemma 3, Llama 3.1, Qwen 3) and multiple data mixtures.

Practical recipe (CMS = Monolingual + Code + Math + high-quality Synthetic translations) that preserves reasoning and boosts translation.

Empirical finding that base-model capability and data composition outweigh raw parameter scale for CPT gains.

Demonstrated improved long-context document translation without in-domain fine-tuning.

Key Findings

CPT data composition is the single strongest driver of gains.

Numbers: The CMS recipe gave the best scores on multiple tasks (e.g., Flores 66.23 at 12B).

Adding math and code recovers and improves reasoning degraded by raw web text.

Numbers: On Gemma 4B, AfriMGSM rose from 9.25 to 10.68 (M→CM), with larger jumps in the AfriqueQwen variants.

Strong base-model capability matters more than prior multilingual coverage for CPT.

Numbers: AfriqueQwen-8B overall 59.28 vs Qwen3-8B base 33.16 (∆ +26.1 abs, +78.8% rel.).

For larger models, high-quality synthetic translations help more than noisy parallel data does.

Numbers: At 12B, CMS outperformed CMSP; adding NLLB parallel data harmed 12B performance.

CPT improves long-context document translation without task-specific fine-tuning.

Numbers: eng→xx d-chrF: AfriqueGemma-12B 60.2 vs SFT baseline 47.8 (+12.4).

Results

AfroBench overall (combined tasks)

Value: AfriqueQwen-14B 63.79

Baseline: Qwen3-14B base 39.88

AfriMGSM (math)

Value: AfriqueQwen-14B 45.01

Baseline: Qwen3-14B base 16.6

Document-level translation (eng→xx) d-chrF

Value: AfriqueGemma-12B 60.2

Baseline: Llama 3.1 SFT-10 baseline 47.8

Relative improvement from CPT (example)

Value: AfriqueQwen-8B +78.8% rel overall

Baseline: Qwen3-8B base
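
The relative figure follows directly from the overall scores quoted in this summary. A minimal sketch of the arithmetic, where `gain` is a hypothetical helper name introduced only for illustration:

```python
def gain(base: float, adapted: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of adapted over base."""
    absolute = adapted - base
    relative = 100.0 * absolute / base
    return absolute, relative

# Overall scores as quoted in this summary.
abs_8b, rel_8b = gain(33.16, 59.28)    # Qwen3-8B base -> AfriqueQwen-8B
abs_14b, rel_14b = gain(39.88, 63.79)  # Qwen3-14B base -> AfriqueQwen-14B

print(f"8B:  +{abs_8b:.2f} abs, +{rel_8b:.1f}% rel")
print(f"14B: +{abs_14b:.2f} abs, +{rel_14b:.1f}% rel")
```

For the 8B pair this reproduces the reported +26.1 absolute and +78.8% relative improvement.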

Who Should Care

What To Try In 7 Days

Run a short CPT pass on your base model using a CMS mix: monolingual African text + ~1B tokens each of code and math + filtered synthetic translations.

Cap high-resource languages with UniMax-style sampling (≈1B tokens) so that English and French do not dominate the mix.

Use a 16k context window if you need document-level capabilities and test with d-chrF or SSA-COMET on representative docs.
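
One way to express the CMS mix from the first step is as per-source token budgets that are normalized into sampling weights. The sketch below is illustrative only: the ~1B-token code and math budgets come from the text, while the monolingual and synthetic budgets are placeholder assumptions (chosen so the total equals the 26B-token corpus size), and the function and key names are hypothetical.

```python
def mixture_weights(token_budgets: dict[str, float]) -> dict[str, float]:
    """Normalize per-source token budgets into sampling probabilities."""
    total = sum(token_budgets.values())
    if total <= 0:
        raise ValueError("token budgets must sum to a positive number")
    return {name: tokens / total for name, tokens in token_budgets.items()}

# Budgets in billions of tokens.
cms_budgets = {
    "monolingual": 20.0,  # placeholder assumption, not a figure from the paper
    "code": 1.0,          # ~1B tokens, per the text
    "math": 1.0,          # ~1B tokens, per the text
    "synthetic": 4.0,     # placeholder assumption, not a figure from the paper
}

weights = mixture_weights(cms_budgets)
for name, weight in weights.items():
    print(f"{name:12s} {weight:.3f}")
```

Keeping the mix as explicit budgets makes it easy to ablate one source at a time, which is how the summary describes the paper's comparisons between mixtures.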

Optimization Features

Token Efficiency

  • UniMax sampling to rebalance languages
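
UniMax-style rebalancing spreads a token budget as evenly as possible across languages while capping how often any small corpus is repeated. A minimal sketch of that allocation as it is commonly described; the language names, token counts, budget, and epoch cap are made-up illustrations, not the paper's settings:

```python
def unimax_allocation(token_counts: dict[str, float], budget: float,
                      max_epochs: float) -> dict[str, float]:
    """Split `budget` tokens across languages, smallest corpus first.

    Each language receives an equal share of the remaining budget unless that
    would repeat its corpus more than `max_epochs` times; any unused share is
    redistributed to the larger languages that follow.
    """
    allocation = {}
    remaining = budget
    ordered = sorted(token_counts, key=token_counts.get)
    for i, lang in enumerate(ordered):
        fair_share = remaining / (len(ordered) - i)
        cap = token_counts[lang] * max_epochs
        allocation[lang] = min(fair_share, cap)
        remaining -= allocation[lang]
    return allocation

# Toy corpus sizes (arbitrary units); all values are hypothetical.
toy_counts = {"lang_small": 10, "lang_mid": 100, "lang_large": 1000}
alloc = unimax_allocation(toy_counts, budget=300, max_epochs=2)
print(alloc)
```

In this toy case the smallest language is capped at 20 tokens (two epochs), and the remaining 280 is split evenly between the other two languages, so the largest corpus does not dominate.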

Infra Optimization

  • H100 clusters (16 nodes/64 GPUs used in runs)

Model Optimization

  • sequence packing for throughput
  • 16k context window tuning
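
Sequence packing concatenates tokenized documents, separated by an end-of-sequence token, and slices the stream into fixed-length training sequences so that little of the context window is spent on padding. A minimal sketch, assuming plain concatenate-and-chunk packing; the paper's exact packing strategy (for example, whether attention is masked across document boundaries) is not described in this summary, and `eos_id` is a placeholder value:

```python
def pack_sequences(docs: list[list[int]], max_len: int,
                   eos_id: int) -> tuple[list[list[int]], list[int]]:
    """Pack token-id lists into full sequences of exactly `max_len` tokens.

    Returns the packed sequences plus any leftover tokens that did not fill
    a complete sequence.
    """
    packed = []
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return packed, buffer

# Toy example with a 4-token window; a 16k window would use max_len=16384.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed, leftover = pack_sequences(toy_docs, max_len=4, eos_id=0)
print(packed, leftover)
```

Note that a document can be split across two packed sequences, as the third toy document is here; packers that avoid such splits trade some throughput for cleaner boundaries.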

System Optimization

  • Mixed precision bf16 and gradient accumulation
  • dynamic gradient accumulation to match hardware
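
One common reading of matching gradient accumulation to the hardware is to hold the global batch size fixed and derive the number of accumulation steps from the GPU count and the per-GPU micro-batch. The sketch below assumes that interpretation; the batch sizes are hypothetical, and only the 64-GPU figure comes from the text.

```python
import math

def accumulation_steps(global_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach at least `global_batch` sequences."""
    sequences_per_step = micro_batch * num_gpus
    return max(1, math.ceil(global_batch / sequences_per_step))

# Hypothetical target of 512 sequences per optimizer step, 2 per GPU.
steps_64_gpus = accumulation_steps(global_batch=512, micro_batch=2, num_gpus=64)
steps_8_gpus = accumulation_steps(global_batch=512, micro_batch=2, num_gpus=8)
print(steps_64_gpus, steps_8_gpus)
```

With fewer GPUs the same effective batch is reached by accumulating over more micro-steps, which keeps optimizer behavior comparable across cluster sizes.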

Training Optimization

  • DeepSpeed ZeRO-1/2 for memory
  • FlashAttention-3 and Liger kernel for speed
  • learning-rate and scheduler ablations (cosine, warmup)
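
The cosine-with-warmup schedule named in the ablations can be written as a function of the training step. This is a generic sketch of that schedule shape, not the paper's configuration: the step counts and peak learning rate below are hypothetical.

```python
import math

def cosine_with_warmup(step: int, total_steps: int, warmup_steps: int,
                       peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Use step + 1 so the very first update has a non-zero learning rate.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings: 110 total steps, 10 warmup steps, peak LR of 1.0.
schedule = [cosine_with_warmup(s, 110, 10, 1.0) for s in range(111)]
print(schedule[0], schedule[10], schedule[60], schedule[110])
```

The learning rate peaks at the end of warmup, reaches half of the peak midway through the decay phase, and ends at `min_lr`.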

Inference Optimization

  • vLLM backend for evaluation

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Covers 20 African languages; many languages remain unsupported.
  • Model sizes limited to ≤14B; dynamics may change at 30B+.
  • Focus on base-model CPT only; no instruction tuning was performed.
  • Larger models show sensitivity to noisy parallel data and hyperparameter heuristics.

When Not To Use

  • If your target language is not in the 20 covered languages (limited transfer to unseen languages).
  • When instruction-following behavior is required immediately: these are base CPT checkpoints, not instruction-tuned models.
  • When you must avoid any HRL degradation and cannot afford even small drops in English/French performance.

Failure Modes

  • Catastrophic forgetting on high-resource languages if HRLs are excluded or uncapped.
  • Quality-sensitive: noisy parallel corpora can harm larger models (12B+).
  • Limited transfer: CPT benefits mostly languages included in the mixture, not unseen languages.

Core Entities

Models

  • AfriqueQwen-14B
  • AfriqueQwen-8B
  • AfriqueGemma-12B
  • AfriqueGemma-4B
  • AfriqueLlama-8B
  • Qwen 3 8B
  • Qwen 3 14B
  • Gemma 3 4B
  • Gemma 3 12B
  • Llama 3.1 8B

Metrics

  • SSA-COMET (MT semantic metric)
  • Accuracy
  • d-chrF (document chrF)
  • chrF++

Datasets

  • FineWeb2
  • WURA
  • MADLAD-400
  • CornStack-Python (Code)
  • FineMath (Math)
  • NLLB-OPUS (Parallel)
  • Synthetic GPT-4.1 translations
  • OpenMathReasoning (math CoT)

Benchmarks

  • AfroBench
  • AfroBench-Lite
  • AfriMGSM
  • AfriMMLU
  • AfriXNLI
  • Flores
  • Belebele
  • Injongo
  • SIB-200
  • AFRIDOC-MT