Overview
Well-controlled experiments on AfriBERTa covering pruning, distillation, and two quantization methods, with per-language tables. The results are empirical and limited to NER and a single model family.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 20%
Why It Matters For Business
You can make small-data multilingual models practical on constrained hardware: pruning and quantization cut size and latency substantially while keeping most accuracy, enabling on-device NER for African languages.
Summary TLDR
This paper tests pruning, knowledge distillation, and quantization on AfriBERTa, a multilingual model trained on <1 GB of African-language text. Key practical results: pruning keeps F1 competitive up to ~50–60% sparsity and can reach ≈60% parameter reduction with small drops; distillation shrinks the model by 22–31% with ~1–2% average F1 loss; LLM.int8() quantization cuts model size by ~64% and inference latency by ~52% while often keeping F1 close to the original. Results vary by language: complex or sparse languages degrade faster.
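To make the quantization size saving concrete, here is a minimal sketch of symmetric absmax int8 quantization in plain Python. It is a simplification and not LLM.int8() itself, which additionally keeps outlier feature dimensions in 16-bit precision; the function names and the toy weights are assumptions made for illustration.

```python
def quantize_absmax_int8(weights):
    """Map floats to int8 codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]


# Toy weights (illustrative only, not taken from AfriBERTa).
weights = [0.50, -1.27, 0.03, 0.90]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage per weight: 4 bytes as float32 versus 1 byte as int8.
per_weight_saving = 1 - 1 / 4
```

Going from float32 to int8 saves 75% per quantized weight. The ~64% overall figure quoted above is lower, which would be consistent with some parameters staying in higher precision, although this summary does not give that breakdown.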
Problem Statement
Compression has been studied on large models but not on models trained on very small datasets. For low-resource languages and constrained hardware (the “low-resource double-bind”), we need to know whether pruning, distillation, and quantization still help and how far we can compress without breaking accuracy.
Main Contribution
Systematic evaluation of pruning, distillation, and quantization on AfriBERTa, a model trained on <1 GB of African-language data.
Empirical limits: pruning stable up to ~50–60% sparsity; some languages tolerate extreme pruning, others collapse.
Key Findings
Pruning can cut parameters by ≈60% while keeping usable accuracy.
Distillation achieves 22–31% compression with small average F1 loss.
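As a rough illustration of what a sparsity level means, the sketch below applies unstructured magnitude pruning to a toy weight list. Magnitude pruning is an assumption here, since this summary does not name the pruning criterion, and the helper names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measured_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy weights (illustrative only).
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, 0.02, -0.7]
pruned = magnitude_prune(weights, 0.6)
```

At 60% sparsity only the four largest-magnitude weights survive. The sketch says nothing about when pruning happens relative to fine-tuning, which the findings in this summary suggest matters a great deal.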
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Distillation: avg F1 (teacher vs student) | AfriBERTa-large teacher avg F1 79.05 → distilled student avg ~77.50 | AfriBERTa-large 79.05 avg F1 | −1.55 avg F1 | MasakhaNER NER (10 languages) | Table 2 shows teacher vs student averages | Table 2; §3.1 |
| Pruning: F1 vs. sparsity | F1 stays near the dense model up to 50–60% sparsity; some languages remain competitive above 70% | Dense model F1, per language | Degrades sharply beyond ~70% sparsity when pruning after fine-tuning | MasakhaNER; OOD tests (MasakhaNER 2.0, MSRA) | Figure 1; Appendix D.1; §3.3–3.5 | Fig. 1; §3.3; D.1 |
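The distillation delta in the table can be checked with a few lines of arithmetic. The helper below is a hypothetical convenience function; the two averages are the teacher and student figures from the distillation row.

```python
def f1_delta(baseline_f1, compressed_f1):
    """Return the absolute change in F1 points and the relative change in percent."""
    absolute = compressed_f1 - baseline_f1
    relative_pct = 100.0 * absolute / baseline_f1
    return absolute, relative_pct


# Teacher (AfriBERTa-large) and distilled student averages from the table.
abs_delta, rel_delta = f1_delta(79.05, 77.50)
print(f"absolute: {abs_delta:.2f} F1 points, relative: {rel_delta:.2f}%")
```

The absolute drop of 1.55 points corresponds to a relative drop of just under 2%, in line with the ~1–2% average loss quoted in the summary.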
What To Try In 7 Days
Apply LLM.int8() post-training quantization to your fine-tuned model and measure latency and F1 per language.
Distill a 20–30% smaller student (task-agnostic first) and check avg F1 drop; iterate teacher/student layer ratios.
Prune before fine-tuning up to 50–60% sparsity for high compression with controlled loss; validate per language.
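For the distillation step in the list above, a common starting point is a temperature-scaled KL divergence between teacher and student output distributions. The sketch below assumes that standard soft-target objective; this summary does not state which loss the paper uses, and the function names and default temperature are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Toy logits for one token over three labels (illustrative only).
teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
mismatched_student = [0.1, 0.2, 0.3]

loss_match = distillation_loss(teacher, matching_student)
loss_mismatch = distillation_loss(teacher, mismatched_student)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice it is usually combined with the task loss on gold labels.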
Risks & Boundaries
Limitations
Experiments limited to NER; findings may not generalize to other tasks.
Single model family (AfriBERTa) and a limited set of datasets; other architectures may behave differently.
When Not To Use
Do not apply aggressive pruning (>70%) for complex or very sparse languages without per-language validation.
Avoid dynamic quantization when preserving F1 is critical; prefer LLM.int8().
Failure Modes
Pruning after fine-tuning beyond ~70% can collapse performance to near-zero for some languages (Appendix D.1).
Dynamic quantization can cause substantial F1 drops on several languages (Table 3).
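Because an average can hide a collapse in one language, a per-language gate is a cheap safeguard against both failure modes. The sketch below flags any language whose F1 drop exceeds a threshold; the function name, the 2-point default, and all scores are placeholders chosen for illustration, not results from the paper.

```python
def flag_regressions(baseline_f1, compressed_f1, max_drop=2.0):
    """Return {language: drop} for languages losing more than max_drop F1 points."""
    flagged = {}
    for lang, base in baseline_f1.items():
        drop = base - compressed_f1[lang]
        if drop > max_drop:
            flagged[lang] = round(drop, 2)
    return flagged


# Placeholder scores, NOT results from the paper.
baseline = {"lang_a": 80.0, "lang_b": 75.0, "lang_c": 70.0}
compressed = {"lang_a": 79.0, "lang_b": 60.0, "lang_c": 1.5}

flagged = flag_regressions(baseline, compressed)
print(flagged)
```

Running a check like this after every pruning or quantization change makes a near-zero collapse in a single language visible even when the average still looks acceptable.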

