Overview
Production Readiness
0.6
Novelty Score
0.2
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
You can make small-data multilingual models practical on constrained hardware: pruning and quantization cut size and latency substantially while keeping most accuracy, enabling on-device NER for African languages.
Summary TLDR
This paper tests pruning, knowledge distillation, and quantization on AfriBERTa, a multilingual model trained on <1 GB of African-language text. Key practical results: pruning keeps F1 competitive up to ~50–60% sparsity and can reach ≈60% parameter reduction with small drops; distillation compresses the model by 22–31% with ~1–2% average F1 loss; LLM.int8() quantization cuts model size by ~64% and inference latency by ~52% while often keeping F1 close to the original. Results vary by language: complex or sparse languages degrade faster.
Problem Statement
Compression has been studied on large models but not on models trained on very small datasets. For low-resource languages and constrained hardware (the “low-resource double-bind”), we need to know whether pruning, distillation, and quantization still help and how far we can compress without breaking accuracy.
Main Contribution
Systematic evaluation of pruning, distillation, and quantization on AfriBERTa, a model trained on <1 GB of African-language data.
Empirical limits: pruning is stable up to ~50–60% sparsity; some languages tolerate extreme pruning, while others collapse.
Practical efficiency numbers: distillation gives 22–31% compression with ~1–2% average F1 loss; LLM.int8() yields ≈64% size cut and ≈52% latency reduction.
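To make these percentages concrete, the sketch below converts a size or latency reduction into the fraction that remains and the implied speedup. The 400 MB checkpoint size is a hypothetical figure chosen only for illustration; it is not reported in the paper.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the original quantity left after a percentage cut."""
    return 1.0 - reduction_pct / 100.0


def speedup(latency_reduction_pct: float) -> float:
    """Speedup factor implied by a latency reduction (old time / new time)."""
    return 1.0 / remaining_fraction(latency_reduction_pct)


# Hypothetical checkpoint size, for illustration only.
original_mb = 400.0
quantized_mb = original_mb * remaining_fraction(64)  # ~64% size cut
latency_speedup = speedup(52)                        # ~52% latency cut

print(f"{quantized_mb:.0f} MB after quantization, {latency_speedup:.2f}x faster")
```

Read this way, a ≈52% latency reduction corresponds to roughly a 2x speedup, and a ≈64% size cut leaves a little over a third of the original footprint.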
Key Findings
Pruning can cut parameters by ≈60% while keeping usable accuracy.
Distillation achieves 22–31% compression with small average F1 loss.
LLM.int8() quantization cuts model size and latency with modest accuracy loss.
Dynamic quantization performs worse than LLM.int8() on F1.
Pruning timing matters: pruning before versus after fine-tuning changes robustness.
Results
Distillation: avg F1 (teacher vs student)
Accuracy
Quantization: model size reduction
≈64% (LLM.int8())
Quantization: inference latency
≈52% lower (LLM.int8())
Per-language quantization F1 example
Who Should Care
What To Try In 7 Days
Apply LLM.int8() post-training quantization to your fine-tuned model and measure latency and F1 per language.
Distill a 20–30% smaller student (task-agnostic first) and check the average F1 drop; iterate on teacher/student layer ratios.
Prune before fine-tuning up to 50–60% sparsity for high compression with controlled loss; validate per language.
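Each step above ends with per-language validation. A minimal sketch of that check follows, assuming you already have per-language F1 scores for the original and the compressed model. The function name, the 2-point tolerance, and all scores are hypothetical placeholders, not results from the paper.

```python
def flag_regressions(baseline_f1: dict, compressed_f1: dict, max_drop: float = 2.0) -> dict:
    """Return languages whose F1 dropped by more than max_drop points."""
    flagged = {}
    for lang, base in baseline_f1.items():
        drop = base - compressed_f1[lang]
        if drop > max_drop:
            flagged[lang] = round(drop, 2)
    return flagged


# Placeholder scores for illustration only (not taken from the paper).
baseline = {"hau": 90.0, "swa": 88.0, "yor": 80.0}
compressed = {"hau": 89.5, "swa": 87.0, "yor": 74.0}

regressions = flag_regressions(baseline, compressed)
print(regressions)
```

Reporting the flagged languages rather than only the average keeps a large drop on one language from being hidden by small drops on the others.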
Optimization Features
Infra Optimization
- Quantize to reduce memory footprint for edge deployment
Model Optimization
- Unstructured magnitude pruning at 10–95% sparsity
- Task-agnostic and task-specific knowledge distillation
- Post-training quantization (LLM.int8(), dynamic)
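For readers unfamiliar with unstructured magnitude pruning, here is a minimal pure-Python sketch of the idea: zero out the fraction of weights with the smallest absolute values. It operates on a flat list rather than real model tensors, and it is an illustration of the general technique, not the paper's implementation. The weight values are made up.

```python
def magnitude_prune(weights: list, sparsity: float) -> list:
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:k])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Made-up weights for illustration.
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08, 0.6, -0.15]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

In a PyTorch workflow, `torch.nn.utils.prune.l1_unstructured` applies the same criterion to a module parameter; whether the paper used that utility is not stated here.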
System Optimization
- Measure per-language inference times; pruning reduces inference time by varying amounts
Training Optimization
- Task-agnostic distillation (distill during pretraining, then fine-tune) and task-specific distillation (distill during fine-tuning)
- Prune-before-finetune vs prune-after-finetune comparisons
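The distillation setups above share one core idea: the student is trained to match the teacher's softened output distribution. The sketch below shows a common generic formulation (temperature-scaled KL divergence mixed with cross-entropy on the gold label) for a single token. The `temperature` and `alpha` values are assumptions for illustration, and this is not claimed to be the exact loss used in the paper.

```python
import math


def softmax(logits: list, temperature: float = 1.0) -> list:
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, gold_index,
                      temperature: float = 2.0, alpha: float = 0.5) -> float:
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    soft = kl * temperature ** 2                       # soft-target term
    hard = -math.log(softmax(student_logits)[gold_index])  # gold-label term
    return alpha * soft + (1 - alpha) * hard


# Toy logits over three entity tags; the gold tag is index 2.
teacher = [1.0, 2.0, 3.0]
loss_match = distillation_loss([1.0, 2.0, 3.0], teacher, gold_index=2)
loss_mismatch = distillation_loss([3.0, 2.0, 1.0], teacher, gold_index=2)
print(loss_match, loss_mismatch)
```

A student that agrees with the teacher incurs only the gold-label term, while a student that ranks the tags in the opposite order is penalised by both terms.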
Inference Optimization
- LLM.int8() reduces inference latency by ~52%
- Accuracy
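As background on what int8 quantization does to the weights, the sketch below shows absmax quantization of one weight row: scale by the largest magnitude, round to integers in [-127, 127], and multiply back by the scale at inference time. This is only the basic building block. LLM.int8() additionally keeps outlier feature dimensions in higher precision, which this sketch does not attempt. The weight values are made up.

```python
def absmax_quantize(row: list):
    """Quantize a row of floats to int8 range using absmax scaling."""
    scale = max(abs(x) for x in row) / 127.0
    if scale == 0.0:
        return [0] * len(row), 1.0  # all-zero row: any scale works
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize(q: list, scale: float) -> list:
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


# Made-up weight row for illustration.
row = [0.5, -1.27, 0.02, 1.0]
q, scale = absmax_quantize(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, so rows with one very large value lose precision on all the small ones. That is the motivation for treating outliers separately.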
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments limited to NER; findings may not generalize to other tasks.
- Single model family (AfriBERTa) and datasets; other architectures may behave differently.
- Language-specific behaviour varies; complex or sparse languages degrade faster.
When Not To Use
- Do not apply aggressive pruning (>70%) for complex or very sparse languages without per-language validation.
- Avoid dynamic quantization when preserving F1 is critical; prefer LLM.int8().
- Don't assume distillation will preserve performance on tasks other than NER without testing.
Failure Modes
- Pruning after fine-tuning beyond ~70% can collapse performance to near-zero for some languages (Appendix D.1).
- Dynamic quantization can cause substantial F1 drops on several languages (Table 3).
- Zero-shot or cross-lingual transfer degrades sharply past certain sparsity thresholds.
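Because collapse happens past a language-dependent threshold, it can help to sweep sparsity levels and record the highest one that stays within tolerance. The following is a minimal sketch assuming you have already measured F1 at each sparsity level for one language. The function name, the tolerance, and every score are hypothetical placeholders, not figures from the paper.

```python
def max_safe_sparsity(f1_by_sparsity: dict, baseline_f1: float, max_drop: float = 2.0):
    """Highest sparsity reached before F1 first falls more than max_drop below baseline."""
    safe = None
    for sparsity in sorted(f1_by_sparsity):
        if baseline_f1 - f1_by_sparsity[sparsity] <= max_drop:
            safe = sparsity
        else:
            break  # stop at the first level that violates the tolerance
    return safe


# Placeholder sweep for one language (illustrative values only).
sweep = {0.1: 89.8, 0.3: 89.5, 0.5: 88.6, 0.7: 61.0, 0.9: 2.0}
threshold = max_safe_sparsity(sweep, baseline_f1=90.0)
print(threshold)
```

Running the sweep separately for each language, and for pruning before and after fine-tuning, gives a per-language ceiling instead of a single global sparsity setting.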
Core Entities
Models
- AfriBERTa-base
- AfriBERTa-large
- Distilled AfriBERTa variants
Metrics
- F1
Datasets
- AfriBERTa corpus (0.91 GB)
- MasakhaNER
- MasakhaNER 2.0
- MSRA NER
Context Entities
Models
- mBERT
- XLM-R
Metrics
- inference time

