Pruning, distillation, and quantization make a small-data African language model much cheaper with small accuracy trade-offs

April 6, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.2

Cost Impact Score

0.7

Citation Count

1

Authors

Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole

Links

Abstract / PDF

Why It Matters For Business

You can make small-data multilingual models practical on constrained hardware: pruning and quantization cut size and latency substantially while keeping most accuracy, enabling on-device NER for African languages.

Summary TLDR

This paper tests pruning, knowledge distillation, and quantization on AfriBERTa, a multilingual model trained on <1 GB of African-language text. Key practical results: pruning keeps F1 competitive up to ~50–60% sparsity and can reach a ≈60% parameter reduction with small drops; distillation compresses the model by 22–31% with ~1–2% average F1 loss; LLM.int8() quantization cuts model size by ~64% and inference latency by ~52%, with an average F1 drop of ~4.7% and near-baseline F1 on some languages. Results vary by language: complex or sparse languages degrade faster.

Problem Statement

Compression has been studied on large models but not on models trained on very small datasets. For low-resource languages and constrained hardware (the “low-resource double-bind”), we need to know whether pruning, distillation, and quantization still help and how far we can compress without breaking accuracy.

Main Contribution

Systematic evaluation of pruning, distillation, and quantization on AfriBERTa, a model trained on <1 GB of African-language data.

Empirical limits: pruning stable up to ~50–60% sparsity; some languages tolerate extreme pruning, others collapse.

Practical efficiency numbers: distillation gives 22–31% compression with ~1–2% average F1 loss; LLM.int8() yields ≈64% size cut and ≈52% latency reduction.

Key Findings

Pruning can cut parameters by ≈60% while keeping usable accuracy.

Numbers: ≈60% model size reduction; average F1 still competitive at 60% sparsity (see §3.3, Table 8)

Distillation achieves 22–31% compression with small average F1 loss.

Numbers: 22–31% compression; ~1.3–1.9% avg F1 drop vs. teachers (Table 2, §3.1)

LLM.int8() quantization cuts model size and latency with modest accuracy loss.

Numbers: model size −64.08% and inference time −52.3% (average); avg F1 decrease ~4.7% vs. baseline (Table 3, §3.6)

Dynamic quantization performs worse than LLM.int8() on F1.

Numbers: dynamic quantization produced larger F1 drops (per-language table) than LLM.int8(), whose avg F1 drop was ~4.7% (Table 3)

When you prune matters: before vs after fine-tuning changes robustness.

Numbers: pruning before fine-tuning: steady up to 60% sparsity; pruning after fine-tuning: matches dense up to 50%, then collapses beyond ~70% (Fig. 1, §3.

Results

Distillation: avg F1 (teacher vs student)

Value: AfriBERTa-large teacher avg F1 79.05 → distilled student avg F1 ~77.50

Baseline: AfriBERTa-large, 79.05 avg F1

Pruning: F1 vs. sparsity

Value: F1 stays near the dense model up to 50–60% sparsity; some languages remain competitive above 70%

Baseline: dense model F1, per language

Quantization: model size reduction

Value: LLM.int8() reduces model size by ≈64.08%; dynamic quantization by ≈42.44%

Baseline: original fine-tuned AfriBERTa-large

Quantization: inference latency

Value: LLM.int8() avg inference time −52.3%; dynamic quantization −40.9%

Baseline: CPU inference times of the unquantized model (per-language, Table 4)

Per-language quantization F1 example

Value: Swahili: baseline 87.89 → LLM.int8() 87.93 (similar or slightly better)

Baseline: Swahili F1 87.89
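The percentages quoted in this summary do not always say whether they are relative changes or absolute F1 points. As a small worked example, the sketch below computes both for the teacher/student and Swahili figures listed above. The helper name `relative_change` is introduced here for illustration only; which convention the paper uses is not stated in this summary.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the change from baseline to new as a percentage of baseline."""
    return (new - baseline) / baseline * 100.0

# Figures quoted in the Results section of this summary.
teacher_f1, student_f1 = 79.05, 77.50
swahili_base, swahili_int8 = 87.89, 87.93

distill_points = teacher_f1 - student_f1               # absolute F1 points lost
distill_drop = relative_change(teacher_f1, student_f1)  # relative change in percent
swahili_shift = relative_change(swahili_base, swahili_int8)

print(f"Distillation: {distill_points:.2f} points, {distill_drop:.2f}% relative")
print(f"Swahili LLM.int8(): {swahili_shift:+.2f}% relative")
```

For the distilled student this works out to roughly 1.55 F1 points, or just under 2% relative to the teacher, so the two conventions are close here but not identical.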

Who Should Care

What To Try In 7 Days

Apply LLM.int8() post-training quantization to your fine-tuned model and measure latency and F1 per language.

Distill a 20–30% smaller student (task-agnostic first) and check avg F1 drop; iterate teacher/student layer ratios.

Prune before fine-tuning up to 50–60% sparsity for high compression with controlled loss; validate per language.
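All three experiments above end with the same step: compare per-language F1 before and after compression. A minimal sketch of that check follows. The function name, the 2-point tolerance, and the `lang_a` / `lang_b` scores are hypothetical placeholders; only the Swahili pair comes from the results quoted in this summary.

```python
def flag_regressions(baseline, compressed, max_drop_points=2.0):
    """Return {language: drop} for languages whose F1 fell by more than max_drop_points."""
    flagged = {}
    for lang, base_f1 in baseline.items():
        drop = base_f1 - compressed[lang]
        if drop > max_drop_points:
            flagged[lang] = round(drop, 2)
    return flagged

# Swahili values are from this summary; lang_a and lang_b are made-up placeholders.
baseline_f1 = {"swahili": 87.89, "lang_a": 80.0, "lang_b": 75.0}
compressed_f1 = {"swahili": 87.93, "lang_a": 79.1, "lang_b": 68.5}

flagged = flag_regressions(baseline_f1, compressed_f1)
print(flagged)  # only languages that exceed the tolerance appear here
```

Checking each language separately matters because an acceptable average can hide a single language that has degraded badly.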

Optimization Features

Infra Optimization

  • Quantize to reduce memory footprint for edge deployment

Model Optimization

  • Unstructured magnitude pruning 10–95% sparsity
  • Task-agnostic and task-specific knowledge distillation
  • Post-training quantization (LLM.int8(), dynamic)
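To make "unstructured magnitude pruning" concrete, here is a minimal pure-Python sketch on a flat list of weights: the smallest-magnitude entries are set to zero until the target sparsity is reached. The function name and the toy weights are illustrative assumptions, and this is not the authors' code; on a real model the same idea is usually applied per tensor, for example with PyTorch's `torch.nn.utils.prune.l1_unstructured`.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered by magnitude, smallest first; the first n_prune are masked.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

# Toy weights (hypothetical), pruned to 60% sparsity.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, -0.1, 0.7]
pruned = magnitude_prune(weights, sparsity=0.6)
achieved = pruned.count(0.0) / len(pruned)
print(pruned, achieved)
```

Because the zeros are scattered rather than removed in blocks, unstructured sparsity shrinks a model on disk only when stored in a sparse format, and it speeds up inference only on runtimes that exploit sparsity.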

System Optimization

  • Measure per-language inference times; pruning reduces inference time variably
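A minimal sketch of per-language latency measurement, assuming a callable model and a small set of sentences per language. The `fake_model` stand-in, the placeholder sentences, and the warmup and repeat counts are all hypothetical; swap in your own inference call and evaluation data.

```python
import statistics
import time

def median_latency_ms(fn, inputs, warmup=2, repeats=5):
    """Median per-input latency in milliseconds over several timed passes."""
    for _ in range(warmup):          # untimed passes to warm caches
        for x in inputs:
            fn(x)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.append(elapsed_ms / len(inputs))
    return statistics.median(timings)

def fake_model(sentence):
    """Hypothetical stand-in for a real NER inference call."""
    return [len(token) for token in sentence.split()]

# Placeholder inputs; replace with real evaluation sentences per language.
sentences_by_language = {
    "swahili": ["placeholder sentence one", "placeholder sentence two"],
    "lang_b": ["another placeholder sentence"],
}
latencies = {
    lang: median_latency_ms(fake_model, sents)
    for lang, sents in sentences_by_language.items()
}
print(latencies)
```

Using the median over several passes makes the figure less sensitive to one-off stalls than a single timed run.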

Training Optimization

  • Distillation pretrain+fine-tune and task-specific fine-tune distillation
  • Prune-before-finetune vs prune-after-finetune comparisons
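For readers new to distillation, the sketch below shows the common soft-target objective for a single token: a temperature-scaled KL term that pulls the student toward the teacher, mixed with the usual cross-entropy on the gold label. This is the generic formulation, and the temperature and `alpha` defaults are assumptions; this summary does not specify the exact loss the authors used.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Toy logits over three NER tags (hypothetical values).
teacher = [2.0, 1.0, 0.1]
matching = distillation_loss(teacher, teacher, true_label=0, alpha=1.0)
diverging = distillation_loss([0.1, 1.0, 2.0], teacher, true_label=0, alpha=1.0)
print(matching, diverging)
```

With `alpha=1.0` only the KL term remains, so a student that reproduces the teacher's logits gets a loss of zero and any disagreement raises it.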

Inference Optimization

  • LLM.int8() to reduce latency ~52%
  • Accuracy
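The core idea behind 8-bit quantization can be sketched with absmax scaling: each row of weights is mapped to integers in [-127, 127] using a single scale factor, then multiplied back at inference time. Note the assumption: this is a simplified illustration, not LLM.int8() itself, which additionally keeps outlier feature dimensions in higher precision. The function names and the toy row are hypothetical.

```python
def quantize_absmax_int8(row):
    """Map floats to integers in [-127, 127] using one absmax scale per row."""
    absmax = max(abs(x) for x in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes."""
    return [v * scale for v in quantized]

# Toy weight row (hypothetical values).
row = [0.02, -1.27, 0.5, 0.64]
q, scale = quantize_absmax_int8(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

Storing one byte per weight instead of four gives at most a 75% reduction for the quantized tensors. The ≈64% figure reported above would be consistent with some parameters staying in higher precision, though this summary does not break that down.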

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments limited to NER; findings may not generalize to other tasks.
  • Single model family (AfriBERTa) and datasets; other architectures may behave differently.
  • Language-specific behaviour varies; complex or sparse languages degrade faster.

When Not To Use

  • Do not apply aggressive pruning (>70%) for complex or very sparse languages without per-language validation.
  • Avoid dynamic quantization when preserving F1 is critical; prefer LLM.int8().
  • Don't assume distillation will preserve performance on tasks other than NER without testing.

Failure Modes

  • Pruning after fine-tuning beyond ~70% can collapse performance to near-zero for some languages (Appendix D.1).
  • Dynamic quantization can cause substantial F1 drops on several languages (Table 3).
  • Zero-shot or cross-lingual transfer degrades sharply past certain sparsity thresholds.
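Since collapse sets in at different sparsity levels for different languages, one practical guard is to sweep sparsity per language and keep the highest level that stays within a tolerance of the dense F1. The sketch below assumes you already have such a sweep. The function name, the 2-point tolerance, and every number in `sweep` are hypothetical placeholders, not results from the paper.

```python
def max_safe_sparsity(f1_by_sparsity, dense_f1, tolerance_points=2.0):
    """Highest sparsity reached before F1 first falls more than the tolerance below dense."""
    safe = None
    for sparsity in sorted(f1_by_sparsity):
        if dense_f1 - f1_by_sparsity[sparsity] <= tolerance_points:
            safe = sparsity
        else:
            break  # stop at the first sparsity level that violates the tolerance
    return safe

# Hypothetical sweep for one language: sparsity -> F1 after pruning.
sweep = {0.1: 85.0, 0.3: 84.6, 0.5: 84.1, 0.6: 83.4, 0.7: 71.0, 0.8: 12.0, 0.9: 0.5}
limit = max_safe_sparsity(sweep, dense_f1=85.2)
print(limit)
```

Stopping at the first violation, rather than taking the maximum over all passing levels, avoids selecting a sparsity that happens to recover after an earlier collapse.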

Core Entities

Models

  • AfriBERTa-base
  • AfriBERTa-large
  • Distilled AfriBERTa variants

Metrics

  • F1

Datasets

  • AfriBERTa corpus (0.91 GB)
  • MasakhaNER
  • MasakhaNER 2.0
  • MSRA NER

Context Entities

Models

  • mBERT
  • XLM-R

Metrics

  • inference time