EthioLLM: open multilingual LLMs and a new EthioBenchmark for five Ethiopian languages plus English

March 20, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, Dietrich Klakow, Shengwu Xiong, Seid Muhie Yimam

Why It Matters For Business

EthioLLM and EthioBenchmark make practical NLP for major Ethiopian languages possible with open models and data, lowering development time for local products like moderation, news categorization, and information extraction.

Summary TLDR

This paper releases EthioLLM, a family of encoder-only and encoder-decoder LLMs (small/base/large) trained to support five Ethiopian languages (Amharic, Ge'ez, Afaan Oromo, Somali, Tigrinya) plus English, and EthioBenchmark, a set of merged datasets for news classification, machine translation (MT), hate speech, sentiment, named entity recognition (NER) and part-of-speech (POS) tagging. Models were trained from XLM-R and mT5 checkpoints, with focused data cleaning and long training runs. On news classification, NER and hate-speech tasks, EthioLLM models are competitive with or exceed Afro-centric baselines. Machine translation lags behind larger SOTA MT models. All models, tokenizers and benchmark data are open-sourced.

Problem Statement

Ethiopian languages are underrepresented in large language models and benchmarks. There are many spoken languages but few pre-trained models and cross-language datasets. This gap blocks practical NLP for Ethiopian languages and slows local research and products.

Main Contribution

Released EthioLLM models (small/base/large) covering five Ethiopian languages and English.

Built EthioBenchmark by merging existing datasets into task-specific collections: EthioNEWS, EthioMT, EthioHate, EthioSenti, EthioNER, EthioPOS.

Evaluated models across news classification, MT, hate speech, sentiment, NER, and POS and reported baselines.

Open-sourced models, tokenizers, training corpus and benchmark datasets on GitHub/HuggingFace.
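The merging step behind EthioBenchmark can be pictured with a small sketch. This is not the authors' pipeline: the record schema (`text`, `label`, `lang`), the source names, and the duplicate rule (same language and same normalized text) are all assumptions made for illustration.

```python
import unicodedata

def normalize(text):
    """NFC-normalize and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def merge_sources(sources):
    """Merge {source_name: [record, ...]} into one list with a shared schema.

    Each record is a dict with 'text', 'label' and 'lang' keys (assumed schema).
    Exact duplicates (same language, same normalized text) keep the first copy.
    """
    seen = set()
    merged = []
    for source_name, records in sources.items():
        for rec in records:
            text = normalize(rec["text"])
            key = (rec["lang"], text)
            if not text or key in seen:
                continue  # skip empty rows and repeats
            seen.add(key)
            merged.append({"text": text, "label": rec["label"],
                           "lang": rec["lang"], "source": source_name})
    return merged

# Hypothetical toy inputs; real sources would be the existing public datasets.
sources = {
    "corpus_a": [
        {"text": "ሰላም  ዓለም", "label": "positive", "lang": "amh"},
        {"text": "good news", "label": "positive", "lang": "eng"},
    ],
    "corpus_b": [
        {"text": "ሰላም ዓለም", "label": "positive", "lang": "amh"},  # repeat after normalization
        {"text": "bad news", "label": "negative", "lang": "eng"},
    ],
}

merged = merge_sources(sources)
print(len(merged), "records after merging")
```

Keeping a `source` field on every record makes it possible to check later whether results differ by originating dataset.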

Key Findings

EthioLLM-large achieves competitive or better results on news classification for Amharic.

Numbers: MasakhaNEWS Amharic weighted F1: EthioLLM-large 94.18 vs XLM-R 93.1

EthioLLM-large outperforms prior models on Amharic NER.

Numbers: MasakhaNER Amharic F1: EthioLLM-large 79.42 vs AfroXLMR 78.0

Hate speech detection improved noticeably with EthioLLM-large.

Numbers: EthioHate weighted F1: Amharic 73.54 (EthioLLM-large) vs 67.73 (AfroXLMR)

Part-of-speech tagging for Amharic is strong with the large model.

Numbers: EthioPOS Amharic weighted F1: 90.36 (EthioLLM-large)

Machine translation quality is lower than large SOTA MT models.

Numbers: sacreBLEU eng->amh: EthioMT5-S 17.0 vs M2M100 37.6

Zero-shot transfer to Ge'ez shows promising results despite tiny Ge'ez corpus.

Numbers: EthioNER Ge'ez (zero-shot from Amharic) F1: 74.84 (EthioLLM-large)

Results

weighted F1 (news, MasakhaNEWS, amh)

Value: 94.18 (EthioLLM-large)

Baseline: 94.4 (AfroXLMR-large)

weighted F1 (NER, amh)

Value: 79.42 (EthioLLM-large)

Baseline: 78.0 (AfroXLMR-large)

weighted F1 (hate speech, amh)

Value: 73.54 (EthioLLM-large)

Baseline: 67.73 (AfroXLMR-large)

weighted F1 (hate speech, orm)

Value: 87.28 (EthioLLM-large)

Baseline: 83.87 (AfroXLMR-large)

weighted F1 (POS, amh)

Value: 90.36 (EthioLLM-large)

weighted F1 (sentiment, tir)

Value: 91.09 (EthioLLM-small on EthioSenti)

Baseline: 89.24 (EthioLLM-base)

sacreBLEU (eng->amh)

Value: 17.0 (EthioMT5-small)

Baseline: 37.6 (M2M100/NLLB)

weighted F1 (NER zero-shot, gez from amh)

Value: 74.84 (EthioLLM-large)
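To make the size of each gap explicit, the short script below subtracts the baseline from the reported value for the rows that list an external baseline. The numbers are copied from the results above; the dictionary keys and the `gap` helper are illustrative names only.

```python
# (value, baseline) pairs copied from the reported results above.
results = {
    "news F1, amh": (94.18, 94.4),
    "NER F1, amh": (79.42, 78.0),
    "hate speech F1, amh": (73.54, 67.73),
    "hate speech F1, orm": (87.28, 83.87),
    "sacreBLEU, eng->amh": (17.0, 37.6),
}

def gap(value, baseline):
    """Difference (value - baseline) in points, rounded to 2 decimals."""
    return round(value - baseline, 2)

gaps = {task: gap(v, b) for task, (v, b) in results.items()}

for task, g in gaps.items():
    # A negative number means the EthioLLM model is below the baseline.
    print(f"{task:22s} {g:+.2f}")
```

Read this way, the classification-style tasks sit within a few points of the baseline in either direction (from -0.22 on news to +5.81 on Amharic hate speech), while the translation gap is about 20 sacreBLEU points.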

What To Try In 7 Days

Install EthioLLM-small from the repo and test Amharic news classification on your data.

Run EthioLLM-large for Amharic NER to see entity extraction improvements vs your current pipeline.

Evaluate EthioLLM-large for Amharic hate-speech detection in a moderation pilot and compare F1 to your baseline.

Optimization Features

Token Efficiency

  • Accuracy

Model Optimization

  • vocabulary tuning (70K and 250K tokenizers)
  • model variants: small/base/large sizes

Training Optimization

  • language-adaptive fine-tuning (LAFT)
  • long training runs (up to 1M steps for seq2seq)

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Covers only 5 of >85 Ethiopian languages due to scarce corpora.
  • Ge'ez training data is tiny (≈1M tokens), limiting model strength for that language.
  • Machine translation quality is clearly below large SOTA MT models.
  • Benchmark combines existing datasets; heterogeneity and source overlap may bias results despite cleaning.

When Not To Use

  • As a production-grade MT system where high BLEU is required.
  • For languages not included in the five-language set.
  • When legal/auditable model provenance is required and full license details are needed.

Failure Modes

  • Poor MT quality compared to NLLB/M2M100 for many language pairs (large BLEU gaps).
  • Lower performance on languages with little training data (Ge'ez); zero-shot may still fail on domain shifts.
  • Potential dataset leakage if downstream data overlaps with the pretraining corpus; the authors state they verified this, but caution remains.
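A quick way to screen your own evaluation data for the leakage risk mentioned above is an exact-match overlap check against whatever portion of the pretraining corpus you can access. The sketch below is an assumption-laden illustration, not the authors' verification procedure: the normalization (lowercasing, whitespace collapsing) and the function names are hypothetical choices.

```python
import hashlib

def fingerprint(text):
    """SHA-256 of the lowercased, whitespace-collapsed text."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def leaked_indices(eval_texts, pretraining_texts):
    """Indices of eval examples whose fingerprint also occurs in the pretraining texts."""
    seen = {fingerprint(t) for t in pretraining_texts}
    return [i for i, t in enumerate(eval_texts) if fingerprint(t) in seen]

# Hypothetical toy data.
pretraining = [
    "The market opened early today.",
    "Rain is expected in Addis Ababa.",
]
evaluation = [
    "rain is  expected in Addis Ababa.",  # same sentence, different case and spacing
    "A new school opened in Jimma.",
]

overlap = leaked_indices(evaluation, pretraining)
print(f"{len(overlap)} of {len(evaluation)} eval examples also appear in pretraining data")
```

This only catches copies that are identical after normalization. Paraphrases, truncated passages, and near-duplicates would need something like n-gram overlap or fuzzy matching.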

Core Entities

Models

  • EthioLLM-small
  • EthioLLM-base
  • EthioLLM-large
  • EthioMT5-small
  • XLM-R
  • AfroXLMR
  • AfroLM
  • AfriTeVa
  • AfriMT5
  • M2M100
  • NLLB

Metrics

  • weighted F1
  • sacreBLEU

Datasets

  • EthioBenchmark
  • EthioNEWS
  • EthioMT
  • EthioHate
  • EthioSenti
  • EthioNER
  • EthioPOS
  • MasakhaNEWS
  • MasakhaNER
  • AfriSenti
  • Flores-200
  • HornMT

Benchmarks

  • MasakhaNEWS
  • MasakhaNER
  • AfriSenti
  • EthioBenchmark