EthioLLM: open multilingual LLMs and a new EthioBenchmark for five Ethiopian languages plus English

Overview

Decision Snapshot: Needs Validation

Models and datasets are open-sourced and show competitive results on many classification tasks, but language coverage is limited to five languages and MT quality lags SOTA, so treat them as strong research and pilot assets rather than turnkey production systems.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 6/8

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, Dietrich Klakow, Shengwu Xiong, Seid Muhie Yimam


Why It Matters For Business

EthioLLM and EthioBenchmark make practical NLP for major Ethiopian languages possible with open models and data, lowering development time for local products like moderation, news categorization, and information extraction.

Who Should Care

ML Engineer, Product Manager, Founder, Data Scientist

Summary TLDR

This paper releases EthioLLM, a family of encoder-only and encoder-decoder language models (small, base, and large) that support five Ethiopian languages (Amharic, Ge'ez, Afaan Oromo, Somali, Tigrinya) plus English, and EthioBenchmark, a set of merged datasets for news classification, machine translation (MT), hate speech, sentiment, NER, and POS tagging. The models were trained from XLM-R and mT5 checkpoints, with focused data cleaning and long training runs. On classification, NER, and hate-speech tasks, EthioLLM models are competitive with or exceed Afro-centric baselines. Machine translation lags behind larger SOTA MT models. All models, tokenizers, and benchmark data are open-sourced.

Problem Statement

Ethiopian languages are underrepresented in large language models and benchmarks. Ethiopia has more than 85 languages, but few of them have pre-trained models or cross-language datasets. This gap blocks practical NLP for Ethiopian languages and slows local research and products.

Main Contribution

Released EthioLLM models (small/base/large) covering five Ethiopian languages and English.

Built EthioBenchmark by merging existing datasets into task-specific collections: EthioNEWS, EthioMT, EthioHate, EthioSenti, EthioNER, EthioPOS.

Key Findings

EthioLLM-large achieves competitive or better results on news classification for Amharic.

Numbers: MasakhaNEWS Amharic weighted F1: EthioLLM-large 94.18 vs XLM-R 93.1

Practical Use: Use EthioLLM-large as a drop-in model for Amharic news classification when you need a compact local model competitive with general multilingual LMs.

Evidence Ref: Table 3

EthioLLM-large outperforms prior models on Amharic NER.

Numbers: MasakhaNER Amharic F1: EthioLLM-large 79.42 vs AfroXLMR 78.0

Practical Use: Prefer EthioLLM-large for Amharic NER tasks where named-entity extraction quality matters.

Evidence Ref: Table 8

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
weighted F1 (news, MasakhaNEWS, amh)	94.18 (EthioLLM-large)	94.4 (AfroXLMR-large)	-0.22	MasakhaNEWS	Table 3 - EthioLLM-large 94.18; AfroXLMR-l 94.4	Table 3
weighted F1 (NER, amh)	79.42 (EthioLLM-large)	78.0 (AfroXLMR-large)	+1.42	MasakhaNER	Table 8 - EthioLLM-large 79.42; AfroXLMR-l 78	Table 8
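
The Delta column is the EthioLLM score minus the baseline score. A minimal Python sketch that recomputes it from the two rows above; the `rows` list and the `delta` helper are illustrative names, not something taken from the paper or its repository.

```python
# Scores copied from the Results table; variable names are illustrative only.
rows = [
    ("weighted F1 (news, MasakhaNEWS, amh)", 94.18, 94.4),
    ("weighted F1 (NER, amh)", 79.42, 78.0),
]


def delta(value: float, baseline: float) -> float:
    """Return value minus baseline, rounded to two decimals like the table."""
    return round(value - baseline, 2)


# Map each metric name to its recomputed delta.
deltas = {name: delta(value, baseline) for name, value, baseline in rows}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

A negative delta means the baseline scored higher, as in the news row, where AfroXLMR-large is ahead of EthioLLM-large.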

What To Try In 7 Days

Install EthioLLM-small from the repo and test Amharic news classification on your data.

Run EthioLLM-large for Amharic NER to see entity extraction improvements vs your current pipeline.

Evaluate EthioLLM-large for Amharic hate-speech detection in a moderation pilot and compare F1 to your baseline.
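
For the comparisons suggested above, one way to score a candidate model against your current pipeline is to compute weighted F1 on the same labeled sample. The sketch below is a plain-Python version of the usual support-weighted F1 and assumes you already have gold labels and both systems' predictions as lists of strings. The function names and the toy labels (`hate`, `normal`) are hypothetical and do not come from the EthioLLM repository.

```python
from collections import Counter


def weighted_f1(gold: list[str], pred: list[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's gold count."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)      # gold examples per class
    predicted = Counter(pred)    # predictions per class
    correct = Counter(g for g, p in zip(gold, pred) if g == p)  # true positives
    total = 0.0
    for label, n_gold in support.items():
        tp = correct[label]
        if tp == 0:
            continue  # F1 is 0 for a class with no true positives
        precision = tp / predicted[label]
        recall = tp / n_gold
        total += n_gold * (2 * precision * recall / (precision + recall))
    return total / len(gold)


def compare(gold: list[str], baseline_pred: list[str], candidate_pred: list[str]) -> dict:
    """Score both systems on the same gold labels and report the difference."""
    base = weighted_f1(gold, baseline_pred)
    cand = weighted_f1(gold, candidate_pred)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Toy data for illustration only; replace with your own labeled sample.
gold = ["hate", "hate", "normal", "normal", "normal", "normal"]
baseline_pred = ["hate", "normal", "normal", "normal", "normal", "normal"]
candidate_pred = list(gold)

report = compare(gold, baseline_pred, candidate_pred)
print(report)
```

Scoring both systems on an identical sample keeps the delta comparable to the ones reported in the Results table, though a small pilot set will give noisy estimates.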

Optimization Features

Token Efficiency

Accuracy

Model Optimization

vocabulary tuning (70K and 250K tokenizers)

model variants: small/base/large sizes

Training Optimization

language-adaptive fine-tuning (LAFT)

long training runs (up to 1M steps for seq2seq)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/EthioNLP/EthioLLM

Data URLs

https://github.com/EthioNLP/EthioLLM

EthioNLP HuggingFace repository (paper)

Risks & Boundaries

Limitations

Covers only 5 of >85 Ethiopian languages due to scarce corpora.

Ge'ez training data is tiny (≈1M tokens), limiting model strength for that language.

When Not To Use

As a production-grade MT system where high BLEU is required.

For languages not included in the five-language set.
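
A pilot can enforce the language boundary above with a small guard in front of the model. The sketch below assumes inputs arrive tagged with ISO 639-3 language codes. The `SUPPORTED` table, the `route` function, and the `"ethiollm"` / `"fallback"` labels are hypothetical names for illustration, not part of the EthioLLM release.

```python
# ISO 639-3 codes for the six languages listed in the summary.
# This table and the routing labels are illustrative assumptions.
SUPPORTED = {
    "amh": "Amharic",
    "gez": "Ge'ez",
    "orm": "Afaan Oromo",
    "som": "Somali",
    "tir": "Tigrinya",
    "eng": "English",
}


def route(lang_code: str) -> str:
    """Send covered languages to EthioLLM and everything else to a fallback."""
    code = lang_code.strip().lower()
    return "ethiollm" if code in SUPPORTED else "fallback"


for code in ("amh", "tir", "swa"):
    print(code, "->", route(code))
```

Ge'ez is in the covered set, but given its small training corpus it may still warrant separate validation before routing real traffic to it.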

Failure Modes

Poor MT quality compared to NLLB/M2M100 for many language pairs (large BLEU gaps).

Lower performance on languages with little training data (Ge'ez); zero-shot may still fail on domain shifts.

Core Entities

Models

EthioLLM-small, EthioLLM-base, EthioLLM-large, EthioMT5-small, XLM-R, AfroXLMR, AfroLM, AfriTeVa, AfriMT5, M2M100, NLLB

Metrics

weighted F1, sacreBLEU

Datasets

EthioBenchmark, EthioNEWS, EthioMT, EthioHate, EthioSenti, EthioNER, EthioPOS, MasakhaNEWS, MasakhaNER, AfriSenti, Flores-200, HornMT

Benchmarks

MasakhaNEWS, MasakhaNER, AfriSenti, EthioBenchmark

