Okapi: first open-source RLHF instruction-tuned LLMs across 26 languages

Overview

Decision Snapshot: Needs Validation

The approach is practical and repeatable, but results are modest, sensitive to translation/ranking quality, and stronger on general-knowledge tasks than on specialized benchmarks.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need multilingual chat or QA features, investing in translated instructions and RLHF can yield measurable accuracy gains and broader language coverage while keeping models open-source.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

Okapi builds multilingual instruction datasets, ranked-response data, and translated benchmarks in 26 languages, then applies supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) to produce multilingual LLMs. The authors generate 158K English instructions (52K from Alpaca plus 106K new), translate them, obtain 42K ranked responses per language, and train BLOOM-7B and LLaMA-7B variants. On the ARC and HellaSwag benchmarks, RLHF consistently improves over SFT (up to ~2.5 percentage points on HellaSwag with LLaMA). Gains are smaller on MMLU and on low-resource languages. All datasets, prompts, and models are released on GitHub.

Problem Statement

Most open-source instruction-tuned LLMs focus on English. There is little work applying RLHF (reward-based fine-tuning) to many languages because ranking data is scarce. That limits access and real-world usefulness for non-English speakers.

Main Contribution

Created a multilingual instruction corpus: 158K English instructions (52K from Alpaca plus 106K newly generated), then translated into 26 languages.

Produced ranked-response data (42K ranked examples per language) by translating examples to English and using ChatGPT as the ranker; used these to train reward models.
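A minimal sketch of how ranked responses can become a reward-model training signal. The summary states only that reward models were trained from ranked responses; the pairwise (Bradley-Terry style) loss below is a common choice and is an assumption here, not a confirmed detail of Okapi. The function names `ranking_to_pairs` and `pairwise_loss` are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(responses_best_first):
    """Expand one ranking (best response first) into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return list(combinations(responses_best_first, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Negative log-sigmoid of the score margin (assumed loss, not from the source)."""
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and equals log 2 when both scores are identical.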

Key Findings

RLHF improves multilingual instruction-following over SFT on average.

Numbers: BLOOM average accuracy, SFT 28.4 -> RLHF 30.0 (+1.6)

Practical Use: If you can build ranked feedback, expect modest but consistent performance gains from RLHF versus SFT on multilingual tasks like ARC and HellaSwag.

Evidence Ref: Tables 3, 4, and 5 (averages)

RLHF delivers larger gains on commonsense and basic-knowledge tasks than on professional knowledge tasks.

Numbers: LLaMA HellaSwag, SFT 47.1 -> RLHF 49.6 (+2.5); LLaMA MMLU, SFT 30.1 -> RLHF 30.8 (+0.7)

Practical Use: Use RLHF when your product targets general knowledge or commonsense tasks; expect smaller benefits for specialized professional domains.

Evidence Ref: Tables 7 and 8

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	RLHF 30.0	SFT 28.4	+1.6	ARC average across 26 languages	Table 3 average rows	Table 3
Accuracy	RLHF 39.5	SFT 37.7	+1.8	HellaSwag average across 26 languages	Table 4 average rows	Table 4
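The deltas in the table can be rechecked from the reported RLHF and SFT values. The snippet below is only a sanity check over the two rows shown above; the helper `delta_matches` and its tolerance are hypothetical.

```python
# (dataset, RLHF accuracy, SFT accuracy, reported delta), copied from the table above.
ROWS = [
    ("ARC", 30.0, 28.4, 1.6),
    ("HellaSwag", 39.5, 37.7, 1.8),
]


def delta_matches(rlhf, sft, reported, tol=0.05):
    """True if the reported delta equals RLHF minus SFT within a rounding tolerance."""
    return abs((rlhf - sft) - reported) <= tol


def relative_gain(rlhf, sft):
    """Improvement of RLHF over SFT as a fraction of the SFT score."""
    return (rlhf - sft) / sft


for name, rlhf, sft, reported in ROWS:
    status = "ok" if delta_matches(rlhf, sft, reported) else "mismatch"
    print(f"{name}: delta {rlhf - sft:.1f} ({status}), relative {relative_gain(rlhf, sft):.1%}")
```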

What To Try In 7 Days

Translate a small set (10–20K) of your English instructions into target languages via a tested translator.

Generate 3–5 responses per prompt, then use an automatic ranker (or a small human panel) to collect preference labels.

Train a reward model and run PPO for a few epochs while freezing most layers; compare SFT vs RLHF on a held-out multilingual benchmark subset.
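The final comparison step above can be sketched as a per-language accuracy report. This assumes multiple-choice benchmark records of the form `(language, predicted_choice, gold_choice)`; that record layout and all function names are illustrative assumptions, not part of the Okapi release.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Accuracy (in percent) per language from (language, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, pred, gold in records:
        total[lang] += 1
        correct[lang] += int(pred == gold)
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


def macro_average(per_language):
    """Unweighted mean over languages, so small languages count as much as large ones."""
    return sum(per_language.values()) / len(per_language)


def rlhf_minus_sft(sft_records, rlhf_records):
    """Per-language accuracy delta for languages present in both runs."""
    sft = accuracy_by_language(sft_records)
    rlhf = accuracy_by_language(rlhf_records)
    return {lang: rlhf[lang] - sft[lang] for lang in sft if lang in rlhf}
```

Reporting the per-language deltas next to the macro average makes it easier to spot languages where RLHF does not help, which the summary suggests is more likely for low-resource languages.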

Agent Features

Frameworks

PPO

Architectures

decoder-only Transformer

Optimization Features

System Optimization

freeze most model layers during PPO; train top-4 layers only for RLHF
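A minimal sketch of selecting which parameters stay trainable when only the top 4 layers are updated. It works on parameter names alone and assumes a naming scheme in which each Transformer block appears as `layers.<index>.` in the name; real checkpoints use different prefixes, so the pattern is an assumption to adapt. Treating embeddings, final norms, and output heads as frozen is also an assumption of this sketch.

```python
import re

# Assumed naming scheme: "...layers.<index>.<rest>". Adjust for the actual model.
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def trainable_names(param_names, top_k=4):
    """Return the names of parameters that belong to the last `top_k` blocks."""
    indexed = []
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match:
            indexed.append((int(match.group(1)), name))
    if not indexed:
        return set()
    last = max(idx for idx, _ in indexed)
    # Keep blocks with index in (last - top_k, last]; everything else is frozen.
    return {name for idx, name in indexed if idx > last - top_k}
```

With PyTorch, the resulting set could be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = name in keep` for each parameter.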

Training Optimization

SFT

Reward-model training from ranked pairs

PPO-based RLHF with KL penalty
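A minimal sketch of what a KL penalty typically does to the PPO reward: the reward-model score is reduced in proportion to how far the policy's log-probabilities drift from the frozen SFT reference. The summary names only "PPO-based RLHF with KL penalty"; the sequence-level sampled estimate and the coefficient `beta` below are assumptions, and `kl_penalized_reward` is a hypothetical name.

```python
def kl_penalized_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times the summed per-token log-ratio.

    policy_logprobs / reference_logprobs: log-probabilities that the current
    policy and the frozen reference model assign to the same sampled tokens.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have the same length")
    # Sampled estimate of the KL term: sum over tokens of log(pi / pi_ref).
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * log_ratio
```

A larger `beta` keeps the policy closer to the SFT model at the cost of smaller reward gains; the value used in practice is a tuning choice.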

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/nlp-uoregon/Okapi

Data URLs

https://github.com/nlp-uoregon/Okapi

Risks & Boundaries

Limitations

Only 26 target languages; many world languages missing.

Base models limited to BLOOM-7B and LLaMA-7B; larger/smaller scales not evaluated.

When Not To Use

When you need rigorously human-curated instruction data in a language (ChatGPT translations may be noisy).

For high-stakes or safety-critical domains without human verification of outputs.

Failure Modes

Translation errors or inconsistencies from ChatGPT that propagate to training data.

Ranker bias from using ChatGPT as the preference judge instead of humans.

Core Entities

Models

BLOOM-7B, LLaMA-7B, BLOOMZ

Metrics

Accuracy

Datasets

Alpaca (52K), Okapi generated instructions (106K), ARC (translated), HellaSwag (translated), MMLU (translated), Self-Instruct

