Okapi: first open-source RLHF instruction-tuned LLMs across 26 languages

July 29, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and repeatable, but results are modest, sensitive to translation/ranking quality, and stronger on general-knowledge tasks than on specialized benchmarks.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

Why It Matters For Business

If you need multilingual chat or QA features, investing in translated instructions and RLHF can yield measurable accuracy gains and broader language coverage while keeping models open-source.

Summary TLDR

Okapi builds multilingual instruction datasets, ranked responses, and translated benchmarks in 26 languages, then uses supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) to produce multilingual LLMs. The authors generate 158K English instructions (52K from Alpaca plus 106K new), translate them, obtain 42K ranked responses per language, and train BLOOM-7B and LLaMA-7B variants. On the ARC and HellaSwag benchmarks, RLHF consistently improves over SFT (up to ~2.5 percentage points on HellaSwag with LLaMA). Gains are smaller on MMLU and for low-resource languages. All datasets, prompts, and models are released on GitHub.

Problem Statement

Most open-source instruction-tuned LLMs focus on English. There is little work applying RLHF (reward-based fine-tuning) to many languages because ranking data is scarce. That limits access and real-world usefulness for non-English speakers.

Main Contribution

Created a multilingual instruction corpus: 158K English instructions (52K Alpaca + 106K newly generated), then translated into 26 languages.

Produced ranked-response data (42K ranked examples per language) by translating examples to English and using ChatGPT as the ranker; used these to train reward models.
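As a rough illustration of how ranked responses can feed a reward model, the sketch below expands one best-to-worst ranking into preference pairs and scores them with a pairwise logistic loss. The loss form is a common choice for reward models and is an assumption here, not something the summary confirms for Okapi; the function names, response labels, and scores are all hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair ranks higher.
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Return -log(sigmoid(score_preferred - score_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical ranking (best first) and hypothetical reward-model scores.
ranked = ["response_a", "response_b", "response_c"]
scores = {"response_a": 1.5, "response_b": 0.2, "response_c": -0.7}

pairs = ranking_to_pairs(ranked)
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(len(pairs), "pairs, mean loss", round(mean_loss, 3))
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, which is the behavior a ranking-trained reward model needs.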

Key Findings

RLHF improves multilingual instruction-following over SFT on average.

Numbers: BLOOM average accuracy: SFT 28.4 -> RLHF 30.0 (+1.6)

Practical Use: If you can build ranked feedback, expect modest but consistent performance gains from RLHF versus SFT on multilingual tasks like ARC and HellaSwag.

Evidence Ref: Table 3/4/5 averages

RLHF delivers larger gains on commonsense and basic-knowledge tasks than on professional knowledge tasks.

Numbers: LLaMA HellaSwag: SFT 47.1 -> RLHF 49.6 (+2.5); LLaMA MMLU: SFT 30.1 -> RLHF 30.8 (+0.7)

Practical Use: Use RLHF when your product targets general knowledge or commonsense tasks; expect smaller benefits for specialized professional domains.

Evidence Ref: Tables 7 and 8

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | RLHF 30.0 | SFT 28.4 | +1.6 | ARC average across 26 languages | Table 3 average rows | Table 3
Accuracy | RLHF 39.5 | SFT 37.7 | +1.8 | HellaSwag average across 26 languages | Table 4 average rows | Table 4
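The deltas in these rows are absolute percentage points. A small sketch, using only the averages reported above, checks the arithmetic and also expresses each gain relative to the SFT baseline; the helper names are illustrative.

```python
# Reported averages from the results above: (benchmark, SFT, RLHF, reported delta).
reported = [
    ("ARC", 28.4, 30.0, 1.6),
    ("HellaSwag", 37.7, 39.5, 1.8),
]


def delta_points(baseline, value):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_gain(baseline, value):
    """Gain as a fraction of the baseline score."""
    return (value - baseline) / baseline


for name, sft, rlhf, claimed in reported:
    # Each recomputed delta should match the delta stated in the table.
    assert delta_points(sft, rlhf) == claimed
    print(f"{name}: +{delta_points(sft, rlhf)} points, "
          f"{relative_gain(sft, rlhf):.1%} relative to SFT")
```

Relative to the SFT baseline, the gains work out to roughly 5.6% on ARC and 4.8% on HellaSwag, which matches the "modest but consistent" framing.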

What To Try In 7 Days

Translate a small set (10–20k) of your English instructions into target languages via a tested translator.

Generate 3–5 responses per prompt, then use an automatic ranker (or a small human panel) to collect preference labels.

Train a reward model and run PPO for a few epochs while freezing most layers; compare SFT vs RLHF on a held-out multilingual benchmark subset.
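The layer freezing in the last step can be sketched without committing to a specific training library. The example below picks which parameters stay trainable by layer index; the default of four layers mirrors the "top-4 layers" setting listed under Optimization Features. The parameter naming scheme (`layers.<index>.…`) and the choice to keep embeddings and the output head frozen are assumptions for illustration.

```python
import re

# Assumes parameter names embed the block index as ".<number>.", e.g. "layers.7.mlp.weight".
LAYER_PATTERN = re.compile(r"\.(\d+)\.")


def layer_index(param_name):
    """Return the layer index found in a parameter name, or None if there is none."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def trainable_names(param_names, top_k=4):
    """Select the parameters that belong to the top_k highest-indexed layers."""
    if top_k <= 0:
        return []
    indices = sorted({i for i in map(layer_index, param_names) if i is not None})
    keep = set(indices[-top_k:])
    # Parameters without a layer index (embeddings, output head) stay frozen here.
    return [name for name in param_names if layer_index(name) in keep]


# Hypothetical 8-layer model.
names = ["embed.weight"] + [f"layers.{i}.mlp.weight" for i in range(8)] + ["lm_head.weight"]
selected = trainable_names(names, top_k=4)
print(selected)

# With a PyTorch model, one could then iterate model.named_parameters() and set
# param.requires_grad = (name in selected) before starting PPO.
```

Training only a few top layers keeps memory and compute low during PPO, at the cost of less capacity to adapt the model.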

Agent Features

Frameworks: PPO
Architectures: decoder-only Transformer

Optimization Features

System Optimization: freeze most model layers during PPO; train only the top 4 layers for RLHF
Training Optimization: SFT; reward-model training from ranked pairs; PPO-based RLHF with KL penalty
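To make "PPO-based RLHF with KL penalty" concrete, here is a minimal sketch of one common way to shape rewards: each token is penalized by the log-probability gap between the policy and a frozen reference model, and the reward-model score is added on the final token. This per-token formulation and the `beta` value are assumptions borrowed from common practice, not details confirmed for Okapi.

```python
def kl_penalized_rewards(policy_logprobs, reference_logprobs, reward_model_score, beta=0.1):
    """Per-token rewards: -beta * (log pi - log pi_ref), plus the score on the last token."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Penalize tokens where the policy assigns more probability than the reference.
    rewards = [-beta * (p - r) for p, r in zip(policy_logprobs, reference_logprobs)]
    if rewards:
        # The reward model scores the whole response, so its score lands on the final token.
        rewards[-1] += reward_model_score
    return rewards


# Hypothetical log-probs for a two-token response.
shaped = kl_penalized_rewards([-1.0, -0.5], [-1.2, -0.5], reward_model_score=1.0, beta=0.1)
print(shaped)
```

A larger `beta` keeps the tuned model closer to the SFT starting point; a smaller one lets the reward model dominate.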

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Only 26 target languages; many world languages missing.

Base models limited to BLOOM-7B and LLaMA-7B; larger/smaller scales not evaluated.

When Not To Use

When you need rigorously human-curated instruction data in a language (ChatGPT translations may be noisy).

For high-stakes or safety-critical domains without human verification of outputs.

Failure Modes

Translation errors or inconsistencies from ChatGPT that propagate to training data.

Ranker bias from using ChatGPT as the preference judge instead of humans.

Core Entities

Models

BLOOM-7B, LLaMA-7B, BLOOMZ

Metrics

Accuracy

Datasets

Alpaca (52K), Okapi generated instructions (106K), ARC (translated), HellaSwag (translated), MMLU (translated), Self-Instruct

Benchmarks

ARC, HellaSwag, MMLU

Context Entities

Datasets

CommonCrawl (for language resource categorization)