Overview
The approach is practical and repeatable, but results are modest, sensitive to translation/ranking quality, and stronger on general-knowledge tasks than on specialized benchmarks.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
If you need multilingual chat or QA features, investing in translated instructions and RLHF can yield modest but measurable accuracy gains (roughly 1.6 to 1.8 points on the ARC and HellaSwag averages) and broader language coverage while keeping models open-source.
Summary TLDR
Okapi builds multilingual instruction datasets, ranked responses, and translated benchmarks in 26 languages, and uses supervised fine-tuning (SFT) plus reinforcement learning from human feedback (RLHF) to produce multilingual LLMs. The authors generate 158K English instructions (52K Alpaca + 106K new), translate them, obtain 42K ranked responses per language, and train BLOOM-7B and LLaMA-7B variants. On the ARC and HellaSwag benchmarks, RLHF consistently improves over SFT (up to ~2.5 percentage points on HellaSwag with LLaMA). Gains are smaller on MMLU and on low-resource languages. All datasets, prompts, and models are released on GitHub.
Problem Statement
Most open-source instruction-tuned LLMs focus on English. There is little work applying RLHF (reward-based fine-tuning) to many languages because ranking data is scarce. That limits access and real-world usefulness for non-English speakers.
Main Contribution
Created a multilingual instruction corpus: 158K English instructions (52K from Alpaca + 106K newly generated), which were then translated into 26 languages.
Produced ranked-response data (42K ranked examples per language) by translating examples to English and using ChatGPT as the ranker; used these to train reward models.
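A ranked list of responses is usually converted into pairwise preferences before training a reward model. The following is a minimal sketch of that conversion, not the authors' pipeline: the function name `ranking_to_pairs` and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions, and the example strings are made up.

```python
from itertools import combinations


def ranking_to_pairs(prompt, responses_best_first):
    """Expand one ranking (best response first) into pairwise preferences.

    Hypothetical helper: every response is paired with each response ranked
    below it, so k ranked responses give k * (k - 1) / 2 pairs.
    """
    pairs = []
    # combinations() keeps input order, so `better` always outranks `worse`.
    for better, worse in combinations(responses_best_first, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Made-up example with three ranked responses.
example_pairs = ranking_to_pairs(
    "Say hello in French.",
    ["Bonjour !", "Salut.", "Hola."],
)
for pair in example_pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Pairing every response with all lower-ranked ones extracts the most signal from each ranking, at the cost of correlated training examples from the same prompt.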
Key Findings
RLHF improves multilingual instruction-following over SFT on average.
RLHF delivers larger gains on commonsense and basic-knowledge tasks than on professional knowledge tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | RLHF 30.0 | SFT 28.4 | +1.6 | ARC average across 26 languages | Table 3 average rows | Table 3 |
| Accuracy | RLHF 39.5 | SFT 37.7 | +1.8 | HellaSwag average across 26 languages | Table 4 average rows | Table 4 |
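The Delta column can be re-derived from the Value and Baseline columns. The sketch below recomputes it from the two rows above and also derives a relative gain, which is not reported in the table and is included only for illustration; the helper names are assumptions.

```python
# Values copied from the two table rows above (average accuracy across languages).
rows = [
    {"dataset": "ARC", "rlhf": 30.0, "sft": 28.4, "reported_delta": 1.6},
    {"dataset": "HellaSwag", "rlhf": 39.5, "sft": 37.7, "reported_delta": 1.8},
]


def delta_points(rlhf, sft):
    """Absolute gain in accuracy points, rounded to one decimal like the table."""
    return round(rlhf - sft, 1)


def relative_gain_pct(rlhf, sft):
    """Gain relative to the SFT baseline, in percent (derived, not reported)."""
    return round(100.0 * (rlhf - sft) / sft, 1)


for row in rows:
    computed = delta_points(row["rlhf"], row["sft"])
    relative = relative_gain_pct(row["rlhf"], row["sft"])
    matches = computed == row["reported_delta"]
    print(row["dataset"], computed, relative, matches)
```

The deltas are absolute accuracy points, so they should not be read as relative improvements over the SFT baseline.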
What To Try In 7 Days
Translate a small set (10–20k) of your English instructions into your target languages, using a translator whose output you have spot-checked in those languages.
Generate 3–5 responses per prompt, then use an automatic ranker (or a small human panel) to collect preference labels.
Train a reward model and run PPO for a few epochs while freezing most layers; compare SFT vs RLHF on a held-out multilingual benchmark subset.
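For the reward-model step, a common objective is the pairwise ranking loss, -log(sigmoid(r_chosen - r_rejected)), which pushes the preferred response's score above the rejected one's. The summary does not state which loss Okapi uses, so treat this as an assumed, framework-free sketch; `pairwise_reward_loss` and `mean_pairwise_loss` are hypothetical names, and the scores are made-up numbers standing in for reward-model outputs.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)).

    Uses the identity -log(sigmoid(m)) = log(1 + exp(-m)) and branches on the
    sign of the margin so that exp() never receives a large positive argument.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(score_pairs):
    """Average the loss over (score_chosen, score_rejected) tuples."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    total = sum(pairwise_reward_loss(c, r) for c, r in score_pairs)
    return total / len(score_pairs)


# Made-up scores: the first pair is ordered correctly, the second is not.
batch = [(2.0, 0.5), (-1.0, 0.3)]
print(mean_pairwise_loss(batch))
```

In a real run the scores would come from the reward model and the loss would be minimized with an autodiff framework; the plain-Python version is only meant to make the objective concrete.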
Risks & Boundaries
Limitations
Only 26 target languages; many world languages missing.
Base models limited to BLOOM-7B and LLaMA-7B; larger/smaller scales not evaluated.
When Not To Use
When you need rigorously human-curated instruction data in a given language (ChatGPT translations may be noisy).
For high-stakes or safety-critical domains without human verification of outputs.
Failure Modes
Translation errors or inconsistencies from ChatGPT that propagate to training data.
Ranker bias from using ChatGPT as the preference judge instead of humans.
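One way to probe ranker bias is to have humans rank a small sample and measure how often the automatic ranker orders response pairs the same way. The sketch below is an assumed check, not something described in the source: `pairwise_agreement` is a hypothetical helper, and the response IDs are placeholders.

```python
from itertools import combinations


def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of response pairs that two rankings (best first) order alike.

    Only items present in both rankings are compared. 1.0 means identical
    ordering, 0.0 means fully reversed.
    """
    position_a = {item: index for index, item in enumerate(ranking_a)}
    position_b = {item: index for index, item in enumerate(ranking_b)}
    shared = [item for item in ranking_a if item in position_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two shared items to compare")
    same_order = sum(
        (position_a[x] < position_a[y]) == (position_b[x] < position_b[y])
        for x, y in pairs
    )
    return same_order / len(pairs)


# Placeholder IDs for three responses to one prompt.
automatic_ranking = ["resp_1", "resp_2", "resp_3"]
human_ranking = ["resp_1", "resp_3", "resp_2"]
print(pairwise_agreement(automatic_ranking, human_ranking))
```

Low agreement on the human-labeled sample for a given language would suggest checking both the translations and the ranker before training a reward model on that language's data.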

