Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
If you need multilingual chat or QA features, investing in translated instructions and RLHF can yield measurable accuracy gains and broader language coverage while keeping models open-source.
Summary TLDR
Okapi builds multilingual instruction datasets, ranked responses, and translated benchmarks in 26 languages and uses supervised fine-tuning (SFT) plus reinforcement learning from human feedback (RLHF) to produce multilingual LLMs. The authors generate 158K English instructions (52K from Alpaca plus 106K new), translate them, obtain 42K ranked responses per language, and train BLOOM-7B and LLaMA-7B variants. On the ARC and HellaSwag benchmarks, RLHF consistently improves over SFT (up to ~2.5 percentage points on HellaSwag with LLaMA). Gains are smaller on MMLU and on low-resource languages. All datasets, prompts, and models are released on GitHub.
Problem Statement
Most open-source instruction-tuned LLMs focus on English. There is little work applying RLHF (reward-based fine-tuning) to many languages because ranking data is scarce. That limits access and real-world usefulness for non-English speakers.
Main Contribution
Created a multilingual instruction corpus: 158K English instructions (52K from Alpaca plus 106K newly generated), which were then translated into 26 languages.
Produced ranked-response data (42K ranked examples per language) by translating examples to English and using ChatGPT as the ranker; used these to train reward models.
Built and released RLHF and SFT fine-tuned models (BLOOM-7B and LLaMA-7B), and translated three evaluation benchmarks (ARC, HellaSwag, MMLU) into 26 languages; experiments show that RLHF outperforms SFT on most tasks and languages.
Key Findings
RLHF improves multilingual instruction-following over SFT on average.
RLHF delivers larger gains on commonsense and basic-knowledge tasks than on professional knowledge tasks.
Low-resource languages see the smallest gains from instruction tuning and RLHF.
Okapi's instruction set beats a large cross-lingual baseline on many tasks.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Translate a small set (10–20k) of your English instructions into target languages via a tested translator.
Generate 3–5 responses per prompt, use an automatic ranker (or small human panel) to collect preference labels.
Train a reward model and run PPO for a few epochs while freezing most layers; compare SFT vs RLHF on a held-out multilingual benchmark subset.
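The ranking step above can be turned into reward-model training signal roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes each prompt comes with a best-first ranked list of responses and uses a Bradley-Terry style pairwise loss, a common choice that the summary does not confirm. The function names `ranking_to_pairs` and `pairwise_loss` are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(responses_best_first):
    """Expand one ranked list (best first) into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of every pair
    is always the higher-ranked response.
    """
    return list(combinations(responses_best_first, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Assumed Bradley-Terry style loss: -log(sigmoid(preferred - rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pairs = ranking_to_pairs(["response_a", "response_b", "response_c"])
    print(pairs)
    # The loss shrinks as the reward model scores the preferred response higher.
    print(pairwise_loss(2.0, 0.0), pairwise_loss(0.0, 2.0))
```

A ranking of k responses yields k(k-1)/2 pairs, so 3 to 5 responses per prompt give 3 to 10 training pairs each.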
Agent Features
Frameworks
- PPO
Architectures
- decoder-only Transformer
Optimization Features
System Optimization
- freeze most model layers during PPO; train top-4 layers only for RLHF
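One way to decide which parameters stay trainable under this scheme is to filter by layer index. The sketch below is framework-free and rests on assumptions: the `layers.<index>.` naming pattern and the helper `select_trainable` are hypothetical, and treating embeddings and the output head as frozen is a guess that the summary does not confirm.

```python
import re

# Hypothetical naming scheme: transformer blocks appear as "layers.<index>."
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def select_trainable(param_names, num_layers, top_k=4):
    """Return the names of parameters that belong to the top_k layers.

    Everything else (lower layers, embeddings, output head) is treated
    as frozen in this sketch.
    """
    first_trainable = num_layers - top_k
    trainable = set()
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) >= first_trainable:
            trainable.add(name)
    return trainable


if __name__ == "__main__":
    names = ["embed.weight"]
    names += [f"layers.{i}.attn.weight" for i in range(32)]
    names += ["lm_head.weight"]
    # With 32 layers and top_k=4, only layers 28..31 are selected.
    print(sorted(select_trainable(names, num_layers=32)))
```

In a real training framework, the returned set would drive which parameters have gradients enabled; the exact mechanism depends on the library in use.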
Training Optimization
- SFT
- reward-model training from ranked pairs
- PPO-based RLHF with KL penalty
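The KL penalty in the last item is often implemented by shaping the per-token reward before PPO sees it. The sketch below shows one common formulation, under assumptions: the reward-model score is added only at the final token, the penalty uses the per-token log-probability gap between the policy and a frozen reference (SFT) model, and `beta=0.1` is a placeholder rather than a value from the source. The function name is hypothetical.

```python
def kl_penalized_rewards(reward_model_score, policy_logprobs,
                         reference_logprobs, beta=0.1):
    """Build per-token rewards: a KL-style penalty on every token plus the
    reward-model score on the last token.

    policy_logprobs / reference_logprobs: log-probabilities of the sampled
    tokens under the current policy and the frozen reference model.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Penalise tokens the policy makes more likely than the reference does.
    rewards = [-beta * (p - r)
               for p, r in zip(policy_logprobs, reference_logprobs)]
    if rewards:
        rewards[-1] += reward_model_score
    return rewards


if __name__ == "__main__":
    print(kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1))
```

A larger `beta` keeps the RLHF model closer to the SFT model, which can matter when the preference signal comes from an automatic ranker that may be noisy.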
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Only 26 target languages; many world languages missing.
- Base models limited to BLOOM-7B and LLaMA-7B; larger/smaller scales not evaluated.
- Instruction and ranking data were generated and translated using ChatGPT, which can introduce noise and bias.
- Evaluation covers knowledge and reasoning benchmarks only; toxicity, hallucination, and fairness are not measured.
When Not To Use
- When you need rigorously human-curated instruction data in a language (ChatGPT translations may be noisy).
- For high-stakes or safety-critical domains without human verification of outputs.
- If you lack ranking labels or a way to validate automatic rankers; RLHF needs reliable preference signals.
Failure Modes
- Translation errors or inconsistencies from ChatGPT that propagate to training data.
- Ranker bias from using ChatGPT as the preference judge instead of humans.
- Limited or no improvement on specialized professional knowledge tasks (MMLU).
- Small gains for low-resource languages; model may underperform without extra data or native feedback.
Core Entities
Models
- BLOOM-7B
- LLaMA-7B
- BLOOMZ
Metrics
- Accuracy
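For multiple-choice benchmarks such as the translated ARC, HellaSwag, and MMLU sets, accuracy is the fraction of questions where the predicted choice matches the gold choice. A minimal sketch of a per-language breakdown follows; the record layout and the function name `accuracy_by_language` are assumptions, and the sample records are toy values, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """records: iterable of (language, predicted_choice, gold_choice)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, gold in records:
        total[language] += 1
        correct[language] += int(predicted == gold)
    return {language: correct[language] / total[language]
            for language in total}


if __name__ == "__main__":
    # Toy records for illustration only.
    toy = [("vi", "A", "A"), ("vi", "B", "C"), ("de", "D", "D")]
    print(accuracy_by_language(toy))
```

Running the same function on SFT and RLHF predictions gives a per-language comparison, which makes it easier to see whether gains are concentrated in higher-resource languages.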
Datasets
- Alpaca (52K)
- Okapi generated instructions (106K)
- ARC (translated)
- HellaSwag (translated)
- MMLU (translated)
- Self-Instruct
Benchmarks
- ARC
- HellaSwag
- MMLU
Context Entities
Datasets
- CommonCrawl (for language resource categorization)

