Okapi: first open-source RLHF instruction-tuned LLMs across 26 languages

July 29, 2023, 7 min read

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

3

Authors

Viet Dac Lai, Chien Van Nguyen, Nghia Trung Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

Why It Matters For Business

If you need multilingual chat or QA features, investing in translated instructions and RLHF can yield measurable accuracy gains and broader language coverage while keeping models open-source.

Summary TLDR

Okapi builds multilingual instruction datasets, ranked responses, and translated benchmarks in 26 languages, then uses supervised fine-tuning (SFT) plus reinforcement learning from human feedback (RLHF) to produce multilingual LLMs. The authors generate 158K English instructions (52K Alpaca + 106K new), translate them, obtain 42K ranked responses per language, and train BLOOM-7B and LLaMA-7B variants. On the ARC and HellaSwag benchmarks, RLHF consistently improves over SFT (up to ~2.5 percentage points on HellaSwag with LLaMA). Gains are smaller on MMLU and on low-resource languages. All datasets, prompts, and models are released on GitHub.

Problem Statement

Most open-source instruction-tuned LLMs focus on English. There is little work applying RLHF (reward-based fine-tuning) to many languages because ranking data is scarce. That limits access and real-world usefulness for non-English speakers.

Main Contribution

Created a multilingual instruction corpus: 158K English instructions (52K Alpaca + 106K newly generated), which were then translated into 26 languages.

Produced ranked-response data (42K ranked examples per language) by translating examples to English and using ChatGPT as the ranker; used these to train reward models.

Built and released RLHF and SFT fine-tuned models (BLOOM-7B and LLaMA-7B), and translated three evaluation benchmarks (ARC, HellaSwag, MMLU) into 26 languages; experiments show RLHF > SFT on most tasks/languages.

Key Findings

RLHF improves multilingual instruction-following over SFT on average.

Numbers: BLOOM average accuracy: SFT 28.4 -> RLHF 30.0 (+1.6)

RLHF delivers larger gains on commonsense and basic-knowledge tasks than on professional knowledge tasks.

Numbers: LLaMA HellaSwag: SFT 47.1 -> RLHF 49.6 (+2.5); LLaMA MMLU: SFT 30.1 -> RLHF 30.8 (+0.7)

Low-resource languages see the smallest gains from instruction tuning and RLHF.

Numbers: Group averages show high-resource > medium-resource > low-resource; RLHF gains smallest for low-resource group (BLOOM)

Okapi's instruction set beats a large cross-lingual baseline on many tasks.

Numbers: RLHF > BLOOMZ by ~4.8% on HellaSwag average (BLOOM-based models)
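
As a quick sanity check, the SFT-to-RLHF gains quoted in the findings above can be recomputed from the reported accuracies. The sketch below uses only numbers stated in this summary; the dictionary keys and the `gain` helper are illustrative names, not part of the Okapi release.

```python
# Accuracies (in %) as quoted in the Key Findings: (SFT, RLHF).
REPORTED = {
    "BLOOM average": (28.4, 30.0),
    "LLaMA HellaSwag": (47.1, 49.6),
    "LLaMA MMLU": (30.1, 30.8),
}


def gain(sft: float, rlhf: float) -> float:
    """Absolute gain of RLHF over SFT in percentage points, rounded to 1 decimal."""
    return round(rlhf - sft, 1)


# Hypothetical helper table: benchmark name -> gain in percentage points.
GAINS = {name: gain(sft, rlhf) for name, (sft, rlhf) in REPORTED.items()}

if __name__ == "__main__":
    for name, delta in GAINS.items():
        print(f"{name}: +{delta} points")
```

The recomputed values agree with the deltas quoted above (+1.6, +2.5, +0.7), and they make the contrast between commonsense tasks and MMLU easy to see at a glance.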

Results

Accuracy: RLHF 30.0 (baseline: SFT 28.4)

Accuracy: RLHF 39.5 (baseline: SFT 37.7)

Accuracy: RLHF 49.6 (baseline: SFT 47.1)

Accuracy: RLHF 26.9 (baseline: SFT 26.6)

What To Try In 7 Days

Translate a small set (10–20k) of your English instructions into your target languages, using a translation system whose output quality you have already checked.

Generate 3–5 responses per prompt, then use an automatic ranker (or a small human panel) to collect preference labels.
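
One common way to turn a ranking of several responses into training signal is to expand it into pairwise (chosen, rejected) examples. The sketch below assumes the ranker returns responses ordered best to worst; the function name `ranking_to_pairs` is hypothetical, and the source does not say that Okapi builds its pairs exactly this way.

```python
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranked: List[str]) -> List[Tuple[str, str]]:
    """Turn responses ordered best-to-worst into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranked, 2))


if __name__ == "__main__":
    # Toy ranking of four responses for a single prompt.
    pairs = ranking_to_pairs(["best", "good", "weak", "worst"])
    print(len(pairs), "pairs; first pair:", pairs[0])
```

A ranking of k responses yields k(k-1)/2 pairs, so 3 to 5 responses per prompt give 3 to 10 preference pairs each.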

Train a reward model and run PPO for a few epochs while freezing most layers; compare SFT vs RLHF on a held-out multilingual benchmark subset.
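
For the SFT versus RLHF comparison, a per-language accuracy breakdown is usually more informative than a single average. The following is a minimal sketch on toy data; the record layout (language code, gold answer, predicted answer) and the function names are assumptions made for illustration, not the paper's evaluation harness.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (language code, gold answer, predicted answer).
Record = Tuple[str, str, str]


def accuracy_by_language(records: List[Record]) -> Dict[str, float]:
    """Fraction of correct predictions per language."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for lang, gold, pred in records:
        total[lang] += 1
        correct[lang] += int(gold == pred)
    return {lang: correct[lang] / total[lang] for lang in total}


def compare(sft: List[Record], rlhf: List[Record]) -> Dict[str, float]:
    """Per-language accuracy difference (RLHF minus SFT)."""
    acc_sft = accuracy_by_language(sft)
    acc_rlhf = accuracy_by_language(rlhf)
    # Only languages present in both runs are compared.
    return {lang: acc_rlhf[lang] - acc_sft[lang] for lang in acc_sft if lang in acc_rlhf}


if __name__ == "__main__":
    # Toy predictions on a two-language held-out subset.
    sft_run = [("vi", "A", "A"), ("vi", "B", "C"), ("de", "A", "A"), ("de", "D", "B")]
    rlhf_run = [("vi", "A", "A"), ("vi", "B", "B"), ("de", "A", "A"), ("de", "D", "C")]
    print(compare(sft_run, rlhf_run))
```

Keeping the breakdown per language makes it easier to spot the pattern reported in the paper, where low-resource languages gain the least.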

Agent Features

Frameworks

  • PPO

Architectures

  • decoder-only Transformer

Optimization Features

System Optimization

  • freeze most model layers during PPO; train top-4 layers only for RLHF
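
A minimal sketch of the layer-selection logic follows, assuming "top-4" means the four transformer blocks closest to the output and that parameter names contain a `layers.<index>.` segment. Both are assumptions: real checkpoints use different naming schemes, and the summary does not say whether the output head is also trained (it stays frozen here). `trainable_names` is a hypothetical helper.

```python
import re
from typing import Iterable, Set

# Assumed naming scheme: block parameters contain "layers.<index>.".
# Adapt this pattern to the checkpoint you actually load.
LAYER_PATTERN = re.compile(r"layers\.(\d+)\.")


def trainable_names(param_names: Iterable[str], num_layers: int, top_k: int = 4) -> Set[str]:
    """Return the names of parameters in the last `top_k` transformer blocks."""
    first_trainable = max(num_layers - top_k, 0)
    selected = set()
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) >= first_trainable:
            selected.add(name)
    return selected


if __name__ == "__main__":
    # Toy 6-block model; embeddings and head have no layer index and stay frozen.
    names = [f"layers.{i}.weight" for i in range(6)] + ["embed.weight", "head.weight"]
    print(sorted(trainable_names(names, num_layers=6)))
```

In a PyTorch training loop, one would then set `requires_grad = False` on every parameter whose name is not in the returned set, which cuts optimizer memory during PPO.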

Training Optimization

  • SFT
  • reward-model training from ranked pairs
  • PPO-based RLHF with KL penalty
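
The last two items can be illustrated with the formulas commonly used in RLHF pipelines. The summary does not spell out Okapi's exact objectives, so treat the following as a sketch under assumptions: a Bradley-Terry style pairwise loss for the reward model, and a reward shaped by subtracting `beta` times a sampled KL estimate between the policy and a frozen reference model. The function names and the default `beta` are illustrative.

```python
import math
from typing import Sequence


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(
    reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus beta times a per-sequence KL estimate."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-probability sequences must have equal length")
    # Sum of per-token log-ratios for the sampled response.
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate


if __name__ == "__main__":
    print(pairwise_reward_loss(2.0, 0.0))  # small loss: chosen scored higher
    print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5]))
```

The pairwise loss shrinks as the reward model scores the preferred response higher, while the KL term discourages the policy from drifting far from the reference model during PPO.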

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only 26 target languages; many world languages missing.
  • Base models limited to BLOOM-7B and LLaMA-7B; larger/smaller scales not evaluated.
  • Instruction and ranking data were generated and translated using ChatGPT, which can introduce noise and bias.
  • Evaluation covers knowledge and reasoning benchmarks only; toxicity, hallucination, and fairness are not measured.

When Not To Use

  • When you need rigorously human-curated instruction data in a language (ChatGPT translations may be noisy).
  • For high-stakes or safety-critical domains without human verification of outputs.
  • If you lack ranking labels or a way to validate automatic rankers; RLHF needs reliable preference signals.

Failure Modes

  • Translation errors or inconsistencies from ChatGPT that propagate to training data.
  • Ranker bias from using ChatGPT as the preference judge instead of humans.
  • Limited or no improvement on specialized professional knowledge tasks (MMLU).
  • Small gains for low-resource languages; model may underperform without extra data or native feedback.

Core Entities

Models

  • BLOOM-7B
  • LLaMA-7B
  • BLOOMZ

Metrics

  • Accuracy

Datasets

  • Alpaca (52K)
  • Okapi generated instructions (106K)
  • ARC (translated)
  • HellaSwag (translated)
  • MMLU (translated)
  • Self-Instruct

Benchmarks

  • ARC
  • HellaSwag
  • MMLU

Context Entities

Datasets

  • CommonCrawl (for language resource categorization)