SeaLLMs: language models tuned and tokenized for Southeast Asian languages

December 1, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

7

Authors

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing

Why It Matters For Business

SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.

Summary TLDR

SeaLLMs are a family of LLMs adapted to Southeast Asian (SEA) languages. The team extended tokenizers for non-Latin scripts, continued pretraining from popular backbones, and applied multilingual supervised fine-tuning (SFT) and a self-preference alignment step. Evaluations on Sea-bench, M3Exam, GSM8K/MATH, and Flores-200 show strong gains over open models and ChatGPT-3.5 on low-resource, non-Latin SEA languages, while the models remain compact (7B–13B).

Problem Statement

Most large models favor high-resource languages. Non‑Latin SEA languages suffer high tokenization cost and data scarcity, causing worse accuracy and instruction following. There was also no assistant-style multilingual benchmark covering these languages.

Main Contribution

SeaLLM models (7B and 13B) specialized for SEA languages via tokenizer expansion, continued pretraining, multilingual SFT, and self‑preference alignment.

A vocabulary expansion algorithm that imports tokens from a rich multilingual tokenizer (NLLB) and prunes rare tokens to reduce tokenization bloat for non‑Latin scripts.

Sea-bench: a multilingual, assistant-style test set (built with native linguists) plus GPT-4-based grading to evaluate performance across SEA languages and categories.

Demonstrated that compact models (7B) can match or beat larger or closed-source baselines on low-resource SEA languages and certain reasoning tasks.
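The vocabulary expansion idea listed above (import tokens from a richer multilingual tokenizer, then drop the rare ones) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `expand_vocab`, the `min_count` pruning threshold, and the toy corpus are assumptions made for illustration, and whitespace splitting merely stands in for a real donor tokenizer such as NLLB's.

```python
from collections import Counter

def expand_vocab(base_vocab, donor_tokenize, corpus, min_count=2):
    """Return (expanded vocabulary, list of newly added tokens).

    base_vocab:     set of tokens the backbone tokenizer already has
    donor_tokenize: callable mapping a string to a list of donor tokens
                    (a stand-in for a richer multilingual tokenizer)
    corpus:         iterable of target-language strings
    min_count:      hypothetical pruning threshold; rarer tokens are dropped
    """
    counts = Counter()
    for text in corpus:
        counts.update(donor_tokenize(text))

    # Keep donor tokens that are new to the base vocabulary and frequent
    # enough in the target-language corpus to justify a new entry.
    new_tokens = sorted(
        tok for tok, n in counts.items()
        if tok not in base_vocab and n >= min_count
    )
    return set(base_vocab) | set(new_tokens), new_tokens


# Toy example: whitespace splitting stands in for the donor tokenizer.
base = {"hello", "world"}
corpus = ["สวัสดี ครับ", "สวัสดี world", "ขอบคุณ ครับ"]
vocab, added = expand_vocab(base, str.split, corpus, min_count=2)
```

In this toy run, the two Thai tokens that occur twice are added, while the one that occurs once is pruned. A real pipeline would also need to resize and initialize the model's embedding rows for the new tokens, which this sketch does not cover.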

Key Findings

Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.

Numbers: Thai token ratio dropped from 9.09 to 1.87 with SeaLLM's tokenizer (Table 1)

SeaLLM-7B-v2.5 is competitive with ChatGPT-3.5 on multilingual world knowledge despite its 7B scale.

Numbers: M3Exam English: 76.87 (SeaLLM-7B-v2.5) vs 75.46 (ChatGPT-3.5), Table 2

Large gains versus ChatGPT-3.5 on low-resource non‑Latin languages.

Numbers: Thai M3Exam: 46.86 vs 37.41 (+9.45); Burmese, Khmer, and Lao also show large gains (Table 2 and Fig. 5)

Strong math and reasoning results after targeted SFT and scaling of synthetic data.

Numbers: GSM8K: 78.5 (SeaLLM-7B-v2.5) vs 80.8 (ChatGPT-3.5); MATH: 34.9 vs 34.1 (Table 4)

Sea-bench created and used with GPT-4 as a judge for assistant-style evaluation.

Numbers: Sea-bench evaluates 5 categories across 9 languages (Section 4.2)
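An assistant-style benchmark graded by a judge model is usually reported as average scores per language and per category. The sketch below shows one way to aggregate such judge outputs. The record layout, the category names, and every score in it are made up for illustration; they are not Sea-bench data or the paper's grading format.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by(records, field):
    """Average judge scores grouped by the given field name."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec["score"])
    return {key: mean(vals) for key, vals in groups.items()}

# Hypothetical judge outputs; languages, categories, and scores are
# placeholders, not real benchmark results.
records = [
    {"language": "th", "category": "math", "score": 6},
    {"language": "th", "category": "safety", "score": 8},
    {"language": "km", "category": "math", "score": 4},
    {"language": "km", "category": "safety", "score": 7},
]
by_language = mean_score_by(records, "language")
by_category = mean_score_by(records, "category")
```

Reporting both views helps separate a weak language from a weak task type, which is where a single overall average tends to hide problems.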

Results

M3Exam (English)

Value: SeaLLM-7B-v2.5 76.87

Baseline: ChatGPT-3.5 75.46

M3Exam (Thai)

Value: SeaLLM-7B-v2.5 46.86

Baseline: ChatGPT-3.5 37.41

MT-bench (English assistant score)

Value: SeaLLM-7B-v2 7.54

Baseline: GPT-4-turbo 9.32

GSM8K (math)

Value: SeaLLM-7B-v2.5 78.5

Baseline: ChatGPT-3.5 80.8

MATH (math benchmark)

Value: SeaLLM-7B-v2.5 34.9

Baseline: ChatGPT-3.5 34.1
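To make the gaps in the Results above easier to compare, the following sketch computes absolute and relative differences for the four rows that pair SeaLLM-7B-v2.5 with ChatGPT-3.5. The scores are copied from this page; the helper name `deltas` is an assumption. The MT-bench row is left out because it compares a different model version against a different baseline.

```python
# (benchmark, SeaLLM-7B-v2.5 score, ChatGPT-3.5 score), copied from Results.
results = [
    ("M3Exam (English)", 76.87, 75.46),
    ("M3Exam (Thai)", 46.86, 37.41),
    ("GSM8K", 78.5, 80.8),
    ("MATH", 34.9, 34.1),
]

def deltas(rows):
    """Return {benchmark: (absolute gap in points, relative gap in %)}."""
    out = {}
    for name, model, baseline in rows:
        gap = round(model - baseline, 2)
        rel = round(100 * (model - baseline) / baseline, 1)
        out[name] = (gap, rel)
    return out

for name, (gap, rel) in deltas(results).items():
    print(f"{name}: {gap:+.2f} points ({rel:+.1f}%)")
```

The Thai gap comes out at 9.45 points, matching the figure quoted in Key Findings, while GSM8K is the one row where the baseline stays ahead.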

Who Should Care

What To Try In 7 Days

Test SeaLLM-7B-v2/v2.5 on your Thai/Khmer/Burmese user flows to measure user-facing quality gains.

Where your traffic is mostly non-Latin SEA text, trial SeaLLM models and their expanded tokenizers in place of English-centric pipelines, and measure the change in token counts.

Use Sea-bench examples to augment your internal multilingual evaluation set and find failure modes quickly.
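For the tokenizer trial suggested above, a small helper can compare token counts on your own text before you commit to a model swap. This is a sketch under stated assumptions: `token_ratio` is a hypothetical helper, and the byte-level and character-level functions are toy stand-ins so the example runs without downloading anything. The ratio here is a plain same-text comparison and may not match the definition behind the paper's Table 1.

```python
def token_ratio(texts, candidate_tokenize, reference_tokenize):
    """Total candidate tokens divided by total reference tokens."""
    cand = sum(len(candidate_tokenize(t)) for t in texts)
    ref = sum(len(reference_tokenize(t)) for t in texts)
    if ref == 0:
        raise ValueError("reference tokenizer produced no tokens")
    return cand / ref

# Toy stand-ins for real tokenizers (assumptions for illustration only).
def byte_tokenize(text):
    """One token per UTF-8 byte, a worst case for non-Latin scripts."""
    return list(text.encode("utf-8"))

def char_tokenize(text):
    """One token per Unicode code point."""
    return list(text)

sample = ["สวัสดี"]  # 6 code points, 18 UTF-8 bytes
ratio = token_ratio(sample, byte_tokenize, char_tokenize)
```

In practice you would pass the tokenize function of each real tokenizer you are comparing and run it over a representative sample of production text rather than a single word.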

Optimization Features

Token Efficiency

  • reduced tokenization cost for non-Latin scripts (see Table 1)
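As a worked example of what the reduced tokenization cost means, the arithmetic below uses the Thai token ratios quoted on this page (9.09 before, 1.87 after). The helper name `token_savings` is an assumption, and the result only translates into a cost saving if pricing or latency scales with token count for the same text.

```python
def token_savings(ratio_before, ratio_after):
    """Return (shrink factor, fraction of tokens saved)."""
    factor = ratio_before / ratio_after
    saved = 1 - ratio_after / ratio_before
    return factor, saved

factor, saved = token_savings(9.09, 1.87)
print(f"{factor:.2f}x fewer tokens, {saved:.1%} saved")
```

For these two ratios the factor is about 4.86, which corresponds to roughly 79% fewer tokens for the same Thai text.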

Model Optimization

  • vocabulary expansion to add language-specific tokens

Training Optimization

  • SFT
  • self-preferencing direct preference optimization (no external RLHF)
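For readers unfamiliar with direct preference optimization (DPO), the sketch below computes the standard DPO loss for a single preference pair. It shows the generic published objective only. How SeaLLM builds its preference pairs through self-preferencing is not described on this page and is not modeled here, and `beta=0.1` is an assumed default rather than a value from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(x)), written in a numerically stable form
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# When the policy equals the reference, the loss is log(2).
baseline_loss = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response's likelihood relative to the rejected one
# lowers the loss.
improved_loss = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss only needs log-probabilities from the policy and a frozen reference model, which is why this style of alignment avoids training a separate reward model.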

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Covers 9 common SEA languages but omits many others (e.g., Javanese, Tamil).
  • Models still exhibit moderate hallucination and degeneration for some languages (Burmese, Lao).
  • Sea-bench uses GPT-4 as judge, which may carry judge bias and tokenization blind spots for non-Latin text.

When Not To Use

  • When you need coverage for SEA languages not included in the nine supported languages.
  • If absolute hallucination-free output is required in low-resource languages.
  • Where regulatory or privacy constraints forbid use of models trained on web-scraped corpora without full provenance.

Failure Modes

  • Hallucination and response degeneration in certain low-resource languages (noted for Burmese and Lao).
  • Residual tokenization inefficiency if tokenizer extension is not applied for older backbones.
  • Possible judge bias from GPT-4 in Sea-bench evaluations for non‑Latin scripts.

Core Entities

Models

  • SeaLLM-7B-v1
  • SeaLLM-13B-v1
  • SeaLLM-7B-v2
  • SeaLLM-7B-v2.5
  • Llama-2-7B
  • Llama-2-13B
  • Mistral-7B
  • Gemma-7B

Metrics

  • chrF++ (translation)
  • compression ratio (tokenization)
  • Accuracy
  • MATH score
  • MT-bench score

Datasets

  • Sea-bench
  • M3Exam
  • MMLU
  • GSM8K
  • MATH
  • Flores-200
  • RedPajama
  • CommonCrawl
  • CC-News
  • Wikipedia

Benchmarks

  • Sea-bench
  • MT-bench
  • M3Exam
  • MMLU
  • GSM8K
  • MATH
  • Flores-200