SeaLLMs: language models tuned and tokenized for Southeast Asian languages

December 1, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives concrete evaluation numbers across multiple benchmarks and languages, but some datasets and full training artifacts are not fully detailed, so practical adoption needs in-house validation.

Citations: 7

Evidence Strength: 0.75

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.

Summary TLDR

SeaLLMs are a family of LLMs adapted to Southeast Asian (SEA) languages. The team extended the tokenizer for non‑Latin scripts, continued pretraining from popular backbones, applied multilingual supervised fine-tuning (SFT), and added a self‑preference alignment step. Evaluations (Sea-bench, M3Exam, GSM8K/MATH, Flores-200) show strong gains over open models and ChatGPT-3.5 on low-resource, non‑Latin SEA languages while the models remain compact (7B–13B).

Problem Statement

Most large models favor high-resource languages. Non‑Latin SEA languages suffer from high tokenization cost and data scarcity, which lowers accuracy and weakens instruction following. There was also no assistant-style multilingual benchmark covering these languages.

Main Contribution

SeaLLM models (7B and 13B) specialized for SEA languages via tokenizer expansion, continued pretraining, multilingual SFT, and self‑preference alignment.

A vocabulary expansion algorithm that imports tokens from a rich multilingual tokenizer (NLLB) and prunes rare tokens to reduce tokenization bloat for non‑Latin scripts.
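
The import-then-prune idea can be illustrated with a few lines of Python. This is a minimal sketch and not the paper's algorithm: the function name `expand_vocabulary`, the `min_count` and `max_new_tokens` parameters, and the whitespace-based donor tokenizer are all assumptions made for illustration. A real run would use the NLLB tokenizer as the donor and a large SEA-language corpus.

```python
from collections import Counter
from typing import Callable, Iterable


def expand_vocabulary(
    base_vocab: set[str],
    donor_tokenize: Callable[[str], list[str]],
    corpus: Iterable[str],
    min_count: int = 2,          # assumed pruning threshold
    max_new_tokens: int = 1000,  # assumed cap on added tokens
) -> list[str]:
    """Return donor tokens worth adding to the base vocabulary.

    Tokens are counted over the corpus with the donor tokenizer. Tokens
    already in the base vocabulary, or seen fewer than `min_count` times,
    are pruned; the most frequent survivors are kept.
    """
    counts: Counter = Counter()
    for text in corpus:
        counts.update(donor_tokenize(text))

    candidates = [
        (token, count)
        for token, count in counts.items()
        if token not in base_vocab and count >= min_count
    ]
    # Most frequent first; ties broken by token text for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [token for token, _ in candidates[:max_new_tokens]]


def toy_donor_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a multilingual donor tokenizer."""
    return text.split()


base_vocab = {"hello", "world"}
corpus = ["สวัสดี ครับ hello", "สวัสดี world", "ขอบคุณ ครับ", "rare"]
new_tokens = expand_vocabulary(base_vocab, toy_donor_tokenize, corpus)
print(new_tokens)
```

After selecting tokens this way, the model's embedding matrix has to grow to match the new vocabulary before continued pretraining; that step is outside this sketch.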

Key Findings

Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.

Numbers: Thai token ratio improved from 9.09 to 1.87 (SeaLLM's tokenizer, Table 1)

Practical Use: Add language-specific tokens before pretraining to fit more non‑Latin text into the same context window and improve model utility in those languages.

Evidence Ref: Table 1 (Vocabulary compression ratios)
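
To check a ratio like this on your own data, a small helper is enough. The sketch below assumes the ratio is the token count of target-language text divided by the token count of parallel English text; the exact definition behind Table 1 is not given in this summary. The byte-level tokenizer is a toy stand-in for a tokenizer with no Thai vocabulary, and the 4,096-token context window is an illustrative assumption.

```python
from typing import Callable, Sequence


def token_ratio(
    tokenize: Callable[[str], Sequence],
    target_texts: Sequence[str],
    english_texts: Sequence[str],
) -> float:
    """Tokens needed for target-language text per token of parallel English."""
    target_tokens = sum(len(tokenize(t)) for t in target_texts)
    english_tokens = sum(len(tokenize(t)) for t in english_texts)
    if english_tokens == 0:
        raise ValueError("english_texts produced no tokens")
    return target_tokens / english_tokens


def effective_context(context_tokens: int, ratio: float) -> float:
    """English-equivalent amount of text that fits in the context window."""
    return context_tokens / ratio


def byte_tokenize(text: str) -> list[int]:
    """Toy tokenizer: one token per UTF-8 byte (hypothetical worst case)."""
    return list(text.encode("utf-8"))


# Toy parallel pair; real measurements need a proper parallel corpus.
print(token_ratio(byte_tokenize, ["สวัสดี"], ["hello"]))

# Context gain implied by the reported Thai ratios (9.09 before, 1.87 after).
CONTEXT = 4096  # assumed window size, for illustration only
gain = effective_context(CONTEXT, 1.87) / effective_context(CONTEXT, 9.09)
print(f"about {gain:.2f}x more Thai text per window")
```

The gain is independent of the window size because it reduces to the quotient of the two ratios.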

SeaLLM-7B-v2.5 is competitive with ChatGPT-3.5 on multilingual world knowledge at its scale.

Numbers: M3Exam English: 76.87 (SeaLLM-7B-v2.5) vs 75.46 (ChatGPT-3.5), Table 2

Practical Use: A tuned 7B model can replace larger or closed models for many SEA-language knowledge tasks, reducing cost.

Evidence Ref: Table 2 (M3Exam / MMLU comparisons)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
M3Exam (English) | SeaLLM-7B-v2.5 76.87 | ChatGPT-3.5 75.46 | +1.41 | M3Exam (Eng) | Table 2 shows M3Exam scores across models | Table 2
M3Exam (Thai) | SeaLLM-7B-v2.5 46.86 | ChatGPT-3.5 37.41 | +9.45 | M3Exam (Tha) | Table 2 shows larger gains in non-Latin languages | Table 2
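
The Delta values are plain differences between the model score and the baseline. The short sketch below recomputes them from the two rows reported here and adds a relative gain, which makes the size of the Thai improvement easier to compare with the English one. The `relative_pct` field is an addition for illustration and does not appear in the paper's tables.

```python
results = [
    {"split": "M3Exam (Eng)", "model": 76.87, "baseline": 75.46},
    {"split": "M3Exam (Tha)", "model": 46.86, "baseline": 37.41},
]


def with_deltas(rows):
    """Attach absolute and relative gains over the baseline to each row."""
    scored = []
    for row in rows:
        delta = row["model"] - row["baseline"]
        scored.append(
            {
                **row,
                "delta": round(delta, 2),
                # Gain as a percentage of the baseline score.
                "relative_pct": round(100 * delta / row["baseline"], 1),
            }
        )
    return scored


scored = with_deltas(results)
for row in scored:
    print(f'{row["split"]}: {row["delta"]:+.2f} points ({row["relative_pct"]}% relative)')
```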

What To Try In 7 Days

Test SeaLLM-7B-v2/v2.5 on your Thai/Khmer/Burmese user flows to measure user-facing quality gains.

Swap English-centric tokenizers for the SeaLLM tokenizer when ingesting non‑Latin SEA text to reduce token costs.

Use Sea-bench examples to augment your internal multilingual evaluation set and find failure modes quickly.
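
A pilot along these lines needs little more than a per-language pass rate. The harness below is a minimal sketch: `evaluate_by_language`, the case format, and the substring check are assumptions for illustration, and `stub_generate` is a hypothetical stand-in for a call to whichever model you are testing. Real cases would be written in the target language and scored with a stricter check or human review.

```python
from collections import defaultdict
from typing import Callable


def evaluate_by_language(
    cases: list[dict],
    generate: Callable[[str], str],
) -> dict[str, float]:
    """Return the fraction of passing cases for each language code."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for case in cases:
        response = generate(case["prompt"])
        totals[case["lang"]] += 1
        # Assumed pass criterion: expected answer appears in the response.
        if case["expected"].lower() in response.lower():
            passed[case["lang"]] += 1
    return {lang: passed[lang] / totals[lang] for lang in totals}


# Hypothetical canned answers standing in for a real model call.
CANNED = {
    "2+2?": "The answer is 4.",
    "Capital of Thailand?": "I am not sure.",
    "Capital of Cambodia?": "Phnom Penh",
}


def stub_generate(prompt: str) -> str:
    return CANNED.get(prompt, "")


cases = [
    {"lang": "th", "prompt": "2+2?", "expected": "4"},
    {"lang": "th", "prompt": "Capital of Thailand?", "expected": "Bangkok"},
    {"lang": "km", "prompt": "Capital of Cambodia?", "expected": "Phnom Penh"},
]
report = evaluate_by_language(cases, stub_generate)
print(report)
```

Running the same cases against your current model and against SeaLLM gives a side-by-side view of where each one fails.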

Optimization Features

Token Efficiency
reduced tokenization cost for non-Latin scripts (see Table 1)
Model Optimization
vocabulary expansion to add language-specific tokens
Training Optimization
SFT, self-preferencing direct preference optimization (no external RLHF)
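
For readers unfamiliar with direct preference optimization (DPO), the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It shows only the generic objective: how SeaLLMs build their "self-preferencing" pairs is not detailed in this summary, and the `beta` value is an arbitrary assumption.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; tune per setup
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch keeps exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=0.1))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against a frozen reference model, so no separate reward model is needed.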

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Covers 9 common SEA languages but omits many others (e.g., Javanese, Tamil).

Models still exhibit moderate hallucination and degeneration for some languages (Burmese, Lao).

When Not To Use

When you need coverage for SEA languages not included in the nine supported languages.

If strictly hallucination-free output is required in low-resource languages.

Failure Modes

Hallucination and response degeneration in certain low-resource languages (noted for Burmese and Lao).

Residual tokenization inefficiency if tokenizer extension is not applied for older backbones.
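
Response degeneration often shows up as heavy repetition, which a cheap heuristic can flag during a pilot. The sketch below is a generic check and not something the paper describes: it measures the share of repeated character n-grams, using characters rather than words because Burmese and Lao are written without spaces between words. The function names, `n`, and `threshold` are assumptions to tune on your own data.

```python
def repeated_ngram_fraction(text: str, n: int = 5) -> float:
    """Fraction of character n-grams that repeat an earlier n-gram."""
    if len(text) < n:
        return 0.0
    ngrams = [text[i : i + n] for i in range(len(text) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def looks_degenerate(text: str, n: int = 5, threshold: float = 0.5) -> bool:
    """Flag a response when at least `threshold` of its n-grams are repeats."""
    return repeated_ngram_fraction(text, n) >= threshold


print(looks_degenerate("abcabcabcabcabcabc", n=3))
print(looks_degenerate("The quick brown fox", n=3))
```

A flag from this check is only a prompt for manual review; it does not detect hallucination, which needs reference answers or human judgment.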

Core Entities

Models

SeaLLM-7B-v1, SeaLLM-13B-v1, SeaLLM-7B-v2, SeaLLM-7B-v2.5, Llama-2-7B, Llama-2-13B, Mistral-7B, Gemma-7B

Metrics

chrF++ (translation), compression ratio (tokenization), Accuracy, MATH score, MT-bench score

Datasets

Sea-bench, M3Exam, MMLU, GSM8K, MATH, Flores-200, RedPajama, CommonCrawl, CC-News, Wikipedia

Benchmarks

Sea-bench, MT-bench, M3Exam, MMLU, GSM8K, MATH, Flores-200