SeaLLMs: language models tuned and tokenized for Southeast Asian languages

Overview

Decision Snapshot: Ready For Pilot

The paper gives concrete evaluation numbers across multiple benchmarks and languages, but some datasets and full training artifacts are not fully detailed, so practical adoption needs in-house validation.

Citations: 7

Evidence Strength: 0.75

Confidence: 0.70

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Zhiqiang Hu, Chenhui Shen, Yew Ken Chia, Xingxuan Li, Jianyu Wang, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, Lidong Bing


Why It Matters For Business

SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.

Who Should Care

ML Engineer, Product Manager, Data Scientist, CTO

Summary TLDR

SeaLLMs are a family of LLMs adapted to Southeast Asian (SEA) languages. The team extended tokenizers for non‑Latin scripts, continued pretraining from popular backbones, and applied multilingual supervised fine-tuning (SFT) followed by a self‑preference alignment step. Evaluations (Sea-bench, M3Exam, GSM8K/MATH, Flores-200) show strong gains over open models and ChatGPT-3.5 on low-resource, non‑Latin SEA languages, and the models remain compact (7B to 13B).

Problem Statement

Most large models favor high-resource languages. Non‑Latin SEA languages suffer high tokenization cost and data scarcity, causing worse accuracy and instruction following. There was also no assistant-style multilingual benchmark covering these languages.

Main Contribution

SeaLLM models (7B and 13B) specialized for SEA languages via tokenizer expansion, continued pretraining, multilingual SFT, and self‑preference alignment.

A vocabulary expansion algorithm that imports tokens from a rich multilingual tokenizer (NLLB) and prunes rare tokens to reduce tokenization bloat for non‑Latin scripts.
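The import-then-prune idea described above can be sketched in a few lines. This is a minimal illustration and not the authors' algorithm: the function name `expand_vocabulary`, the `min_count` threshold, and the whitespace-based donor tokenizer are all assumptions made for the example, and the summary does not state how rare tokens are actually identified.

```python
from collections import Counter


def expand_vocabulary(base_vocab, donor_tokenize, corpus, min_count=2):
    """Return base_vocab plus donor tokens seen at least min_count times.

    base_vocab: list of tokens already in the model's tokenizer.
    donor_tokenize: callable mapping text -> list of tokens; stands in for
        a richer multilingual tokenizer (hypothetical here).
    corpus: iterable of target-language texts used to count token usage.
    min_count: assumed pruning threshold; rarer donor tokens are dropped.
    """
    counts = Counter()
    for text in corpus:
        counts.update(donor_tokenize(text))

    known = set(base_vocab)
    # Import frequent donor tokens the base vocabulary lacks; prune rare ones.
    new_tokens = [
        token
        for token, count in counts.most_common()
        if count >= min_count and token not in known
    ]
    return list(base_vocab) + new_tokens


# Toy usage: whitespace splitting stands in for the donor tokenizer.
toy_corpus = ["สวัสดี ครับ", "สวัสดี โลก", "hello โลก"]
expanded = expand_vocabulary(["hello", "world"], str.split, toy_corpus)
```

In this toy run the two Thai tokens that occur twice are imported, while the token that occurs once is pruned. A real pipeline would also need to initialize embeddings for the new tokens, which this sketch leaves out.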

Key Findings

Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.

Numbers: Thai token ratio improved from 9.09 to 1.87 (SeaLLM's tokenizer, Table 1)

Practical Use: Add language-specific tokens before pretraining to fit more non‑Latin text into the same context window and improve model utility in those languages.

Evidence Ref: Table 1 (Vocabulary compression ratios)
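To make the token-ratio numbers concrete, the sketch below computes a compression ratio and the resulting context-window gain. It assumes the ratio is defined as target-language token count divided by the token count of parallel English text; the summary does not spell out that definition, so treat it as an assumption. The byte-level tokenizer and the function names are hypothetical stand-ins.

```python
def compression_ratio(tokenize, target_texts, english_texts):
    """Tokens used for target-language texts per token of parallel English."""
    target_tokens = sum(len(tokenize(text)) for text in target_texts)
    english_tokens = sum(len(tokenize(text)) for text in english_texts)
    return target_tokens / english_tokens


def context_gain(ratio_before, ratio_after):
    """How many times more target-language text fits in a fixed token budget."""
    return ratio_before / ratio_after


def byte_tokenize(text):
    """Toy tokenizer: one token per UTF-8 byte (hypothetical stand-in)."""
    return list(text.encode("utf-8"))


# Thai characters take 3 bytes each in UTF-8, so a byte-level tokenizer
# inflates Thai relative to ASCII English.
toy_ratio = compression_ratio(byte_tokenize, ["สวัสดี"], ["hello"])

# Using the Thai ratios reported in the summary (9.09 before, 1.87 after).
thai_gain = context_gain(9.09, 1.87)
print(f"toy ratio: {toy_ratio:.2f}, Thai context gain: {thai_gain:.2f}x")
```

Under that definition, moving from 9.09 to 1.87 would let roughly 4.9 times as much Thai text fit into the same context window.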

SeaLLM-7B-v2.5, a 7B model, is competitive with ChatGPT-3.5 on multilingual world knowledge.

Numbers: M3Exam English: 76.87 (SeaLLM-7B-v2.5) vs 75.46 (ChatGPT-3.5), Table 2

Practical Use: A tuned 7B model can replace larger or closed models for many SEA-language knowledge tasks, reducing cost.

Evidence Ref: Table 2 (M3Exam / MMLU comparisons)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
M3Exam (English)	SeaLLM-7B-v2.5 76.87	ChatGPT-3.5 75.46	+1.41	M3Exam (Eng)	Table 2 shows M3Exam scores across models	Table 2
M3Exam (Thai)	SeaLLM-7B-v2.5 46.86	ChatGPT-3.5 37.41	+9.45	M3Exam (Tha)	Table 2 shows larger gains in non-Latin languages	Table 2
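The deltas in the table can be re-derived directly from the reported scores. The short check below does that and also computes the relative gain over the baseline, which the table does not report; the helper name `summarize` is hypothetical.

```python
# Scores copied from the table above: (SeaLLM-7B-v2.5, ChatGPT-3.5).
results = {
    "M3Exam (English)": (76.87, 75.46),
    "M3Exam (Thai)": (46.86, 37.41),
}


def summarize(scores):
    """Return {metric: (absolute delta, relative gain in percent)}."""
    summary = {}
    for metric, (model, baseline) in scores.items():
        delta = round(model - baseline, 2)
        relative = round(100 * (model - baseline) / baseline, 1)
        summary[metric] = (delta, relative)
    return summary


summary = summarize(results)
for metric, (delta, relative) in summary.items():
    print(f"{metric}: {delta:+.2f} points ({relative:+.1f}% relative)")
```

The relative view makes the gap between the two rows easier to see: the Thai gain is a much larger share of the baseline score than the English gain.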

What To Try In 7 Days

Test SeaLLM-7B-v2/v2.5 on your Thai/Khmer/Burmese user flows to measure user-facing quality gains.

Replace heavy English-only pipelines with SeaLLM tokenizers when ingesting non‑Latin SEA text to reduce token costs.

Use Sea-bench examples to augment your internal multilingual evaluation set and find failure modes quickly.
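For the pilot steps above, a small per-language scoring helper is usually enough to get started. The sketch below is one possible shape, assuming you already have graded outputs as (language, is_correct) pairs; the records shown are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """records: iterable of (language, is_correct) -> {language: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {language: correct[language] / totals[language] for language in totals}


def compare(candidate, baseline):
    """Per-language accuracy difference (candidate minus baseline)."""
    return {
        language: candidate[language] - baseline[language]
        for language in candidate
        if language in baseline
    }


# Placeholder grades for illustration only.
candidate_records = [("th", True), ("th", True), ("km", True), ("km", False)]
baseline_records = [("th", True), ("th", False), ("km", False), ("km", False)]

gaps = compare(
    accuracy_by_language(candidate_records),
    accuracy_by_language(baseline_records),
)
```

Breaking results out by language matters here because the summary reports uneven quality across languages, so an aggregate score could hide a regression in one of them.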

Optimization Features

Token Efficiency

reduced tokenization cost for non-Latin scripts (see Table 1)

Model Optimization

vocabulary expansion to add language-specific tokens

Training Optimization

SFT, self-preferencing direct preference optimization (no external RLHF)
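As background for the alignment step, the sketch below shows the standard per-pair direct preference optimization (DPO) objective. It is the generic DPO loss, not the SeaLLM training code: the summary does not describe how the self-preference pairs are generated, so that part is omitted, and the `beta` value is an assumed default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    # How much more the policy favors the chosen response than the
    # reference does, relative to the rejected response.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) computed as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With identical policy and reference the margin is zero, so loss = log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy shifts probability toward the chosen response, loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss only needs log-probabilities from the policy and a reference model, which is why this style of alignment avoids training a separate reward model.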

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/DAMO-NLP-SG/SeaLLMs

Data URLs

https://huggingface.co/datasets (RedPajama, CC-News, Wikipedia, CommonCrawl)

Flores-200 (translation test sets)

Risks & Boundaries

Limitations

Covers 9 common SEA languages but omits many others (e.g., Javanese, Tamil).

Models still exhibit moderate hallucination and degeneration for some languages (Burmese, Lao).

When Not To Use

When you need coverage for SEA languages not included in the nine supported languages.

If absolute hallucination-free output is required in low-resource languages.

Failure Modes

Hallucination and response degeneration in certain low-resource languages (noted for Burmese and Lao).

Residual tokenization inefficiency if tokenizer extension is not applied for older backbones.

Core Entities

Models

SeaLLM-7B-v1, SeaLLM-13B-v1, SeaLLM-7B-v2, SeaLLM-7B-v2.5, Llama-2-7B, Llama-2-13B, Mistral-7B, Gemma-7B

Metrics

chrF++ (translation), compression ratio (tokenization), Accuracy, MATH score, MT-bench score

Datasets

Sea-bench, M3Exam, MMLU, GSM8K, MATH, Flores-200, RedPajama, CommonCrawl, CC-News, Wikipedia

Benchmarks

Sea-bench, MT-bench, M3Exam, MMLU, GSM8K, MATH, Flores-200

