Overview
The paper gives concrete evaluation numbers across multiple benchmarks and languages, but some datasets and full training artifacts are not fully detailed, so practical adoption needs in-house validation.
Citations: 7
Evidence Strength: 0.75
Confidence: 0.70
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.
Summary TLDR
SeaLLMs are a family of LLMs adapted to Southeast Asian (SEA) languages. The team extended tokenizers for non‑Latin scripts, continued pretraining from popular backbones, and applied multilingual SFT plus a self‑preference alignment step. Evaluations (Sea-bench, M3Exam, GSM8K/MATH, Flores-200) show strong gains over open models and ChatGPT-3.5 on low-resource, non‑Latin SEA languages, while the models remain compact (7B–13B).
Problem Statement
Most large models favor high-resource languages. Non‑Latin SEA languages suffer high tokenization cost and data scarcity, causing worse accuracy and instruction following. There was also no assistant-style multilingual benchmark covering these languages.
Main Contribution
SeaLLM models (7B and 13B) specialized for SEA languages via tokenizer expansion, continued pretraining, multilingual SFT, and self‑preference alignment.
A vocabulary expansion algorithm that imports tokens from a rich multilingual tokenizer (NLLB) and prunes rare tokens to reduce tokenization bloat for non‑Latin scripts.
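The summary does not give the details of the expansion algorithm, so the following is only a minimal sketch of the general idea (import tokens from a richer donor tokenizer, then prune rare ones). The names `select_new_tokens`, `toy_donor_tokenize`, and the `min_count` threshold are hypothetical and are not taken from the paper; the whitespace tokenizer is a toy stand-in for a real multilingual tokenizer such as NLLB's.

```python
from collections import Counter


def select_new_tokens(base_vocab, donor_tokenize, corpus, min_count=2):
    """Pick donor tokens that are missing from base_vocab and not rare.

    base_vocab: set of tokens the backbone tokenizer already has.
    donor_tokenize: callable mapping a string to a list of donor tokens.
    corpus: iterable of target-language strings.
    min_count: assumed pruning threshold; rarer tokens are dropped.
    """
    counts = Counter()
    for text in corpus:
        counts.update(donor_tokenize(text))
    # most_common() orders by frequency, so frequent tokens come first.
    return [
        token
        for token, count in counts.most_common()
        if token not in base_vocab and count >= min_count
    ]


def toy_donor_tokenize(text):
    # Toy stand-in: whitespace split. A real donor tokenizer emits subwords.
    return text.split()


base_vocab = {"hello", "world"}
corpus = ["hello สวัสดี world", "สวัสดี ครับ", "สวัสดี hello ខ្ញុំ"]
new_tokens = select_new_tokens(base_vocab, toy_donor_tokenize, corpus)
print(new_tokens)
```

In a real setting the selected tokens would be appended to the backbone tokenizer and the embedding matrix resized before continued pretraining; that step is outside this sketch.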
Key Findings
Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.
SeaLLM-7B-v2.5 is competitive with ChatGPT-3.5 on multilingual world knowledge (M3Exam) despite having only 7B parameters.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| M3Exam (English) | SeaLLM-7B-v2.5 76.87 | ChatGPT-3.5 75.46 | +1.41 | M3Exam (Eng) | Table 2 shows M3Exam scores across models | Table 2 |
| M3Exam (Thai) | SeaLLM-7B-v2.5 46.86 | ChatGPT-3.5 37.41 | +9.45 | M3Exam (Tha) | Table 2 shows larger gains in non-Latin languages | Table 2 |
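The Delta column can be re-derived from the Value and Baseline columns. The short check below recomputes the absolute deltas and, as an addition not reported in the table, the relative improvement over the baseline; the helper name `deltas` is hypothetical.

```python
def deltas(model_score, baseline_score):
    """Return (absolute delta, relative improvement in percent)."""
    absolute = round(model_score - baseline_score, 2)
    relative_pct = round(100 * (model_score - baseline_score) / baseline_score, 1)
    return absolute, relative_pct


# (label, SeaLLM-7B-v2.5 score, ChatGPT-3.5 score) copied from the table above.
rows = [
    ("M3Exam (English)", 76.87, 75.46),
    ("M3Exam (Thai)", 46.86, 37.41),
]

for label, model_score, baseline_score in rows:
    absolute, relative_pct = deltas(model_score, baseline_score)
    print(f"{label}: +{absolute:.2f} points ({relative_pct:.1f}% relative)")
```

The relative figures make the contrast clearer: the English gain is under 2% of the baseline score, while the Thai gain is roughly a quarter of it.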
What To Try In 7 Days
Test SeaLLM-7B-v2/v2.5 on your Thai/Khmer/Burmese user flows to measure user-facing quality gains.
Route non‑Latin SEA text through SeaLLM and its extended tokenizer instead of heavy English-centric pipelines to reduce token costs.
Use Sea-bench examples to augment your internal multilingual evaluation set and find failure modes quickly.
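One way to put a number on the token-cost item above is to compare tokens per character across tokenizers on your own text. This is a minimal sketch under stated assumptions: `tokens_per_char` is a hypothetical helper, and the two tokenizers are toy stand-ins (one token per UTF-8 byte versus one token per character), not the real SeaLLM or backbone tokenizers.

```python
def tokens_per_char(tokenize, texts):
    """Average number of tokens emitted per Unicode character."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_chars = sum(len(text) for text in texts)
    if total_chars == 0:
        raise ValueError("texts must contain at least one character")
    return total_tokens / total_chars


def byte_fallback_tokenize(text):
    # Stand-in for a tokenizer with no vocabulary for the script:
    # every UTF-8 byte becomes its own token.
    return list(text.encode("utf-8"))


def char_tokenize(text):
    # Stand-in for an extended tokenizer: one token per character.
    return list(text)


thai_samples = ["สวัสดี", "ขอบคุณ"]
before = tokens_per_char(byte_fallback_tokenize, thai_samples)
after = tokens_per_char(char_tokenize, thai_samples)
print(f"byte fallback: {before:.1f} tokens/char, extended: {after:.1f} tokens/char")
```

To run this against real models, pass each tokenizer's own tokenize function in place of the stand-ins and use a sample of production text per language.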
Optimization Features
Token Efficiency
Model Optimization
Training Optimization
Risks & Boundaries
Limitations
Covers 9 common SEA languages but omits many others (e.g., Javanese, Tamil).
Models still exhibit moderate hallucination and degeneration for some languages (Burmese, Lao).
When Not To Use
When you need coverage for SEA languages not included in the nine supported languages.
When strictly hallucination-free output is required in low-resource languages.
Failure Modes
Hallucination and response degeneration in certain low-resource languages (noted for Burmese and Lao).
Residual tokenization inefficiency when the tokenizer extension is not applied to older backbones.

