Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
7
Why It Matters For Business
SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.
Summary TLDR
SeaLLMs are a family of LLMs adapted to Southeast Asian (SEA) languages. The team extended tokenizers for non‑Latin scripts, continued pretraining from popular backbones, and applied multilingual supervised fine-tuning (SFT) and a self‑preference alignment step. Evaluations (Sea-bench, M3Exam, GSM8K/MATH, Flores-200) show strong gains over open models and ChatGPT-3.5 on low-resource, non‑Latin SEA languages, while the models remain compact (7B–13B).
Problem Statement
Most large models favor high-resource languages. Non‑Latin SEA languages suffer high tokenization cost and data scarcity, causing worse accuracy and instruction following. There was also no assistant-style multilingual benchmark covering these languages.
Main Contribution
SeaLLM models (7B and 13B) specialized for SEA languages via tokenizer expansion, continued pretraining, multilingual SFT, and self‑preference alignment.
A vocabulary expansion algorithm that imports tokens from a rich multilingual tokenizer (NLLB) and prunes rare tokens to reduce tokenization bloat for non‑Latin scripts.
Sea-bench: a multilingual, assistant-style test set (built with native linguists) plus GPT-4-based grading to evaluate performance across SEA languages and categories.
Evidence that compact models (7B) can match or beat larger or closed-source baselines on low-resource SEA languages and certain reasoning tasks.
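The vocabulary expansion idea (import tokens from a richer donor tokenizer, then prune rare ones) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `expand_vocabulary`, the `min_count` threshold, and the whitespace-splitting donor tokenizer are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def expand_vocabulary(
    base_vocab: Set[str],
    donor_tokenize: Callable[[str], List[str]],
    corpus: Iterable[str],
    min_count: int = 2,
) -> Set[str]:
    """Return base_vocab plus donor tokens seen at least min_count times in corpus."""
    # Count how often the donor tokenizer emits each token on target-language text.
    counts = Counter()
    for text in corpus:
        counts.update(donor_tokenize(text))

    # Import only tokens that are new to the base vocabulary and not rare.
    imported = {
        tok for tok, n in counts.items()
        if tok not in base_vocab and n >= min_count
    }
    return base_vocab | imported


# Hypothetical stand-in for a multilingual donor tokenizer: whitespace splitting.
def toy_donor_tokenize(text: str) -> List[str]:
    return text.split()


base = {"hello", "world"}
corpus = ["สวัสดี ครับ", "สวัสดี world", "ขอบคุณ ครับ"]
expanded = expand_vocabulary(base, toy_donor_tokenize, corpus, min_count=2)
print(sorted(expanded - base))  # the tokens that were imported
```

In a real setup the new tokens would also need embedding rows in the model, which is one reason continued pretraining follows the expansion step.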
Key Findings
Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.
SeaLLM-7B-v2.5 is competitive with GPT-3.5 on multilingual world knowledge at its scale.
Large gains versus ChatGPT-3.5 on low-resource non‑Latin languages.
Strong math and reasoning performance after targeted SFT and scaling of synthetic data.
Sea-bench created and used with GPT-4 as a judge for assistant-style evaluation.
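When a judge model assigns a score to each response, results are usually reported per language and category. The sketch below shows one way to aggregate such scores. The record fields (`lang`, `category`, `score`) and the numbers are made-up placeholders, not Sea-bench's actual schema or results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def aggregate_scores(records: List[dict]) -> Dict[Tuple[str, str], float]:
    """Average judge scores per (language, category) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["lang"], r["category"])].append(r["score"])
    return {key: mean(vals) for key, vals in buckets.items()}


# Placeholder records; the scores are illustrative only.
records = [
    {"lang": "th", "category": "math", "score": 6},
    {"lang": "th", "category": "math", "score": 8},
    {"lang": "km", "category": "safety", "score": 5},
]
summary = aggregate_scores(records)
for (lang, category), avg in sorted(summary.items()):
    print(lang, category, avg)
```

Keeping the per-language breakdown makes it easier to spot languages where the judge itself may be unreliable.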
Results
M3Exam (English)
M3Exam (Thai)
MT-bench (English assistant score)
GSM8K (math)
MATH (math benchmark)
Who Should Care
What To Try In 7 Days
Test SeaLLM-7B-v2/v2.5 on your Thai/Khmer/Burmese user flows to measure user-facing quality gains.
Swap English-centric tokenizers for the SeaLLM tokenizer when ingesting non‑Latin SEA text to reduce token costs.
Use Sea-bench examples to augment your internal multilingual evaluation set and find failure modes quickly.
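Before switching tokenizers, it helps to measure how many tokens each one produces on your own text. The sketch below compares two tokenizers passed in as plain callables. The byte-level and character-level tokenizers are toy stand-ins (assumptions for illustration); in practice you would pass in the encode functions of the real tokenizers you are comparing.

```python
from typing import Callable, List


def token_ratio(
    text: str,
    tokenize_a: Callable[[str], List],
    tokenize_b: Callable[[str], List],
) -> float:
    """Tokens produced by tokenizer A divided by tokens produced by tokenizer B."""
    n_b = len(tokenize_b(text))
    if n_b == 0:
        raise ValueError("tokenizer B produced no tokens")
    return len(tokenize_a(text)) / n_b


# Toy stand-in: one token per UTF-8 byte.
def byte_tokens(text: str) -> List[int]:
    return list(text.encode("utf-8"))


# Toy stand-in: one token per Unicode code point.
def char_tokens(text: str) -> List[str]:
    return list(text)


thai = "สวัสดี"
english = "hello"
thai_ratio = token_ratio(thai, byte_tokens, char_tokens)
english_ratio = token_ratio(english, byte_tokens, char_tokens)
print(thai_ratio, english_ratio)
```

Thai code points take three bytes each in UTF-8 while ASCII letters take one, so the byte-level stand-in is three times more expensive on the Thai string and no more expensive on the English one. This mirrors, in miniature, why non‑Latin scripts inflate token counts under tokenizers that fall back to bytes.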
Optimization Features
Token Efficiency
- reduced tokenization cost for non-Latin scripts (see Table 1)
Model Optimization
- vocabulary expansion to add language-specific tokens
Training Optimization
- SFT
- self-preferencing direct preference optimization (no external RLHF)
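For reference, the standard direct preference optimization (DPO) loss for a single preference pair can be written in a few lines. This sketch covers only the generic loss. How SeaLLMs build preference pairs from the model's own outputs is not described in this summary, and the `beta` value here is an arbitrary assumption.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written as softplus(-margin) in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log(2).
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the chosen response: the loss drops.
loss_better = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(loss_neutral, loss_better)
```

The loss only needs log-probabilities from the policy and a frozen reference model, which is why no separate reward model or external RLHF loop is required.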
Reproducibility
Data Urls
- https://huggingface.co/datasets (RedPajama, CC-News, Wikipedia, CommonCrawl)
- Flores-200 (translation test sets)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Covers 9 common SEA languages but omits many others (e.g., Javanese, Tamil).
- Models still exhibit moderate hallucination and degeneration for some languages (Burmese, Lao).
- Sea-bench uses GPT-4 as judge, which may carry judge bias and tokenization blind spots for non-Latin text.
When Not To Use
- When you need coverage for SEA languages not included in the nine supported languages.
- If absolute hallucination-free output is required in low-resource languages.
- Where regulatory or privacy constraints forbid use of models trained on web-scraped corpora without full provenance.
Failure Modes
- Hallucination and response degeneration in certain low-resource languages (noted for Burmese and Lao).
- Residual tokenization inefficiency if tokenizer extension is not applied for older backbones.
- Possible judge bias from GPT-4 in Sea-bench evaluations for non‑Latin scripts.
Core Entities
Models
- SeaLLM-7B-v1
- SeaLLM-13B-v1
- SeaLLM-7B-v2
- SeaLLM-7B-v2.5
- Llama-2-7B
- Llama-2-13B
- Mistral-7B
- Gemma-7B
Metrics
- chrF++ (translation)
- compression ratio (tokenization)
- Accuracy
- MATH score
- MT-bench score
Datasets
- Sea-bench
- M3Exam
- MMLU
- GSM8K
- MATH
- Flores-200
- RedPajama
- CommonCrawl
- CC-News
- Wikipedia
Benchmarks
- Sea-bench
- MT-bench
- M3Exam
- MMLU
- GSM8K
- MATH
- Flores-200

