Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
7
Why It Matters For Business
Typhoon gives companies a ready-to-use open-source Thai LLM whose tokenizer needs about 2.6× fewer tokens for Thai text than GPT-4's, which lowers token costs. It also outperforms other open-source Thai models on exams and many Thai tasks, reducing engineering time compared with building a Thai model from scratch.
Summary TLDR
Typhoon is a 7-billion-parameter LLM adapted from Mistral-7B through continued training on cleaned Thai web data, mixed with English data to limit catastrophic forgetting. The team built ThaiExam (a multi-exam multiple-choice benchmark) and a set of Thai instruction datasets. Typhoon outperforms other open-source Thai models on Thai exams, comes close to GPT-3.5 on several Thai tasks after instruction tuning, and uses a tokenizer that is 2.62× more token-efficient than GPT-4's on Thai text. Model weights are available under Apache-2.0.
Problem Statement
Thai is under-represented in standard pretraining corpora (e.g., <0.5% of Common Crawl). Generic and multilingual LLMs can miss Thai-specific facts, style, and cultural norms. The paper asks two questions: can a strong English-centric LLM be adapted to Thai efficiently, and how can Thai knowledge be measured reliably?
Main Contribution
Typhoon-7B: a Thai-focused 7B LLM adapted from Mistral-7B with continued pretraining on cleaned Thai+English data.
ThaiExam: a new benchmark assembled from Thai national and professional exams to measure Thai knowledge.
Typhoon-7B-Instruct: an instruction-tuned variant trained on translated and self-instruct Thai data, evaluated with an LLM-as-judge setup and on standard NLP tasks.
Tokenizer and data-pipeline work that yields a 2.62× gain in token efficiency on Thai text and a cleaned 3 TB Thai text corpus.
Key Findings
Typhoon outperforms the other open-source Thai LLMs on the evaluated Thai benchmarks.
Typhoon's tokenizer needs 2.62× fewer tokens than GPT-4's tokenizer on Thai text.
Instruction tuning using translated and self-instruct Thai data yields competitive instruction-following.
Typhoon performs well on standard NLP tasks in zero-shot settings.
LoRA-based continued training on embeddings and the LM head was effective for adaptation.
Results
Accuracy
Tokenizer efficiency (relative to GPT-4)
Instruction-following win-rate (vs GPT-3.5)
Machine translation BLEU / chrF (En→Th)
Question-answering XQuAD F1 (0-shot / 1-shot)
Who Should Care
What To Try In 7 Days
Download Typhoon-7B from HuggingFace and run a quick QA/translation smoke test.
Measure token counts and cost with Typhoon tokenizer vs your current model on sample Thai traffic.
Fine-tune Typhoon with a small in-house Thai instruction set via LoRA for domain-specific responses.
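The token-count comparison in the second step can be sketched without committing to any particular library. The sketch below assumes each tokenizer is wrapped as a plain callable from a string to a sequence of tokens. The function name `compare_token_counts` and the two stand-in tokenizers are hypothetical and only illustrate the bookkeeping; in practice you would wrap the Typhoon tokenizer and your current model's tokenizer in the same way and feed in real Thai traffic.

```python
from typing import Callable, Dict, List, Sequence

# A tokenizer here is any callable mapping text to a sequence of tokens.
Tokenizer = Callable[[str], Sequence]


def compare_token_counts(
    texts: List[str],
    tokenizers: Dict[str, Tokenizer],
    baseline: str,
) -> Dict[str, Dict[str, float]]:
    """Count total tokens per tokenizer and report the ratio to a baseline.

    A ratio_vs_baseline above 1 means the baseline tokenizer needs that many
    times more tokens than this tokenizer for the same texts.
    """
    totals = {
        name: sum(len(tokenize(text)) for text in texts)
        for name, tokenize in tokenizers.items()
    }
    base_total = totals[baseline]
    return {
        name: {
            "total_tokens": total,
            "ratio_vs_baseline": base_total / total if total else float("inf"),
        }
        for name, total in totals.items()
    }


# Stand-in tokenizers for illustration only (assumptions, not real models).
def char_tokenizer(text: str) -> List[str]:
    """One token per character."""
    return list(text)


def chunk_tokenizer(text: str) -> List[str]:
    """One token per 3-character chunk."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]


if __name__ == "__main__":
    sample_traffic = ["สวัสดีครับ", "ขอบคุณมาก"]
    report = compare_token_counts(
        sample_traffic,
        {"current": char_tokenizer, "candidate": chunk_tokenizer},
        baseline="current",
    )
    for name, stats in report.items():
        print(name, stats)
```

Multiplying each total by the per-token price of the corresponding model gives a rough cost comparison. The ratio on your own traffic may differ from the 2.62× reported for the paper's evaluation text.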
Optimization Features
Token Efficiency
- Tokenizer yields 2.62× fewer tokens for Thai vs GPT-4 tokenizer
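As a worked example of what the 2.62× figure implies, the arithmetic below converts it into a relative token count. The symbols $N_{\text{GPT-4}}$ and $N_{\text{Typhoon}}$ are introduced here for illustration and denote the number of tokens each tokenizer produces for the same Thai text.

```latex
\[
\frac{N_{\text{Typhoon}}}{N_{\text{GPT-4}}} = \frac{1}{2.62} \approx 0.382,
\qquad
1 - \frac{1}{2.62} \approx 0.618 .
\]
```

So the same Thai text needs roughly 38% as many tokens, a reduction of about 62%. This carries over to cost only under the assumption that the per-token price is the same on both sides; with different prices, scale each token count by its own price before comparing.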
Model Optimization
- LoRA
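To give a sense of why LoRA is cheap, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original matrix W (d_out × d_in) and learns a low-rank update B·A, where B is d_out × r and A is r × d_in. The source does not state the rank or which matrices were adapted, so the rank of 16 and the 4096 × 4096 shape (chosen to match the hidden size commonly cited for Mistral-7B) are illustrative assumptions, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Illustrative values only: shape and rank are assumptions, not from the source.
    d_out, d_in, rank = 4096, 4096, 16
    full = full_params(d_out, d_in)
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"full fine-tuning: {full:,} parameters")
    print(f"LoRA (rank {rank}): {lora:,} parameters ({lora / full:.2%} of full)")
```

The trainable fraction grows linearly with the rank, so the rank is the main knob for trading adaptation capacity against memory.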
Training Optimization
- Mixed Thai/English 50/50 data to mitigate catastrophic forgetting
- Large batch sizes (2M tokens) stabilized training
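The 50/50 mixing above can be sketched as a sampler that draws each training document from the Thai or the English stream with equal probability. This is a minimal sketch, not the authors' pipeline: the name `mix_streams` is hypothetical, and it mixes at the document level, whereas the source does not say whether the 50/50 split refers to documents or tokens.

```python
import random
from typing import Iterable, Iterator, Tuple


def mix_streams(
    thai: Iterable[str],
    english: Iterable[str],
    thai_fraction: float = 0.5,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (language, document) pairs, picking Thai with probability thai_fraction.

    Order within each stream is preserved. Iteration stops as soon as the
    stream selected for the next draw is exhausted.
    """
    rng = random.Random(seed)  # local RNG so the mix is reproducible
    thai_it, english_it = iter(thai), iter(english)
    while True:
        if rng.random() < thai_fraction:
            language, stream = "th", thai_it
        else:
            language, stream = "en", english_it
        try:
            yield language, next(stream)
        except StopIteration:
            return


if __name__ == "__main__":
    thai_docs = [f"thai-doc-{i}" for i in range(5)]
    english_docs = [f"english-doc-{i}" for i in range(5)]
    for language, doc in mix_streams(thai_docs, english_docs):
        print(language, doc)
```

If the two corpora have very different average document lengths, a document-level 50/50 mix will not give a 50/50 token split; weighting the draw by expected tokens per document is one way to correct for that.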
Inference Optimization
- A modest vocabulary expansion with Thai subword tokens reduces token counts at inference
Reproducibility
License
- Apache-2.0
Open Source Status
- yes
Risks & Boundaries
Limitations
- May hallucinate or produce incorrect facts (not fully mitigated)
- Shows repetition in generated text in some cases
- Instruction-following is uneven: worse on some datasets like Thai OASST and Sea-bench
When Not To Use
- For high-stakes factual systems without extra verification
- If you require guaranteed safety filtering and RLHF-level alignment
Failure Modes
- Hallucination: plausible but wrong facts
- Degraded performance on out-of-distribution instructions or poorly translated inputs
- Repeating tokens/phrases in long outputs
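One way to monitor the repetition failure mode is to measure how many n-grams in an output duplicate an earlier n-gram. The sketch below is a simple heuristic, not a method from the paper; the function name `repeated_ngram_fraction` and the choice of trigrams are assumptions. Because Thai is written without spaces between words, the input should be a token list from the model's tokenizer (or characters) rather than a whitespace split.

```python
from collections import Counter
from typing import List


def repeated_ngram_fraction(tokens: List[str], n: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram in the same sequence.

    Returns 0.0 for sequences shorter than n. Values near 1.0 indicate that
    the output is mostly a repeated loop.
    """
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


if __name__ == "__main__":
    looping = ["a", "b", "c"] * 3
    varied = ["a", "b", "c", "d", "e"]
    print(repeated_ngram_fraction(looping))
    print(repeated_ngram_fraction(varied))
```

A threshold for flagging outputs has to be tuned on your own data, since some legitimate texts (lists, forms, poetry) repeat phrases by design.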
Core Entities
Models
- Typhoon-7B
- Typhoon-7B-Instruct
- Mistral-7B
- OpenThaiGPT-beta-7B
- WangChanGLM
- SeaLLM-7B
- SEA-LION-7B
- GPT-3.5-turbo-0613
- GPT-4-0613
- Llama2-13B
- XGLM
Metrics
- win-rate (LLM judge)
- Accuracy
- BLEU
- chrF
- ROUGE-1/2/L
- F1 (XQuAD)
- perplexity
- token efficiency
Datasets
- ThaiExam
- ONET
- IC
- TGAT
- TPAT-1
- A-Level
- Thai AlpacaEval
- Thai OASST
- Translated MT-Bench
- Sea-bench (Thai subset)
- M3Exam (Thai subset)
- XNLI
- XCOPA
- FLORES-200
- XLSum
- CrossSum
- XQuAD
Benchmarks
- ThaiExam
- M3Exam
- MT-Bench (translated)
- Sea-bench

