Typhoon: a 7B Thai-focused LLM that matches GPT-3.5 on many Thai tasks and tokenizes Thai 2.62× more efficiently

December 21, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

7

Authors

Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarnmongkol, Ruangsak Patomwong, Pathomporn Chokchainant, Kasima Tharnpipitchai

Why It Matters For Business

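Typhoon gives companies a ready-to-use open-source Thai LLM. Its tokenizer encodes Thai text in about 2.6× fewer tokens than GPT-4's tokenizer, which lowers token-based costs, and the model outperforms other open Thai models on exams and many Thai tasks. Both reduce engineering time compared with building a Thai model from scratch.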

Summary TLDR

Typhoon is a 7-billion-parameter LLM adapted from Mistral-7B and further trained on cleaned Thai web data plus English to avoid forgetting. The team built ThaiExam (a multi-exam multiple-choice benchmark) and a set of Thai instruction datasets. Typhoon outperforms other open-source Thai models on Thai exams, reaches near GPT-3.5 parity on several Thai tasks after instruction-tuning, and uses a tokenizer that is 2.62× more token-efficient for Thai text. Model weights are available under Apache-2.0.

Problem Statement

Thai is under-represented in standard pretraining corpora (e.g., <0.5% of Common Crawl). Generic and multilingual LLMs can miss Thai-specific facts, style, and cultural norms. The paper asks two questions: can a strong English-centric LLM be adapted to Thai efficiently, and how can Thai knowledge be measured reliably?

Main Contribution

Typhoon-7B: a Thai-focused 7B LLM adapted from Mistral-7B with continued pretraining on cleaned Thai+English data.

ThaiExam: a new benchmark assembled from Thai national and professional exams to measure Thai knowledge.

Instruction-tuned Typhoon-7B-Instruct using translated and self-instruct Thai data, evaluated via LLM-as-judge and standard NLP tasks.

Tokenizer and data pipeline work that produces 2.62× token efficiency on Thai and a cleaned 3 TB Thai text corpus.

Key Findings

Typhoon is the best open-source Thai LLM on evaluated Thai benchmarks.

Numbers: ThaiExam average 0.442 vs next best SeaLLM 0.366
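As a rough illustration of how a ThaiExam-style average could be computed, the sketch below scores multiple-choice predictions per exam and then takes the unweighted mean across exams. The macro-averaging scheme, the function names, and the toy records are assumptions for illustration; this is not the paper's evaluation code.

```python
from collections import defaultdict

def exam_accuracies(records):
    """records: iterable of (exam_name, predicted_choice, gold_choice)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for exam, pred, gold in records:
        total[exam] += 1
        correct[exam] += int(pred == gold)
    return {exam: correct[exam] / total[exam] for exam in total}

def macro_average(per_exam):
    """Unweighted mean over exams (an assumed averaging scheme)."""
    return sum(per_exam.values()) / len(per_exam)

# Toy, made-up predictions; not data from the paper.
toy = [
    ("ONET", "A", "A"), ("ONET", "B", "C"),
    ("IC", "D", "D"), ("IC", "A", "A"), ("IC", "B", "A"), ("IC", "C", "C"),
]
per_exam = exam_accuracies(toy)
average = macro_average(per_exam)
```

A macro average keeps a large exam from dominating the score; a micro average over all questions would weight exams by size instead.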

Typhoon's tokenizer is 2.62× more token-efficient than GPT-4's tokenizer on Thai text.

Numbers: Token efficiency 262% vs GPT-4 100% (2.62×)

Instruction tuning using translated and self-instruct Thai data yields competitive instruction-following.

Numbers: Win-rate vs GPT-3.5: AlpacaEval 49.05%, MT-Bench 55.52% (Table 4)
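For readers unfamiliar with LLM-as-judge win-rates, here is a minimal sketch of how such a percentage can be derived from per-prompt verdicts. The verdict labels and the choice to count a tie as half a win are assumptions; the source does not state how ties were handled.

```python
def win_rate(verdicts, tie_weight=0.5):
    """Percentage of judge verdicts won by the candidate model.

    verdicts: list of 'win', 'loss', or 'tie', from the candidate's view.
    tie_weight: credit given for a tie (assumed 0.5 here).
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    score = 0.0
    for verdict in verdicts:
        if verdict == "win":
            score += 1.0
        elif verdict == "tie":
            score += tie_weight
        elif verdict != "loss":
            raise ValueError(f"unknown verdict: {verdict!r}")
    return 100.0 * score / len(verdicts)

# Toy verdicts, made up for illustration.
toy_verdicts = ["win", "loss", "tie", "win"]
rate = win_rate(toy_verdicts)
```

Under this convention a value near 50% means the judge rates the two models about equally.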

Typhoon performs well on standard NLP tasks in zero-shot settings.

Numbers: FLORES-200 MT BLEU 31.14; XQuAD 0-shot F1 34.46

LoRA-based continued training on embeddings and the LM head was effective for adaptation.

Results

Accuracy (ThaiExam average)

Value: 0.442

Baseline: SeaLLM-7B 0.366

Tokenizer efficiency (relative to GPT-4)

Value: 262%

Baseline: GPT-4 100%

Instruction-following win-rate (vs GPT-3.5), AlpacaEval

Value: 49.05%

Baseline: GPT-3.5 reference (50% at parity)

Instruction-following win-rate (vs GPT-3.5), MT-Bench

Value: 55.52%

Baseline: GPT-3.5 reference (50% at parity)

Machine translation BLEU / chrF (En→Th)

Value: 31.14 / 46.62

Baseline: SeaLLM-7B 14.36 / 37.13

Question-answering XQuAD F1 (0-shot / 1-shot)

Value: 34.46 / 54.03

Baseline: SeaLLM-7B 20.89 / 47.95

Who Should Care

What To Try In 7 Days

Download Typhoon-7B from HuggingFace and run a quick QA/translation smoke test.

Measure token counts and cost with Typhoon tokenizer vs your current model on sample Thai traffic.

Fine-tune Typhoon with a small in-house Thai instruction set via LoRA for domain-specific responses.
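The token-count comparison in the second step can be sketched with a small harness that accepts any two tokenizer callables. The function names, the price parameter, and the stand-in byte and character tokenizers are hypothetical; in practice you would pass the encode functions of the real tokenizers you want to compare.

```python
def compare_token_counts(texts, tokenize_a, tokenize_b):
    """Return total tokens under each tokenizer and the ratio a / b."""
    total_a = sum(len(tokenize_a(t)) for t in texts)
    total_b = sum(len(tokenize_b(t)) for t in texts)
    if total_b == 0:
        raise ValueError("tokenizer b produced no tokens")
    return total_a, total_b, total_a / total_b

def estimated_cost(total_tokens, price_per_1k_tokens):
    """Token-based cost for a given (hypothetical) price per 1,000 tokens."""
    return total_tokens / 1000 * price_per_1k_tokens

# Stand-in tokenizers so the sketch runs without any model download.
def byte_tokenizer(text):
    return list(text.encode("utf-8"))

def char_tokenizer(text):
    return list(text)

samples = ["สวัสดี", "ขอบคุณ"]  # sample Thai strings
bytes_total, chars_total, ratio = compare_token_counts(
    samples, byte_tokenizer, char_tokenizer
)
```

A ratio above 1 means the first tokenizer spends more tokens on the same text, so under token-based pricing its cost scales up by the same factor.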

Optimization Features

Token Efficiency

  • Tokenizer encodes Thai text in about 2.62× fewer tokens than the GPT-4 tokenizer (roughly 38% of the token count)

Model Optimization

  • LoRA
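To give a sense of why LoRA is cheap, the sketch below compares the trainable parameters of a full weight matrix with those of a rank-r LoRA update, which factors the update into two matrices of shapes (d_out × r) and (r × d_in). The 4096 width matches Mistral-7B's hidden size; the rank of 16 is a hypothetical value, not one reported in the source.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full matrix vs. a rank-`rank` LoRA update."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora, lora / full

# One 4096 x 4096 projection with a hypothetical rank of 16.
full, lora, fraction = lora_param_counts(4096, 4096, 16)
```

For this single projection the LoRA update trains under 1% of the parameters that full fine-tuning would.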

Training Optimization

  • Mixed Thai/English 50/50 data to mitigate catastrophic forgetting
  • Large batch sizes (2M tokens) stabilized training
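One simple way to realize a 50/50 language mix is to pick the source stream at random for each training document. The sketch below assumes document-level sampling with a fixed seed; the source does not say whether the split was applied per document or per token, and all names here are illustrative.

```python
import random

def mix_streams(thai_docs, english_docs, thai_fraction=0.5, seed=0):
    """Yield (language, doc) pairs, picking Thai with probability thai_fraction.

    Stops as soon as the chosen stream is exhausted.
    """
    rng = random.Random(seed)
    thai_iter, english_iter = iter(thai_docs), iter(english_docs)
    while True:
        if rng.random() < thai_fraction:
            lang, source = "th", thai_iter
        else:
            lang, source = "en", english_iter
        try:
            yield lang, next(source)
        except StopIteration:
            return

# Toy document lists, made up for illustration.
thai = [f"th_doc_{i}" for i in range(4)]
english = [f"en_doc_{i}" for i in range(4)]
mixed = list(mix_streams(thai, english))
```

Sampling per document keeps the mix balanced in expectation; if documents differ greatly in length between languages, a token-weighted scheme would be closer to a true 50/50 token split.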

Inference Optimization

  • A small vocabulary expansion with Thai subword tokens reduces token counts

Reproducibility

License

  • Apache-2.0

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • May hallucinate or produce incorrect facts (not fully mitigated)
  • Shows repetition in generated text in some cases
  • Instruction-following is uneven: worse on some datasets like Thai OASST and Sea-bench

When Not To Use

  • For high-stakes factual systems without extra verification
  • If you require guaranteed safety filtering and RLHF-level alignment

Failure Modes

  • Hallucination: plausible but wrong facts
  • Degraded performance on out-of-distribution instructions or poorly translated inputs
  • Repeating tokens/phrases in long outputs
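The repetition failure mode can be monitored with a simple heuristic such as the fraction of repeated n-grams in an output. This is an illustrative check, not something described in the source; the function name and the default n=3 are assumptions, and any alert threshold would need tuning on your own data.

```python
from collections import Counter

def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(ngrams)

# Toy token sequences; Thai text has no spaces, so real inputs
# should come from a tokenizer rather than str.split().
looping = "a b c a b c a b c".split()
clean = "a b c d e".split()
looping_score = repeated_ngram_fraction(looping)
clean_score = repeated_ngram_fraction(clean)
```

Outputs with a high score can be flagged for regeneration or for a repetition penalty at decoding time.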

Core Entities

Models

  • Typhoon-7B
  • Typhoon-7B-Instruct
  • Mistral-7B
  • OpenThaiGPT-beta-7B
  • WangChanGLM
  • SeaLLM-7B
  • SEA-LION-7B
  • GPT-3.5-turbo-0613
  • GPT-4-0613
  • Llama2-13B
  • XGLM

Metrics

  • win-rate (LLM judge)
  • Accuracy
  • BLEU
  • chrF
  • ROUGE-1/2/L
  • F1 (XQuAD)
  • perplexity
  • token efficiency

Datasets

  • ThaiExam
  • ONET
  • IC
  • TGAT
  • TPAT-1
  • A-Level
  • Thai AlpacaEval
  • Thai OASST
  • Translated MT-Bench
  • Sea-bench (Thai subset)
  • M3Exam (Thai subset)
  • XNLI
  • XCOPA
  • FLORES-200
  • XLSum
  • CrossSum
  • XQuAD

Benchmarks

  • ThaiExam
  • M3Exam
  • MT-Bench (translated)
  • Sea-bench