First holistic Burmese benchmark (BURMESE-SAN) that tests LLMs on understanding, reasoning, and generation.

February 21, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

0

Authors

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

Links

Abstract / PDF

Why It Matters For Business

If you build Burmese-language products, expect off-the-shelf open-weight models to underperform commercial ones on many tasks. Region-specific fine-tuning and model choice matter more than parameter count, and conservative quantization can cut inference cost without large accuracy loss.

Summary TLDR

BURMESE-SAN is a 7-task, 3,920-sample benchmark built and validated by native speakers to measure LLM performance on Burmese across natural language understanding (NLU), reasoning (NLR), and generation (NLG). Evaluations of many open-weight and commercial models show that commercial models lead, that SEA regional fine-tuning helps on some tasks, that gains from scale are non-linear, and that conservative quantization can preserve performance.

Problem Statement

Burmese lacks a comprehensive, high-quality benchmark for testing modern LLMs. Existing datasets are scattered, machine-translated, or low quality, hiding real gaps in Burmese capability.

Main Contribution

Introduce BURMESE-SAN: a human-verified, multi-task Burmese benchmark covering QA, Sentiment Analysis, Toxicity Detection, Causal Reasoning, NLI, Abstractive Summarization, and Machine Translation (7 subtasks, 3,920 samples).

Describe a native-speaker, four-step curation pipeline: sampling/filtering, translation, text normalization, and label verification with native revision.

Evaluate many open-weight and commercial LLMs across model families and sizes, reporting normalized [0–100] scores and stability estimates.

Release benchmark and leaderboard for public use (dataset and code licensed CC-BY-SA 4.0).
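The normalized [0–100] scores mentioned above are not defined in this summary. As a rough illustration only, the sketch below uses one common convention (an assumption, not confirmed as the authors' method): rescale a raw metric so that random-chance performance maps to 0 and a perfect score maps to 100. The function name and the example values are hypothetical.

```python
def normalize_score(raw: float, chance: float, best: float = 1.0) -> float:
    """Rescale a raw metric so `chance` maps to 0 and `best` maps to 100.

    Assumption: this chance-corrected rescaling is an illustrative
    convention, not necessarily the formula used by BURMESE-SAN.
    """
    if best <= chance:
        raise ValueError("best must be greater than chance")
    scaled = (raw - chance) / (best - chance) * 100.0
    # Clamp so below-chance results do not produce negative scores.
    return max(0.0, min(100.0, scaled))


# Hypothetical example: a 4-option multiple-choice task has chance = 0.25.
print(normalize_score(0.625, chance=0.25))  # 50.0
```

Under this convention a model that only guesses scores near 0 regardless of how many answer options a task has, which makes scores comparable across tasks before they are averaged.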

Key Findings

Commercial models outperform open-weight models on Burmese overall.

Numbers: Top commercial 72.35% vs. top open-weight 54.68% (≈ +17.7 points)

Model scale helps but is not sufficient; architecture and tuning matter.

Numbers: Llama 3.3 (23.07%) → Llama 4 Maverick (51.49%) (+28.42 points); but the larger MoE Kimi K2 (1040B) underperforms smaller strong mo…

Southeast-Asia (SEA) regional fine-tuning yields task-dependent gains.

Numbers: Machine translation +19.29%; question answering +7.04% (reported gains for SEA-fine-tuned variants)

Careful quantization can retain most performance; aggressive 4-bit can hurt reasoning.

Burmese capability has advanced quickly across recent model generations.

Numbers: GPT-4o 51.61% → GPT-5 66.46% (+14.85 points); the SEA-LION series also shows steady generational gains
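The gaps quoted in these findings are absolute differences between normalized scores (percentage points), not relative improvements. A small check of the arithmetic, using only the score pairs reported above (the dictionary keys and helper name are illustrative):

```python
# Score pairs copied from the findings above: (lower score, higher score).
reported = {
    "top open-weight -> top commercial": (54.68, 72.35),
    "Llama 3.3 -> Llama 4 Maverick": (23.07, 51.49),
    "GPT-4o -> GPT-5": (51.61, 66.46),
}


def point_gap(before: float, after: float) -> float:
    """Absolute difference in normalized score, rounded to 2 decimals."""
    return round(after - before, 2)


gaps = {name: point_gap(b, a) for name, (b, a) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} points")
```

Expressed as relative change the numbers would look very different (for example, 23.07 to 51.49 is more than a doubling), so it helps to state which one is meant when quoting them.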

Results

Overall (MY) normalized score

Value: Gemini 2.5 Pro 72.35%

Baseline: Top open-weight 54.68% (ERNIE 4.5)

Machine Translation (MT) normalized score

Value: Gemini 2.5 Pro 90.22% (MT column)

Baseline: Many open models in the 40–80% range, depending on family

Question Answering (QA) normalized score

Value: Top ~80–85% for the best instruction/reasoning models (family dependent)

Who Should Care

What To Try In 7 Days

Run BURMESE-SAN locally on 1–3 candidate models to replicate reported gaps (use provided prompt templates).

If cost-sensitive, test DynFP8 or 8-bit quantization on your top open model and validate reasoning tasks carefully.

Try SEA-focused fine-tuning (small budget) on a Llama or Qwen base for MT and QA subsets to measure quick gains.

Agent Features

Frameworks

  • SEA-LION (regional tuning pipeline)
  • vLLM (inference defaults referenced)

Architectures

  • MoE
  • Transformer (instruction-tuned variants)
  • Multimodal VL variants (Gemma/Qwen VL)

Optimization Features

Infra Optimization

  • Recommendations to validate quantized weights before deployment

Model Optimization

  • MoE
  • Quantization methods evaluated: DynFP8 (dynamic FP8) and NVFP4 (NVIDIA 4-bit)

System Optimization

  • Use of model-specific default decoding settings and stable zero-shot evaluation

Training Optimization

  • SEA regional fine-tuning (task- and family-dependent gains)
  • Instruction tuning and reasoning-focused RL-based tuning discussed

Inference Optimization

  • Conservative quantization (DynFP8/8-bit) preserves performance for many tasks
  • Aggressive 4-bit (NVFP4) can degrade reasoning models
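One way to act on the quantization guidance above is to compare per-task scores before and after quantization and flag any task that drops by more than a chosen tolerance. The sketch below is a minimal, framework-free illustration: the function name, the 2-point tolerance, and all scores are hypothetical placeholders, not results from the paper.

```python
def flag_regressions(baseline: dict, quantized: dict, tolerance: float = 2.0) -> dict:
    """Return tasks whose score dropped by more than `tolerance` points."""
    drops = {}
    for task, base_score in baseline.items():
        if task not in quantized:
            raise KeyError(f"missing quantized score for task: {task}")
        drop = base_score - quantized[task]
        if drop > tolerance:
            drops[task] = round(drop, 2)
    return drops


# Placeholder numbers for illustration only; NOT results from the paper.
baseline = {"QA": 70.0, "NLI": 60.0, "Causal Reasoning": 65.0, "MT": 80.0}
fp8 = {"QA": 69.5, "NLI": 59.0, "Causal Reasoning": 64.5, "MT": 79.8}
fp4 = {"QA": 66.0, "NLI": 52.0, "Causal Reasoning": 55.0, "MT": 78.5}

print(flag_regressions(baseline, fp8))  # {}
print(flag_regressions(baseline, fp4))
# {'QA': 4.0, 'NLI': 8.0, 'Causal Reasoning': 10.0}
```

Checking each task separately matters here because a small drop in the aggregate can hide a large drop on the reasoning tasks.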

Reproducibility

License

  • CC-BY-SA 4.0

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses formal written Burmese prompts only — results may not generalize to colloquial or spoken Burmese.
  • Benchmark covers Standard Burmese (central dialect) only; other dialects are excluded.
  • Some datasets were adapted from English; residual translation artifacts can remain despite re-translation and QC.

When Not To Use

  • When you need to evaluate colloquial, dialectal, or spoken Burmese performance.
  • When your use case requires domain-specific Burmese variants not covered by BURMESE-SAN.
  • If you need end-to-end conversational or multimodal Burmese evaluations beyond single-turn prompts.

Failure Modes

  • Model may appear competent on formal prompts but fail on informal, code-switched, or dialectal inputs.
  • Aggressive quantization (4-bit NVFP4) can break reasoning and culturally sensitive judgments.
  • High aggregate scores mask per-task weaknesses (e.g., some models score well on MT but poorly on NLI/QA).
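To guard against the last failure mode, it can help to inspect per-task scores next to the aggregate. The following sketch flags tasks that fall well below a model's own mean; the helper name, the 15-point margin, and the scores are hypothetical placeholders rather than reported results.

```python
from statistics import mean


def weak_tasks(scores: dict, margin: float = 15.0) -> list:
    """List tasks scoring more than `margin` points below the model's mean."""
    average = mean(scores.values())
    return sorted(task for task, s in scores.items() if average - s > margin)


# Placeholder numbers for illustration only; NOT results from the paper.
scores = {"MT": 88.0, "QA": 75.0, "Sentiment": 72.0, "NLI": 40.0}

print(mean(scores.values()))  # 68.75
print(weak_tasks(scores))     # ['NLI']
```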

Core Entities

Models

  • Gemini 2.5 Pro
  • GPT-5
  • GPT-4.1
  • Gemma 3
  • Qwen 3
  • Qwen 2.5
  • Llama 4 Maverick
  • ERNIE 4.5
  • SEA-LION (Qwen/Gemma/Llama variants)
  • DeepSeek V3.1
  • Qwen 3 (Thinking) 235B MoE

Metrics

  • Accuracy
  • MetricX-24 (rescaled)
  • ROUGE-L F1 (rescaled)
  • normalized [0,100] aggregate score
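For readers unfamiliar with ROUGE-L F1, the sketch below computes it from the longest common subsequence (LCS) of two token lists. It is a plain textbook version, not the benchmark's scoring code: the rescaling the benchmark applies is not described here and is not reproduced, and tokenization is left to the caller because Burmese text is not space-delimited at the word level, so the choice of tokenizer is an assumption the caller has to make.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: list, candidate: list) -> float:
    """ROUGE-L F1 in [0, 1] for pre-tokenized reference and candidate."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# English tokens used only to keep the example easy to follow.
ref = "the cat sat on the mat".split()
cand = "the cat is on the mat".split()
print(round(rouge_l_f1(ref, cand), 4))  # 0.8333
```

Here the LCS is "the cat on the mat" (5 tokens), so precision and recall are both 5/6.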

Datasets

  • Belebele
  • GKLMIP-mya
  • myHateSpeech
  • Balanced COPA
  • myXNLI
  • XL-Sum
  • FLORES+

Benchmarks

  • BURMESE-SAN
  • SEA-HELM
  • SEA-LION
  • XL-Sum (source for AS)
  • FLORES+ (source for MT)

Context Entities

Models

  • Gemma 2
  • Gemma 3 VL
  • Llama 3 / 3.1 / 3.3
  • Mistral
  • Qwen 3 Next
  • Kimi K2 Instruct
  • DeepSeek

Metrics

  • Cohen's Kappa
  • Krippendorff's Alpha
  • joint annotator agreement
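As a reminder of what the first of these agreement metrics measures, here is a minimal sketch of Cohen's kappa for two annotators labeling the same items. The labels are made-up placeholders, not annotation data from the benchmark, and the function is a standard textbook formulation rather than the authors' code.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder labels for illustration only.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

In this example the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.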

Datasets

  • SEA-wide benchmarks (SEA-Crowd, SEA-HELM, SeaExam, SeaBench)
  • NLLB / XL-Sum / FLORES+ (multilingual sources)

Benchmarks

  • HELM
  • SEA-EVAL