First holistic Burmese benchmark (BURMESE-SAN) that tests LLMs on understanding, reasoning, and generation.

February 21, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Benchmark is practical and well-documented with native-speaker QA and agreement stats; results cover many widely used models. Evidence is strong for comparative claims but limited to formal prompts and Standard Burmese.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

License: CC-BY-SA 4.0

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build Burmese products, off-the-shelf open models underperform commercial ones on many tasks. Region-specific fine-tuning and model choice matter more than parameter count. Conservative quantization can cut inference cost without large accuracy loss.

Who Should Care

Summary TLDR

BURMESE-SAN is a 7-task, 3,920-sample benchmark built and validated by native speakers to measure LLM performance on Burmese across NLU, NLR, and NLG. Evaluations of many open and commercial models show that commercial models lead, that SEA regional fine-tuning yields gains, that scale effects are non-linear, and that conservative quantization can preserve performance.

Problem Statement

Burmese lacks a comprehensive, high-quality benchmark for testing modern LLMs. Existing datasets are scattered, machine-translated, or low quality, hiding real gaps in Burmese capability.

Main Contribution

Introduce BURMESE-SAN: a human-verified, multi-task Burmese benchmark covering QA, Sentiment Analysis, Toxicity Detection, Causal Reasoning, NLI, Abstractive Summarization, and Machine Translation (7 subtasks, 3,920 samples).

Describe a native-speaker, four-step curation pipeline: sampling/filtering, translation, text normalization, and label verification with native revision.

Key Findings

Commercial models outperform open-weight models on Burmese overall.

Numbers: Top commercial 72.35% vs top open 54.68% (≈ +17.7 percentage points)

Practical Use: If you need the best Burmese performance now, prefer commercial models (e.g., Gemini 2.5 Pro). For lower-cost deployments, expect a substantial gap unless you fine-tune carefully.

Evidence Ref: Tables 3–5; Finding #1

Model scale helps but is not sufficient; architecture and tuning matter.

Numbers: Llama 3.3 (23.07%) → Llama 4 Maverick (51.49%) (+28.42 percentage points); but larger MoE Kimi K2 (1040B) underperforms smaller strong mo

Practical Use: Do not assume more parameters alone will close Burmese gaps. Test different architectures and instruction/region tuning before scaling up.

Evidence Ref: Finding #2; Tables 3–5

Results

Metric: Overall (MY) normalized score
Value: Gemini 2.5 Pro 72.35%
Baseline: Top open-weight 54.68% (ERNIE 4.5)
Delta: +17.67 percentage points
Split / Dataset: BURMESE-SAN aggregate
Evidence: Tables 3–5; Finding #1
Evidence Ref: Tables 3–5

Metric: Machine Translation (MT) normalized score
Value: Gemini 2.5 Pro 90.22% (MT column)
Baseline: Many open models in the 40–80% range, depending on family
Delta: not stated
Split / Dataset: FLORES+ subsets in BURMESE-SAN
Evidence: Table 3 and Table 5 MT scores
Evidence Ref: Tables 3–5
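
The deltas reported here are differences between two percentage scores, so they are percentage points rather than relative improvements. A minimal Python sketch of the distinction, using only the overall scores and the Llama scores quoted in this summary (the function names are illustrative and do not come from the paper):

```python
def point_gap(score: float, baseline: float) -> float:
    """Absolute gap in percentage points between two percentage scores."""
    return round(score - baseline, 2)


def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, expressed in percent."""
    return round((score - baseline) / baseline * 100, 2)


# Overall normalized scores quoted in the Results section.
gemini_overall = 72.35    # Gemini 2.5 Pro
top_open_overall = 54.68  # ERNIE 4.5, top open-weight model

print(point_gap(gemini_overall, top_open_overall))      # 17.67
print(relative_gain(gemini_overall, top_open_overall))  # 32.32

# Llama 3.3 -> Llama 4 Maverick, from Finding #2.
print(point_gap(51.49, 23.07))                          # 28.42
```

Read this way, the 17.67-point gap corresponds to roughly a one-third relative advantage for the top commercial model over the top open-weight model.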

What To Try In 7 Days

Run BURMESE-SAN locally on 1–3 candidate models to replicate reported gaps (use provided prompt templates).

If cost-sensitive, test DynFP8 or 8-bit quantization on your top open model and validate reasoning tasks carefully.

Try SEA-focused fine-tuning (small budget) on a Llama or Qwen base for MT and QA subsets to measure quick gains.
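
A first-week comparison of candidate models can be organized as a small scoring loop. The sketch below is a hypothetical harness, not the benchmark's own evaluation code: the `Sample` shape, the exact-match scoring, and the toy models and tasks are all assumptions made so the example runs without downloading anything. In practice the `generate` callables would wrap real model calls and the samples would come from the released data and prompt templates.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sample shape: (prompt, gold_label).
Sample = Tuple[str, str]


def accuracy(generate: Callable[[str], str], samples: List[Sample]) -> float:
    """Percentage of samples whose normalized output equals the gold label."""
    if not samples:
        raise ValueError("no samples to score")
    hits = sum(generate(prompt).strip().lower() == gold.strip().lower()
               for prompt, gold in samples)
    return 100.0 * hits / len(samples)


def compare(models: Dict[str, Callable[[str], str]],
            tasks: Dict[str, List[Sample]]) -> Dict[str, Dict[str, float]]:
    """Score every candidate model on every task."""
    return {name: {task: accuracy(gen, samples) for task, samples in tasks.items()}
            for name, gen in models.items()}


# Toy stand-ins (invented for illustration) so the sketch is runnable as-is.
toy_tasks = {
    "sentiment": [("review 1", "positive"), ("review 2", "negative")],
    "nli": [("pair 1", "entailment"), ("pair 2", "neutral")],
}
answers = {"review 1": "positive", "review 2": "negative",
           "pair 1": "entailment", "pair 2": "neutral"}
toy_models = {
    "always_positive": lambda prompt: "positive",
    "oracle": lambda prompt: answers[prompt],
}

scores = compare(toy_models, toy_tasks)
for name, per_task in scores.items():
    print(name, per_task)
```

Keeping the model behind a plain callable makes it straightforward to swap a full-precision model for a quantized or fine-tuned one and rerun the same tasks.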

Agent Features

Frameworks
SEA-LION (regional tuning pipeline); vLLM (inference defaults referenced)
Architectures
MoE; Transformer (instruction-tuned variants); Multimodal VL variants (Gemma/Qwen VL)

Optimization Features

Infra Optimization
Recommendations to validate quantized weights before deployment
Model Optimization
MoE; Quantization methods evaluated: DynFP8 (dynamic FP8) and NVFP4 (NVIDIA 4-bit)
System Optimization
Use of model-specific default decoding settings and stable zero-shot evaluation
Training Optimization
SEA regional fine-tuning (task- and family-dependent gains); Instruction tuning and reasoning-focused RL-based tuning discussed
Inference Optimization
Conservative quantization (DynFP8/8-bit) preserves performance for many tasks; Aggressive 4-bit (NVFP4) can degrade reasoning models
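
The recommendation to validate quantized weights before deployment can be turned into a simple per-task regression gate. The following is a minimal sketch under stated assumptions: the task names, the 1-point tolerance, and every score in it are invented for illustration and are not results from the paper.

```python
from typing import Dict, List


def quantization_regressions(baseline: Dict[str, float],
                             quantized: Dict[str, float],
                             max_drop: float = 1.0) -> List[str]:
    """Return tasks whose score fell by more than max_drop points."""
    missing = set(baseline) - set(quantized)
    if missing:
        raise KeyError(f"quantized run is missing tasks: {sorted(missing)}")
    return sorted(task for task, base in baseline.items()
                  if base - quantized[task] > max_drop)


# Made-up scores for illustration only.
full_precision = {"qa": 60.0, "causal_reasoning": 55.0, "mt": 80.0}
eight_bit = {"qa": 59.6, "causal_reasoning": 54.5, "mt": 80.1}
four_bit = {"qa": 58.5, "causal_reasoning": 47.0, "mt": 79.2}

print(quantization_regressions(full_precision, eight_bit))  # []
print(quantization_regressions(full_precision, four_bit))   # ['causal_reasoning', 'qa']
```

Checking each task separately matters because an aggregate score can hide a large drop on reasoning tasks behind stable translation or classification scores.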

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: CC-BY-SA 4.0

Data URLs

https://leaderboard.sea-lion.ai/detailed/MY
Dataset links cited in paper (FLORES+, XL-Sum, Belebele, myHateSpeech, myXNLI)

Risks & Boundaries

Limitations

Evaluation uses formal written Burmese prompts only, so results may not generalize to colloquial or spoken Burmese.

Benchmark covers Standard Burmese (central dialect) only; other dialects are excluded.

When Not To Use

When you need to evaluate colloquial, dialectal, or spoken Burmese performance.

When your use case requires domain-specific Burmese variants not covered by BURMESE-SAN.

Failure Modes

Model may appear competent on formal prompts but fail on informal, code-switched, or dialectal inputs.

Aggressive quantization (4-bit NVFP4) can break reasoning and culturally sensitive judgments.

Core Entities

Models

Gemini 2.5 Pro, GPT-5, GPT-4.1, Gemma 3, Qwen 3, Qwen 2.5, Llama 4 Maverick, ERNIE 4.5, SEA-LION (Qwen/Gemma/Llama variants), DeepSeek V3.1, Qwen 3 (Thinking) 235B MoE

Metrics

Accuracy; MetricX-24 (rescaled); ROUGE-L F1 (rescaled); normalized [0,100] aggregate score
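
This summary does not give the formula behind the normalized [0,100] aggregate score. One common convention, assumed here and not confirmed by the source, is chance-corrected rescaling: a raw score at the random-guess level maps to 0, a perfect score maps to 100, and the aggregate is an unweighted mean over tasks. A hedged sketch with invented inputs:

```python
from statistics import mean
from typing import Dict


def normalize(raw: float, chance: float, best: float = 100.0) -> float:
    """Rescale so chance maps to 0 and best maps to 100, clipped to [0, 100]."""
    if best <= chance:
        raise ValueError("best must exceed chance")
    scaled = 100.0 * (raw - chance) / (best - chance)
    return max(0.0, min(100.0, scaled))


def aggregate(normalized: Dict[str, float]) -> float:
    """Unweighted mean of per-task normalized scores (an assumed weighting)."""
    return mean(list(normalized.values()))


# Invented raw scores: a 3-way task has a 33.33% chance level, a binary task 50%.
per_task = {
    "nli": normalize(raw=66.67, chance=33.33),
    "causal_reasoning": normalize(raw=75.0, chance=50.0),
}
print(per_task)
print(aggregate(per_task))
```

Under this convention, a model that only guesses scores 0 on every classification task regardless of how many labels the task has, which makes scores comparable across tasks before averaging.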

Datasets

Belebele, GKLMIP-mya, myHateSpeech, Balanced COPA, myXNLI, XL-Sum, FLORES+

Benchmarks

BURMESE-SAN, SEA-HELM, SEA-LION, XL-Sum (source for AS), FLORES+ (source for MT)

Context Entities

Models

Gemma 2, Gemma 3 VL, Llama 3 / 3.1 / 3.3, Mistral, Qwen 3 Next, Kimi K2 Instruct, DeepSeek

Metrics

Cohen's Kappa, Krippendorff's Alpha, joint annotator agreement
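
For readers less familiar with agreement statistics, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that standard definition in plain Python; the annotator labels are invented for illustration and are not taken from the benchmark's annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for six items.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.667
```

Here the annotators agree on 5 of 6 items (p_o ≈ 0.833) while chance agreement is 0.5, giving kappa ≈ 0.667.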

Datasets

SEA-wide benchmarks (SEA-Crowd, SEAHELM, SeaExam, SeaBench); NLLB / XL-Sum / FLORES+ (multi-lingual sources)

Benchmarks

HELM, SEA-EVAL