First holistic Burmese benchmark (BURMESE-SAN) that tests LLMs on understanding, reasoning, and generation.

Overview

Decision Snapshot: Ready For Pilot

Benchmark is practical and well-documented with native-speaker QA and agreement stats; results cover many widely used models. Evidence is strong for comparative claims but limited to formal prompts and Standard Burmese.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/3

Reproducibility

Status: Code + data available

Open source: Partial

License: CC-BY-SA 4.0

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 40%

Authors

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you build Burmese products, expect off-the-shelf open models to underperform commercial ones on many tasks. Region-specific fine-tuning and model choice matter more than parameter count. Conservative quantization can cut inference cost without large accuracy loss.

Who Should Care

CTO Product Manager ML Engineer Founder

Summary TLDR

BURMESE-SAN is a 7-task, 3,920-sample benchmark built and validated by native speakers to measure LLM performance on Burmese across NLU, NLR, and NLG. Evaluations of many open and commercial models show that commercial models lead, that SEA regional fine-tuning brings gains, that scale effects are non-linear, and that conservative quantization can preserve performance.

Problem Statement

Burmese lacks a comprehensive, high-quality benchmark for testing modern LLMs. Existing datasets are scattered, machine-translated, or low quality, hiding real gaps in Burmese capability.

Main Contribution

Introduce BURMESE-SAN: a human-verified, multi-task Burmese benchmark covering QA, Sentiment Analysis, Toxicity Detection, Causal Reasoning, NLI, Abstractive Summarization, and Machine Translation (7 subtasks, 3,920 samples).

Describe a native-speaker, four-step curation pipeline: sampling/filtering, translation, text normalization, and label verification with native revision.

Key Findings

Commercial models outperform open-weight models on Burmese overall.

Numbers: Top commercial 72.35% vs top open 54.68% (+17.67 percentage points)

Practical Use: If you need the best Burmese performance now, prefer commercial models (e.g., Gemini 2.5 Pro). For lower-cost deployments, expect a substantial gap unless you fine-tune carefully.

Evidence Ref: Tables 3–5; Finding #1

Model scale helps but is not sufficient; architecture and tuning matter.

Numbers: Llama 3.3 (23.07%) → Llama 4 Maverick (51.49%) (+28.42 percentage points); but larger MoE Kimi K2 (1040B) underperforms smaller strong mo

Practical Use: Do not assume that more parameters alone will close Burmese gaps. Test different architectures and instruction/region tuning before scaling up.

Evidence Ref: Finding #2; Tables 3–5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall (MY) normalized score	Gemini 2.5 Pro 72.35%	Top open-weight 54.68% (ERNIE 4.5)	+17.67%	BURMESE-SAN aggregate	Tables 3–5; Finding #1	Tables 3-5
Machine Translation (MT) normalized score	Gemini 2.5 Pro 90.22% (MT column)	Many open models in 40–80% range depending on family	—	FLORES+ subsets in BURMESE-SAN	Table 3 and Table 5 MT scores	Tables 3-5
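
The deltas above are differences between normalized scores, so they read most naturally as percentage points rather than relative percentages. A minimal Python sketch of that arithmetic, using only figures quoted in the findings; the function names are illustrative and not taken from the paper or its code:

```python
def delta_points(score: float, baseline: float) -> float:
    """Absolute gap between two normalized scores, in percentage points."""
    return round(score - baseline, 2)


def relative_gain(score: float, baseline: float) -> float:
    """The same gap expressed as a percentage of the baseline score."""
    return round(100.0 * (score - baseline) / baseline, 1)


# Figures quoted in the findings above.
commercial_vs_open = delta_points(72.35, 54.68)  # Gemini 2.5 Pro vs ERNIE 4.5
llama_generations = delta_points(51.49, 23.07)   # Llama 4 Maverick vs Llama 3.3

print(commercial_vs_open, llama_generations)
print(relative_gain(72.35, 54.68))
```

The first gap is 17.67 points, which corresponds to a relative gain of roughly 32% over the top open-weight score; keeping the two views separate avoids confusing one for the other.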

What To Try In 7 Days

Run BURMESE-SAN locally on 1–3 candidate models to replicate reported gaps (use provided prompt templates).

If cost-sensitive, test DynFP8 or 8-bit quantization on your top open model and validate reasoning tasks carefully.

Try SEA-focused fine-tuning (small budget) on a Llama or Qwen base for MT and QA subsets to measure quick gains.

Agent Features

Frameworks

SEA-LION (regional tuning pipeline), vLLM (inference defaults referenced)

Architectures

MoE, Transformer (instruction-tuned variants), Multimodal VL variants (Gemma/Qwen VL)

Optimization Features

Infra Optimization

Recommendations to validate quantized weights before deployment

Model Optimization

MoE

Quantization methods evaluated: DynFP8 (dynamic FP8) and NVFP4 (NVIDIA 4-bit)

System Optimization

Use of model-specific default decoding settings and stable zero-shot evaluation

Training Optimization

SEA regional fine-tuning (task- and family-dependent gains)

Instruction tuning and reasoning-focused RL-based tuning discussed

Inference Optimization

Conservative quantization (DynFP8/8-bit) preserves performance for many tasks

Aggressive 4-bit (NVFP4) can degrade reasoning models
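
One way to act on the quantization guidance above is to compare per-task scores before and after quantization and flag any task that drops by more than a chosen tolerance. The sketch below assumes you already have per-task scores from your own runs. The helper name, the 2-point tolerance, and every score in it are placeholders, not results from the paper.

```python
def flag_regressions(baseline: dict[str, float],
                     quantized: dict[str, float],
                     tolerance: float = 2.0) -> dict[str, float]:
    """Return tasks whose score dropped by more than `tolerance` points."""
    drops = {}
    for task, base_score in baseline.items():
        drop = base_score - quantized[task]
        if drop > tolerance:
            drops[task] = round(drop, 2)
    return drops


# Placeholder scores for illustration only; NOT results from the paper.
full_precision = {"QA": 60.0, "NLI": 55.0, "Causal Reasoning": 58.0, "MT": 80.0}
fp8_run = {"QA": 59.5, "NLI": 54.0, "Causal Reasoning": 57.2, "MT": 79.6}
fp4_run = {"QA": 57.0, "NLI": 49.0, "Causal Reasoning": 48.5, "MT": 78.9}

print(flag_regressions(full_precision, fp8_run))
print(flag_regressions(full_precision, fp4_run))
```

With these made-up numbers the 8-bit run passes cleanly while the 4-bit run is flagged on QA, NLI, and Causal Reasoning. Checking each task separately matters because an aggregate score can hide a drop that is concentrated in reasoning tasks.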

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: CC-BY-SA 4.0

Code URLs

https://github.com/aisingapore/SEA-HELM https://leaderboard.sea-lion.ai/detailed/MY

Data URLs

https://leaderboard.sea-lion.ai/detailed/MY

Dataset links cited in paper (FLORES+, XL-Sum, Belebele, myHateSpeech, myXNLI)

Risks & Boundaries

Limitations

Evaluation uses formal written Burmese prompts only, so results may not generalize to colloquial or spoken Burmese.

Benchmark covers Standard Burmese (central dialect) only; other dialects are excluded.

When Not To Use

When you need to evaluate colloquial, dialectal, or spoken Burmese performance.

When your use case requires domain-specific Burmese variants not covered by BURMESE-SAN.

Failure Modes

A model may appear competent on formal prompts but fail on informal, code-switched, or dialectal inputs.

Aggressive quantization (4-bit NVFP4) can break reasoning and culturally sensitive judgments.

Core Entities

Models

Gemini 2.5 Pro, GPT-5, GPT-4.1, Gemma 3, Qwen 3, Qwen 2.5, Llama 4 Maverick, ERNIE 4.5, SEA-LION (Qwen/Gemma/Llama variants), DeepSeek V3.1, Qwen 3 (Thinking) 235B MoE

Metrics

Accuracy; MetricX-24 (rescaled); ROUGE-L F1 (rescaled); normalized [0,100] aggregate score
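
This summary does not spell out how raw metrics are rescaled into the [0,100] aggregate. As a sketch under a stated assumption (baseline-adjusted rescaling, where a random-guess baseline maps to 0 and a perfect score to 100, followed by an unweighted mean), the idea could look like the Python below. The function names, baselines, and raw scores are all hypothetical.

```python
from statistics import mean


def normalize(raw: float, baseline: float, best: float = 100.0) -> float:
    """Rescale a raw score so `baseline` maps to 0 and `best` maps to 100.

    Assumed convention, not confirmed by the source; scores below the
    baseline are clipped to 0.
    """
    if best <= baseline:
        raise ValueError("best must exceed baseline")
    return max(0.0, (raw - baseline) / (best - baseline) * 100.0)


def aggregate(task_scores: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean of normalized scores; values are (raw, baseline)."""
    return mean(normalize(raw, base) for raw, base in task_scores.values())


# Hypothetical raw scores paired with assumed random-guess baselines
# (50 for a two-way choice, 33.3 for three-way NLI, 0 for generation).
example = {
    "Causal Reasoning": (75.0, 50.0),
    "NLI": (60.0, 33.3),
    "MT": (80.0, 0.0),
}
print(round(aggregate(example), 2))
```

Under this convention a model that only matches random guessing on a classification task contributes 0 to the aggregate, which keeps tasks with different numbers of answer options comparable.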

Datasets

Belebele, GKLMIP-mya, myHateSpeech, Balanced COPA, myXNLI, XL-Sum, FLORES+

Benchmarks

BURMESE-SAN, SEA-HELM, SEA-LION, XL-Sum (source for AS), FLORES+ (source for MT)

Context Entities

Models

Gemma 2, Gemma 3 VL, Llama 3 / 3.1 / 3.3, Mistral, Qwen 3 Next, Kimi K2 Instruct, DeepSeek

Metrics

Cohen's Kappa, Krippendorff's Alpha, joint annotator agreement
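
For readers unfamiliar with the agreement statistics listed here, Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance. The following is a minimal pure-Python sketch of the standard formula; the toy labels are invented for illustration and are not data from the benchmark.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only; not data from the benchmark.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. Krippendorff's alpha generalizes the same idea to more than two annotators and to missing labels.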

Datasets

SEA-wide benchmarks (SEA-Crowd, SEAHELM, SeaExam, SeaBench); NLLB / XL-Sum / FLORES+ (multi-lingual sources)

Benchmarks

HELM, SEA-EVAL

