Overview
The benchmark is practical and well documented, with native-speaker quality assurance and agreement statistics, and its results cover many widely used models. Evidence is strong for comparative claims but is limited to formal prompts and Standard Burmese.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/3
Reproducibility
Status: Code + data available
Open source: Partial
License: CC-BY-SA 4.0
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
If you build Burmese products, off-the-shelf open models underperform commercial ones on many tasks. Region-specific fine-tuning and model choice matter more than parameter count. Conservative quantization can cut inference cost without large accuracy loss.
Summary TLDR
BURMESE-SAN is a 7-task, 3,920-sample benchmark built and validated by native speakers to measure LLM performance on Burmese across natural language understanding (NLU), reasoning (NLR), and generation (NLG). Evaluations of many open and commercial models show that commercial models lead, that SEA regional fine-tuning helps, that scale effects are non-linear, and that conservative quantization can preserve performance.
Problem Statement
Burmese lacks a comprehensive, high-quality benchmark for testing modern LLMs. Existing datasets are scattered, machine-translated, or low quality, hiding real gaps in Burmese capability.
Main Contribution
Introduce BURMESE-SAN: a human-verified, multi-task Burmese benchmark covering QA, Sentiment Analysis, Toxicity Detection, Causal Reasoning, NLI, Abstractive Summarization, and Machine Translation (7 subtasks, 3,920 samples).
Describe a native-speaker, four-step curation pipeline: sampling/filtering, translation, text normalization, and label verification with native revision.
Key Findings
Commercial models outperform open-weight models on Burmese overall.
Model scale helps but is not sufficient; architecture and tuning matter.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall (MY) normalized score | Gemini 2.5 Pro 72.35% | Top open-weight 54.68% (ERNIE 4.5) | +17.67 pp | BURMESE-SAN aggregate | Tables 3–5; Finding #1 | Tables 3-5 |
| Machine Translation (MT) normalized score | Gemini 2.5 Pro 90.22% (MT column) | Many open models in 40–80% range depending on family | — | FLORES+ subsets in BURMESE-SAN | Table 3 and Table 5 MT scores | Tables 3-5 |
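
The delta in the first row is a difference between two percentages, so it is best read in percentage points rather than as a relative percentage. A minimal sketch of that arithmetic, using only the two overall scores from the table (the helper names `delta_pp` and `relative_gain` are illustrative, not part of the benchmark tooling):

```python
def delta_pp(score: float, baseline: float) -> float:
    """Absolute gap between two percentage scores, in percentage points."""
    return round(score - baseline, 2)


def relative_gain(score: float, baseline: float) -> float:
    """Gap expressed as a percentage of the baseline score."""
    return round(100.0 * (score - baseline) / baseline, 2)


# Overall (MY) normalized scores taken from the Results table above.
gemini_overall = 72.35      # Gemini 2.5 Pro
top_open_overall = 54.68    # top open-weight model (ERNIE 4.5)

gap_pp = delta_pp(gemini_overall, top_open_overall)
gap_rel = relative_gain(gemini_overall, top_open_overall)
print(f"Absolute gap: {gap_pp} pp")
print(f"Relative gap: {gap_rel}% of the open-weight baseline")
```

The absolute gap reproduces the 17.67 reported in the table; the relative figure is a derived number and does not appear in the source.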
What To Try In 7 Days
Run BURMESE-SAN locally on 1–3 candidate models to replicate reported gaps (use provided prompt templates).
If cost-sensitive, test DynFP8 or 8-bit quantization on your top open model and validate reasoning tasks carefully.
Try SEA-focused fine-tuning (small budget) on a Llama or Qwen base for MT and QA subsets to measure quick gains.
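
One way to "validate reasoning tasks carefully" after quantization is to compare per-task scores against the unquantized model and flag any task that drops by more than a chosen tolerance. The sketch below assumes you already have per-task scores from your own runs; the function name, the 2-point tolerance, and the example numbers are all hypothetical placeholders, not values from the benchmark.

```python
from typing import Dict, List


def quantization_regressions(
    baseline: Dict[str, float],
    quantized: Dict[str, float],
    max_drop_pp: float = 2.0,  # assumed tolerance; choose one that fits your use case
) -> List[str]:
    """Return tasks whose quantized score falls more than max_drop_pp below baseline."""
    flagged = []
    for task, base_score in baseline.items():
        if task not in quantized:
            raise KeyError(f"No quantized score for task: {task}")
        drop = base_score - quantized[task]
        if drop > max_drop_pp:
            flagged.append(task)
    return flagged


# Placeholder scores for illustration only; these are NOT benchmark results.
baseline_scores = {"QA": 60.0, "NLI": 55.0, "Causal Reasoning": 58.0, "MT": 75.0}
quantized_scores = {"QA": 59.5, "NLI": 54.0, "Causal Reasoning": 51.0, "MT": 74.8}

print(quantization_regressions(baseline_scores, quantized_scores))
```

Checking each task separately, rather than only the aggregate, keeps a large drop on a reasoning task from being hidden by stable scores elsewhere.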
Risks & Boundaries
Limitations
Evaluation uses formal written Burmese prompts only — results may not generalize to colloquial or spoken Burmese.
Benchmark covers Standard Burmese (central dialect) only; other dialects are excluded.
When Not To Use
When you need to evaluate colloquial, dialectal, or spoken Burmese performance.
When your use case requires domain-specific Burmese variants not covered by BURMESE-SAN.
Failure Modes
Model may appear competent on formal prompts but fail on informal, code-switched, or dialectal inputs.
Aggressive quantization (4-bit NVFP4) can break reasoning and culturally sensitive judgments.

