Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If you build Burmese products, off-the-shelf open models underperform commercial ones on many tasks. Region-specific fine-tuning and model choice matter more than parameter count. Conservative quantization can cut inference cost without large accuracy loss.
Summary TLDR
BURMESE-SAN is a 7-task, 3,920-sample benchmark built and validated by native speakers to measure LLM performance on Burmese across natural language understanding (NLU), reasoning (NLR), and generation (NLG). Evaluations of many open-weight and commercial models show that commercial models lead, that Southeast Asia (SEA) regional fine-tuning brings gains, that scale effects are non-linear, and that conservative quantization can preserve performance.
Problem Statement
Burmese lacks a comprehensive, high-quality benchmark for testing modern LLMs. Existing datasets are scattered, machine-translated, or low quality, hiding real gaps in Burmese capability.
Main Contribution
Introduce BURMESE-SAN: a human-verified, multi-task Burmese benchmark covering QA, Sentiment Analysis, Toxicity Detection, Causal Reasoning, NLI, Abstractive Summarization, and Machine Translation (7 subtasks, 3,920 samples).
Describe a native-speaker, four-step curation pipeline: sampling/filtering, translation, text normalization, and label verification with native revision.
Evaluate many open-weight and commercial LLMs at scale across model families and sizes, reporting normalized [0–100] scores and stability estimates.
Release benchmark and leaderboard for public use (dataset and code licensed CC-BY-SA 4.0).
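The normalized [0–100] scores mentioned above can be illustrated with a minimal sketch. This summary does not state the exact formula, so the version below assumes one common approach: rescale a raw metric so that a random-chance baseline maps to 0 and a perfect score maps to 100, clipping anything outside that range. The function name, parameters, and baseline values are hypothetical and are not taken from the paper.

```python
def normalize_score(raw: float, baseline: float, max_score: float = 1.0) -> float:
    """Rescale `raw` so that `baseline` maps to 0 and `max_score` maps to 100.

    ASSUMPTION: baseline-relative rescaling with clipping. The benchmark's
    actual normalization may differ.
    """
    if max_score <= baseline:
        raise ValueError("max_score must be greater than baseline")
    scaled = (raw - baseline) / (max_score - baseline) * 100.0
    # Clip so that below-chance results do not produce negative scores.
    return max(0.0, min(100.0, scaled))
```

Under these assumptions, a 4-way multiple-choice task would use `baseline=0.25`, so an accuracy at or below chance normalizes to 0 rather than 25.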
Key Findings
Commercial models outperform open-weight models on Burmese overall.
Model scale helps but is not sufficient; architecture and tuning matter.
Southeast-Asia (SEA) regional fine-tuning yields task-dependent gains.
Careful quantization can retain most performance; aggressive 4-bit quantization can hurt reasoning.
Burmese capability has advanced quickly across recent model generations.
Results
Overall (MY) normalized score
Machine Translation (MT) normalized score
Question Answering (QA) normalized score
Who Should Care
What To Try In 7 Days
Run BURMESE-SAN locally on 1–3 candidate models to replicate reported gaps (use provided prompt templates).
If cost-sensitive, test DynFP8 or 8-bit quantization on your top open model and validate reasoning tasks carefully.
Try SEA-focused fine-tuning (small budget) on a Llama or Qwen base for MT and QA subsets to measure quick gains.
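The quantization check suggested above can be sketched as a per-task comparison between a full-precision run and a quantized run of the same model. This is a minimal illustration, assuming you already have normalized per-task scores for both runs; the function name, the task keys, and the 2-point tolerance are hypothetical choices, not values from the paper.

```python
def find_regressions(
    baseline: dict[str, float],
    quantized: dict[str, float],
    max_drop: float = 2.0,
) -> dict[str, float]:
    """Return {task: drop} for tasks whose score fell by more than `max_drop`.

    ASSUMPTION: scores are on the same normalized scale in both runs.
    """
    regressions = {}
    for task, base_score in baseline.items():
        if task not in quantized:
            raise KeyError(f"missing quantized score for task: {task}")
        drop = base_score - quantized[task]
        if drop > max_drop:
            regressions[task] = drop
    return regressions
```

Checking each task separately, rather than only the average, matters here because the summary reports that aggressive 4-bit quantization mainly hurts reasoning, which an overall mean could hide.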
Agent Features
Frameworks
- SEA-LION (regional tuning pipeline)
- vLLM (inference defaults referenced)
Architectures
- MoE
- Transformer (instruction-tuned variants)
- Multimodal VL variants (Gemma/Qwen VL)
Optimization Features
Infra Optimization
- Recommendations to validate quantized weights before deployment
Model Optimization
- MoE
- Quantization methods evaluated: DynFP8 (dynamic FP8) and NVFP4 (NVIDIA 4-bit)
System Optimization
- Use of model-specific default decoding settings and stable zero-shot evaluation
Training Optimization
- SEA regional fine-tuning (task- and family-dependent gains)
- Instruction tuning and reasoning-focused RL-based tuning discussed
Inference Optimization
- Conservative quantization (DynFP8/8-bit) preserves performance for many tasks
- Aggressive 4-bit (NVFP4) can degrade reasoning models
Reproducibility
License
- CC-BY-SA 4.0
Data Urls
- https://leaderboard.sea-lion.ai/detailed/MY
- Dataset links cited in paper (FLORES+, XL-Sum, Belebele, myHateSpeech, myXNLI)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses formal written Burmese prompts only, so results may not generalize to colloquial or spoken Burmese.
- Benchmark covers Standard Burmese (central dialect) only; other dialects are excluded.
- Some datasets were adapted from English; residual translation artifacts can remain despite re-translation and QC.
When Not To Use
- When you need to evaluate colloquial, dialectal, or spoken Burmese performance.
- When your use case requires domain-specific Burmese variants not covered by BURMESE-SAN.
- When you need end-to-end conversational or multimodal Burmese evaluation beyond single-turn prompts.
Failure Modes
- A model may appear competent on formal prompts but fail on informal, code-switched, or dialectal inputs.
- Aggressive quantization (4-bit NVFP4) can break reasoning and culturally sensitive judgments.
- High aggregate scores mask per-task weaknesses (e.g., some models score well on MT but poorly on NLI/QA).
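The last failure mode can be guarded against with a simple per-task breakdown. The sketch below, with a hypothetical function name and an arbitrary 15-point margin, reports the aggregate mean together with any task that falls well below it. The scores used in the usage note are made up for illustration and are not results from the paper.

```python
from statistics import mean


def weak_tasks(scores: dict[str, float], margin: float = 15.0) -> tuple[float, list[str]]:
    """Return (aggregate mean, tasks scoring more than `margin` below the mean).

    ASSUMPTION: an unweighted mean over tasks; the benchmark's own
    aggregation may weight tasks differently.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    aggregate = mean(scores.values())
    weak = sorted(task for task, s in scores.items() if aggregate - s > margin)
    return aggregate, weak
```

For example, `weak_tasks({"mt": 90.0, "qa": 50.0, "nli": 40.0})` gives an aggregate of 60.0 and flags only `"nli"`, which sits 20 points below the mean.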
Core Entities
Models
- Gemini 2.5 Pro
- GPT-5
- GPT-4.1
- Gemma 3
- Qwen 3
- Qwen 2.5
- Llama 4 Maverick
- ERNIE 4.5
- SEA-LION (Qwen/Gemma/Llama variants)
- DeepSeek V3.1
- Qwen 3 (Thinking) 235B MoE
Metrics
- Accuracy
- MetricX-24 (rescaled)
- ROUGE-L F1 (rescaled)
- Normalized [0–100] aggregate score
Datasets
- Belebele
- GKLMIP-mya
- myHateSpeech
- Balanced COPA
- myXNLI
- XL-Sum
- FLORES+
Benchmarks
- BURMESE-SAN
- SEA-HELM
- SEA-LION
- XL-Sum (source for AS)
- FLORES+ (source for MT)
Context Entities
Models
- Gemma 2
- Gemma 3 VL
- Llama 3 / 3.1 / 3.3
- Mistral
- Qwen 3 Next
- Kimi K2 Instruct
- DeepSeek
Metrics
- Cohen's Kappa
- Krippendorff's Alpha
- joint annotator agreement
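Of the agreement metrics listed above, Cohen's Kappa is the simplest to compute by hand. The sketch below implements the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. How the paper applied it is not described in this summary; the function name and the example labels in the usage note are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[lab] / n) * (counts_b[lab] / n) for lab in counts_a)
    if p_e == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

With annotator A giving `["pos", "pos", "neg", "neg"]` and annotator B giving `["pos", "neg", "neg", "neg"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.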
Datasets
- SEA-wide benchmarks (SEA-Crowd, SEA-HELM, SeaExam, SeaBench)
- NLLB / XL-Sum / FLORES+ (multi-lingual sources)
Benchmarks
- HELM
- SEA-EVAL

