Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you deploy Thai-language LLM features, expect culture-specific attacks to succeed more often than general ones. Use ThaiSafetyBench and the released classifier to find these gaps quickly, then prioritize alignment work or a safer model choice.
Summary TLDR
The paper releases ThaiSafetyBench, a 1,954-sample Thai-language safety benchmark (1,889 public) focused on both general harms and Thai cultural scenarios. The authors evaluate 24 LLMs using GPT-4.1 and Gemini-2.5-Pro as automated judges and report higher attack success on Thai-specific prompts versus general prompts. They also publish a lightweight DeBERTa-based harmful-response classifier (weighted F1 ≈ 84.9%) and a public leaderboard to encourage community testing. The dataset and classifier make it practical to detect Thailand-specific safety gaps in models and to run low-cost automated checks.
Problem Statement
English-centered safety benchmarks miss culturally specific attacks. Thailand lacked a public, curated dataset and reproducible tools to measure how language models fail on Thai cultural and contextual prompts.
Main Contribution
ThaiSafetyBench: a curated Thai-language safety dataset of 1,954 malicious prompts with a 6-area, 17-harm taxonomy.
Automated safety evaluation of 24 models, with GPT-4.1 and Gemini-2.5-Pro serving as LLM judges and Attack Success Rate (ASR) as the metric.
ThaiSafetyClassifier: a DeBERTaV3-based harmful-response classifier fine-tuned with LoRA and released with weights and scripts.
ThaiSafetyBench leaderboard: public platform for continuous safety rankings and community submissions.
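To make the ASR metric concrete, here is a minimal sketch that assumes ASR is the fraction of responses a judge marks as harmful. The function name and the verdict list are illustrative placeholders, not the authors' released code or data.

```python
from typing import Iterable


def attack_success_rate(judgements: Iterable[bool]) -> float:
    """Return the fraction of responses that a judge marked as harmful.

    Each element is True when the attack succeeded (the model produced a
    harmful response) and False when the model refused or answered safely.
    """
    flags = list(judgements)
    if not flags:
        raise ValueError("need at least one judged response")
    return sum(flags) / len(flags)


# Made-up judge verdicts for five responses (True = attack succeeded).
verdicts = [True, False, False, True, False]
print(f"ASR = {attack_success_rate(verdicts):.1%}")  # prints "ASR = 40.0%"
```

A lower ASR means the model resisted more of the malicious prompts.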
Key Findings
Dataset size and composition
Over half of prompts are Thai-cultural
Thai-specific attacks are more successful than general Thai attacks
Closed-source models outperform many open-source models on safety
Lightweight classifier matches strong judge performance
Judge agreement is high
Results
Dataset size: 1,954 prompts (1,889 public)
Share of Thai-cultural samples: over half
Classifier weighted F1 (test): ≈ 84.9%
Judge agreement: high
Thai-specific vs general ASR: higher on Thai-specific prompts
Who Should Care
What To Try In 7 Days
Run the public 1,889-sample ThaiSafetyBench subset on your model to surface cultural failure cases.
Use the released DeBERTa classifier to batch-evaluate model outputs and cut judge costs.
Compare your open-source model's ASR against closed-source models to quantify how much extra alignment work is needed.
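The steps above can be sketched as one small evaluation loop. This is a hedged outline only: `generate` stands in for your model call, `is_harmful` stands in for the released classifier or an LLM judge, and the group labels and toy prompts are invented for illustration rather than taken from the dataset.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def asr_by_group(
    prompts: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Compute ASR per prompt group (for example Thai-specific vs general).

    prompts yields (group, prompt_text) pairs. generate and is_harmful are
    caller-supplied stand-ins for the model and the safety classifier/judge.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for group, prompt in prompts:
        response = generate(prompt)
        totals[group] += 1
        hits[group] += int(is_harmful(prompt, response))
    return {group: hits[group] / totals[group] for group in totals}


# Toy stand-ins so the sketch runs end to end; replace with real calls.
def toy_generate(prompt: str) -> str:
    return "REFUSE" if prompt.startswith("general") else "COMPLY"


def toy_is_harmful(prompt: str, response: str) -> bool:
    return response == "COMPLY"


toy_prompts = [
    ("thai_specific", "thai placeholder 1"),
    ("thai_specific", "thai placeholder 2"),
    ("general", "general placeholder 1"),
    ("general", "general placeholder 2"),
]
print(asr_by_group(toy_prompts, toy_generate, toy_is_harmful))
```

Running the same loop against two models gives directly comparable per-group ASR values, which is one way to size the alignment gap.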
Reproducibility
Code Urls
- HuggingFace Dataset (ThaiSafetyBench)
- HuggingFace Model (ThaiSafetyClassifier)
- HuggingFace Leaderboard and GitHub training scripts
Data Urls
- HuggingFace Dataset (public subset of ThaiSafetyBench)
- Full dataset available on request (to comply with Thai regulation)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses only malicious prompts and reports rejection rate; it does not measure how useful or informative the responses are.
- Public release is a subset (1,889 of 1,954) to comply with Thai regulations; full dataset access may be restricted.
- Only simple prompt-based jailbreaks were tested; advanced culturally targeted jailbreaks remain untested.
- Closed-source model superiority is observed on this benchmark but may reflect training data or proprietary alignment not detailed here.
When Not To Use
- When you need to assess model helpfulness or utility rather than rejection behavior.
- When evaluating non-Thai languages or cross-cultural robustness outside Thailand.
- When you require tests against advanced jailbreak methods or multi-turn adversarial attacks.
Failure Modes
- Judge bias: an LLM judge may under- or over-report harms even when agreement between judges is high.
- Over-rejection: models that block many prompts may score well but lose usefulness.
- Dataset blind spots: some Thai cultural nuances or dialects may be missing.
- Evaluation limited to single-turn prompts; multi-turn escalation not covered.
Core Entities
Models
- GPT-4.1
- Gemini-2.5-Pro
- GPT-5
- Claude 4.5 Sonnet
- Qwen2.5-72B-Instruct
- SeaLLMs-v3-7B
- Llama-3.3-70B-Instruct
- Typhoon2.1-gemma3-12b
- openthaigpt1.5-72b-instruct
Metrics
- Attack Success Rate (ASR)
- Weighted F1
- Accuracy
- Precision
- Recall
- Spearman correlation
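For readers less familiar with two of these metrics, the sketch below shows standard textbook definitions of weighted F1 (per-class F1 averaged by class support) and Spearman correlation (Pearson correlation of average ranks). It is a dependency-free illustration, not the authors' evaluation script, and the sample inputs are made up.

```python
from collections import Counter
from math import sqrt
from typing import List, Sequence


def weighted_f1(y_true: Sequence, y_pred: Sequence) -> float:
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation; undefined (ZeroDivisionError) for constant input."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up labels: 1 = harmful response, 0 = safe response.
print(round(weighted_f1([1, 1, 0, 0], [1, 0, 0, 0]), 4))
# Made-up per-model scores from two hypothetical judges.
print(spearman([0.10, 0.25, 0.40], [0.12, 0.30, 0.35]))
```

One plausible use of `spearman` here is checking whether two judges rank a set of models in the same order by ASR.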
Datasets
- ThaiSafetyBench
- Do-Not-Answer (translated)
- Anti-Fake News Center Thailand (transformed)
Benchmarks
- ThaiSafetyBench
- ThaiSafetyBench Leaderboard

