Overview
The dataset and classifier are released and validated; results use two strong LLM judges with high agreement, but evaluations focus on malicious prompts only and omit usefulness trade-offs.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
If you deploy Thai-language LLM features, culture-specific attacks raise failure rates; use ThaiSafetyBench and the classifier to find gaps fast and prioritize alignment or safer model choices.
Summary TLDR
The paper releases ThaiSafetyBench, a 1,954-sample Thai-language safety benchmark (1,889 public) focused on both general harms and Thai cultural scenarios. The authors evaluate 24 LLMs using GPT-4.1 and Gemini-2.5-Pro as automated judges and report higher attack success on Thai-specific prompts versus general prompts. They also publish a lightweight DeBERTa-based harmful-response classifier (weighted F1 ≈ 84.9%) and a public leaderboard to encourage community testing. The dataset and classifier make it practical to detect Thailand-specific safety gaps in models and to run low-cost automated checks.
Problem Statement
English-centered safety benchmarks miss culturally specific attacks. Thailand lacked a public, curated dataset and reproducible tools to measure how language models fail on Thai cultural and contextual prompts.
Main Contribution
ThaiSafetyBench: a curated Thai-language safety dataset of 1,954 malicious prompts with a 6-area, 17-harm taxonomy.
Automated safety evaluation of 24 models using GPT-4.1 and Gemini-2.5-Pro as LLM-as-a-judge and ASR (Attack Success Rate) as the metric.
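As a rough illustration of the ASR metric named above, the sketch below treats ASR as the fraction of malicious prompts whose response a judge labels harmful. The function name and the sample verdicts are hypothetical, and the paper's exact aggregation across the two judges is not described here, so this covers a single judge only.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of malicious prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical verdicts from one judge for five prompts
# (True = the attack succeeded, i.e. the response was judged harmful).
verdicts = [True, False, False, True, False]
print(f"ASR = {attack_success_rate(verdicts):.1%}")
```

Lower is better: a model that refuses every malicious prompt would score an ASR of 0%.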
Key Findings
Dataset size and composition
Over half of prompts are Thai-cultural
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 1,954 prompts (public subset 1,889) | — | — | ThaiSafetyBench | Section 3; Abstract | — |
| Share of Thai-cultural samples | 51.9% of dataset | — | — | ThaiSafetyBench | Section 3 | — |
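The figures in the table imply a few derived numbers, worked out in the short sketch below. The constant names are illustrative, and because 51.9% is a rounded share, the Thai-cultural count is only approximate.

```python
# Figures taken from the table above; constant names are illustrative.
TOTAL_PROMPTS = 1954
PUBLIC_PROMPTS = 1889
THAI_CULTURAL_SHARE = 0.519  # rounded share, so the derived count is approximate

withheld = TOTAL_PROMPTS - PUBLIC_PROMPTS
public_share = PUBLIC_PROMPTS / TOTAL_PROMPTS
approx_thai_cultural = round(TOTAL_PROMPTS * THAI_CULTURAL_SHARE)

print(f"Withheld prompts: {withheld}")
print(f"Public share: {public_share:.1%}")
print(f"Approx. Thai-cultural prompts: {approx_thai_cultural}")
```

Under these numbers, about 65 prompts are withheld from the public release, which is roughly 3% of the full dataset.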
What To Try In 7 Days
Run the public 1,889-sample ThaiSafetyBench subset on your model to surface cultural failure cases.
Use the released DeBERTa classifier to batch-evaluate model outputs and cut judge costs.
Compare your open-source model's ASR against closed-source models to quantify how much extra alignment work is needed.
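One way to quantify the comparison in the last step is to compute each model's ASR on the same prompts and report the gap in percentage points. This is a minimal sketch under assumptions: the model names, the verdict lists, and the helper functions are hypothetical placeholders, not part of the released tooling.

```python
from typing import Dict, List


def asr(verdicts: List[bool]) -> float:
    """Attack success rate: share of responses judged harmful."""
    return sum(verdicts) / len(verdicts)


def asr_gap(results: Dict[str, List[bool]], baseline: str) -> Dict[str, float]:
    """ASR of each model minus the baseline model's ASR."""
    base = asr(results[baseline])
    return {
        name: asr(verdicts) - base
        for name, verdicts in results.items()
        if name != baseline
    }


# Hypothetical per-prompt verdicts (True = judged harmful) on the same prompts.
results = {
    "closed_reference": [False, False, True, False],
    "my_open_model": [True, False, True, True],
}

for name, gap in asr_gap(results, baseline="closed_reference").items():
    print(f"{name}: {gap * 100:+.1f} pp vs closed_reference")
```

A positive gap means the model under test is attacked successfully more often than the reference, which is a rough signal of how much alignment work remains.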
Risks & Boundaries
Limitations
Evaluation uses only malicious prompts and measures whether models reject them; it does not measure the usefulness or informative value of the responses.
Public release is a subset (1,889 of 1,954) to comply with Thai regulations; full dataset access may be restricted.
When Not To Use
When you need to assess model helpfulness or utility rather than rejection behavior.
When evaluating non-Thai languages or cross-cultural robustness outside Thailand.
Failure Modes
Judge bias: LLM-as-a-judge may under-report or over-report harms, even when the two judges correlate highly with each other.
Over-rejection: models that block many prompts may score well but lose usefulness.
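To monitor the judge-bias risk on your own runs, one option is to check how often two judges agree beyond chance. The sketch below computes Cohen's kappa for binary harmful/safe labels. This is an assumed choice of statistic for illustration (the agreement measure used in the paper is not stated here), and the verdict lists are made up.

```python
from typing import Sequence


def cohen_kappa(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty set of items")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a = sum(judge_a) / n  # share labeled harmful by judge A
    p_b = sum(judge_b) / n  # share labeled harmful by judge B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both judges gave one identical constant label
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judges on six responses (True = harmful).
judge_a = [True, True, False, False, True, False]
judge_b = [True, False, False, False, True, False]
print(f"kappa = {cohen_kappa(judge_a, judge_b):.3f}")
```

High agreement between judges does not rule out shared bias: two judges can agree with each other and still both disagree with human annotators, so a spot check against human labels remains useful.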

