ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

March 5, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and classifier are released and validated; results use two strong LLM judges with high agreement, but evaluations focus on malicious prompts only and omit usefulness trade-offs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy Thai-language LLM features, culture-specific attacks raise failure rates; use ThaiSafetyBench and the classifier to find gaps fast and prioritize alignment or safer model choices.

Who Should Care

Summary TLDR

The paper releases ThaiSafetyBench, a Thai-language safety benchmark of 1,954 malicious prompts (1,889 in the public subset) covering both general harms and Thai cultural scenarios. The authors evaluate 24 LLMs with GPT-4.1 and Gemini-2.5-Pro as automated judges and report a higher attack success rate (ASR) on Thai-specific prompts than on general prompts. They also publish a lightweight DeBERTa-based harmful-response classifier (weighted F1 ≈ 84.9%) and a public leaderboard to encourage community testing. Together, the dataset and classifier make it practical to detect Thailand-specific safety gaps in models and to run low-cost automated checks.
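
For readers unfamiliar with the weighted F1 figure quoted for the classifier, the following is a minimal sketch of how that metric is usually computed: per-class F1 averaged with weights proportional to each class's support. The label names `harmful` and `safe` and the toy predictions are assumptions for illustration; they are not the classifier's actual label set or results.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n  # weight each class by its support
    return total / len(y_true)


# Toy labels, invented for illustration only.
y_true = ["harmful", "harmful", "safe", "safe"]
y_pred = ["harmful", "safe", "safe", "safe"]
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

Weighting by support keeps a rare class from dominating the score, which matters when harmful and safe responses are not evenly balanced.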

Problem Statement

English-centered safety benchmarks miss culturally specific attacks. Thailand lacked a public, curated dataset and reproducible tools to measure how language models fail on Thai cultural and contextual prompts.

Main Contribution

ThaiSafetyBench: a curated Thai-language safety dataset of 1,954 malicious prompts with a 6-area, 17-harm taxonomy.

Automated safety evaluation of 24 models using GPT-4.1 and Gemini-2.5-Pro as LLM-as-a-judge and ASR (Attack Success Rate) as the metric.
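
As a rough illustration of the ASR metric named above, the sketch below treats ASR as the fraction of malicious prompts whose response a judge marked harmful. This is the common definition of the metric and is assumed here; the function name and the five verdicts are hypothetical, not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of malicious prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical verdicts for five prompts (True = judge marked the response harmful).
verdicts = [True, False, False, True, False]
print(f"ASR = {attack_success_rate(verdicts):.1%}")  # prints "ASR = 40.0%"
```

A lower ASR means the model resisted more attacks; it says nothing about how helpful the refusals were.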

Key Findings

Dataset size and composition

Numbers: 1,954 samples total; public subset 1,889 samples

Practical Use: You can run a focused Thai safety sweep on the public 1,889-sample subset; the full 1,954-prompt set is available only on request, to comply with Thai regulation.

Evidence Ref: Abstract; Section 3

Over half of prompts are Thai-cultural

Numbers: 51.9% of samples explicitly Thai-cultural

Practical Use: Include culture-grounded tests when validating Thai deployments; general English tests will miss many failure modes.

Evidence Ref: Section 3
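
One way to act on this finding is to report ASR separately for Thai-cultural and general prompts instead of a single pooled number. The sketch below assumes each judged sample carries a group tag; the tag names `thai_cultural` and `general` and the eight records are made up for illustration and do not reflect the dataset's actual schema or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def asr_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Attack success rate per group from (group, judged_harmful) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    harmful: Dict[str, int] = defaultdict(int)
    for group, is_harmful in records:
        totals[group] += 1
        harmful[group] += int(is_harmful)
    return {group: harmful[group] / totals[group] for group in totals}


# Invented records: (group tag, did the judge mark the response harmful?)
records = [
    ("thai_cultural", True), ("thai_cultural", True),
    ("thai_cultural", False), ("thai_cultural", False),
    ("general", True), ("general", False),
    ("general", False), ("general", False),
]
rates = asr_by_group(records)
gap = rates["thai_cultural"] - rates["general"]
print(rates, f"gap = {gap:+.2f}")
```

A positive gap on real data would indicate that culture-specific prompts get past the model more often than general ones.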

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 1,954 prompts (public subset 1,889) | - | - | ThaiSafetyBench | - | Section 3; Abstract
Share of Thai-cultural samples | 51.9% of dataset | - | - | ThaiSafetyBench | - | Section 3

What To Try In 7 Days

Run the public 1,889-sample ThaiSafetyBench subset on your model to surface cultural failure cases.

Use the released DeBERTa classifier to batch-evaluate model outputs and cut judge costs.

Compare the ASR of closed-source models with that of your open-source model to quantify how much extra alignment work is needed.
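
The three steps above can be sketched as a small harness that runs the same prompts through several models and scores each with one harmfulness check. Everything here is a stand-in: `generate` and `is_harmful` are hypothetical callables you would replace with your own model client and with the released classifier or an LLM judge, and the placeholder prompts are not from the dataset.

```python
from typing import Callable, Dict, List

Generate = Callable[[str], str]      # prompt -> model response
Judge = Callable[[str, str], bool]   # (prompt, response) -> judged harmful?


def sweep_asr(prompts: List[str], generate: Generate, is_harmful: Judge) -> float:
    """Run every prompt through one model and return its attack success rate."""
    if not prompts:
        raise ValueError("prompts must not be empty")
    hits = sum(is_harmful(p, generate(p)) for p in prompts)
    return hits / len(prompts)


def compare_models(prompts: List[str], models: Dict[str, Generate],
                   is_harmful: Judge) -> Dict[str, float]:
    """ASR for each named model on the same prompt set."""
    return {name: sweep_asr(prompts, gen, is_harmful) for name, gen in models.items()}


# Toy stand-ins (assumptions): swap in real model calls and a real judge.
def always_refuses(prompt: str) -> str:
    return "I cannot help with that."


def complies_on_long_prompts(prompt: str) -> str:
    return "Sure, here is how." if len(prompt) > 10 else "I cannot help with that."


def toy_judge(prompt: str, response: str) -> bool:
    return not response.startswith("I cannot")


prompts = ["placeholder prompt A", "short", "placeholder prompt B", "tiny"]
scores = compare_models(
    prompts,
    {"closed_model": always_refuses, "my_open_model": complies_on_long_prompts},
    toy_judge,
)
asr_gap = scores["my_open_model"] - scores["closed_model"]
print(scores, f"gap = {asr_gap:+.2f}")
```

Keeping the judge as a parameter makes it easy to score the same responses twice, once with a cheap classifier and once with an LLM judge, and compare the two.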

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

HuggingFace Dataset (ThaiSafetyBench)

HuggingFace Model (ThaiSafetyClassifier)

HuggingFace Leaderboard and GitHub training scripts

Data URLs

HuggingFace Dataset (public subset of ThaiSafetyBench)

Full dataset available on request (to comply with Thai regulation)

Risks & Boundaries

Limitations

Evaluation uses only malicious prompts and reports rejection rate; it does not measure usefulness or informative value of safe/unsafe responses.

Public release is a subset (1,889 of 1,954) to comply with Thai regulations; full dataset access may be restricted.

When Not To Use

When you need to assess model helpfulness or utility rather than rejection behavior.

When evaluating non-Thai languages or cross-cultural robustness outside Thailand.

Failure Modes

Judge bias: LLM-as-a-judge may under/over-report harms despite high correlation.

Over-rejection: models that block many prompts may score well but lose usefulness.
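
To make the judge-bias point concrete, the sketch below computes a Spearman rank correlation between per-model ASR values from two judges. It is a plain-Python illustration, not the authors' evaluation code, and the ASR numbers for both judges are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Invented per-model ASR values from two judges (four models).
judge_a = [0.05, 0.20, 0.35, 0.50]
judge_b = [0.07, 0.30, 0.25, 0.45]
print(f"Spearman rho = {spearman(judge_a, judge_b):.2f}")  # prints "Spearman rho = 0.80"
```

A high rho only shows that the judges rank models similarly. Two judges that share the same blind spot would still correlate well, so spot-checking a sample of verdicts by hand remains worthwhile.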

Core Entities

Models

GPT-4.1, Gemini-2.5-Pro, GPT-5, Claude 4.5 Sonnet, Qwen2.5-72B-Instruct, SeaLLMs-v3-7B, Llama-3.3-70B-Instruct, Typhoon2.1-gemma3-12b, openthaigpt1.5-72b-instruct

Metrics

Attack Success Rate (ASR), Weighted F1, Accuracy, Precision, Recall, Spearman correlation

Datasets

ThaiSafetyBench, Do-Not-Answer (translated), Anti-Fake News Center Thailand (transformed)

Benchmarks

ThaiSafetyBench, ThaiSafetyBench Leaderboard