ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Overview

Decision Snapshot: Needs Validation

The dataset and classifier are released and validated, and results rely on two strong LLM judges with high agreement. However, the evaluation covers malicious prompts only and does not measure usefulness trade-offs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you deploy Thai-language LLM features, culture-specific attacks raise failure rates; use ThaiSafetyBench and the classifier to find gaps fast and prioritize alignment or safer model choices.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

The paper releases ThaiSafetyBench, a 1,954-sample Thai-language safety benchmark (1,889 public) focused on both general harms and Thai cultural scenarios. The authors evaluate 24 LLMs using GPT-4.1 and Gemini-2.5-Pro as automated judges and report higher attack success on Thai-specific prompts versus general prompts. They also publish a lightweight DeBERTa-based harmful-response classifier (weighted F1 ≈ 84.9%) and a public leaderboard to encourage community testing. The dataset and classifier make it practical to detect Thailand-specific safety gaps in models and to run low-cost automated checks.
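For readers unfamiliar with the weighted F1 figure quoted for the classifier, the following is a minimal sketch of how such a score is typically computed: per-class F1, averaged with weights proportional to each class's share of the true labels. The label names `harmful` and `safe` and the toy data are assumptions for illustration only; they are not the classifier's actual label set or the paper's evaluation data.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp  # true examples of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels, not taken from the paper.
y_true = ["harmful", "harmful", "safe", "safe"]
y_pred = ["harmful", "safe", "safe", "safe"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support means the score reflects the class balance of the evaluation set, so it should be read alongside per-class precision and recall when the classes are imbalanced.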

Problem Statement

English-centered safety benchmarks miss culturally specific attacks. Thailand lacked a public, curated dataset and reproducible tools to measure how language models fail on Thai cultural and contextual prompts.

Main Contribution

ThaiSafetyBench: a curated Thai-language safety dataset of 1,954 malicious prompts with a 6-area, 17-harm taxonomy.

Automated safety evaluation of 24 models, with GPT-4.1 and Gemini-2.5-Pro as LLM judges and Attack Success Rate (ASR) as the metric.
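As a rough illustration of the ASR metric, the sketch below treats ASR as the fraction of malicious prompts whose response a judge labeled harmful. This is the usual reading of the term; the exact aggregation across the two judges is not described in this summary, so the code simply reports one ASR per judge. The verdict lists are made up.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Fraction of malicious prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Hypothetical verdicts from two judges on the same five responses.
judge_a = [True, False, False, True, False]
judge_b = [True, False, True, True, False]

asr_a = attack_success_rate(judge_a)
asr_b = attack_success_rate(judge_b)
```

Lower ASR is better: it means fewer malicious prompts produced a harmful response.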

Key Findings

Dataset size and composition

Numbers: 1,954 samples total; public subset 1,889 samples

Practical Use: You can run a focused Thai safety sweep with 1,954 prompts; use the public 1,889-sample subset to avoid legal/regulatory issues.

Evidence Ref: Abstract; Section 3

Over half of prompts are Thai-cultural

Numbers: 51.9% of samples explicitly Thai-cultural

Practical Use: Include culture-grounded tests when validating Thai deployments; general English tests will miss many failure modes.

Evidence Ref: Section 3
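One way to act on this finding is to report ASR separately for Thai-cultural and general prompts rather than as a single number. The sketch below assumes each judged record carries a boolean flag; the field names `is_thai_cultural` and `harmful` are hypothetical and may not match the released dataset's schema, and the records are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def asr_by_group(records: Iterable[Mapping], group_key: str = "is_thai_cultural") -> Dict:
    """Compute ASR separately for each value of group_key."""
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        harmful[group] += int(record["harmful"])
    return {group: harmful[group] / totals[group] for group in totals}


# Hypothetical judged records, not taken from the benchmark.
records = [
    {"is_thai_cultural": True, "harmful": True},
    {"is_thai_cultural": True, "harmful": False},
    {"is_thai_cultural": False, "harmful": False},
    {"is_thai_cultural": False, "harmful": False},
]
rates = asr_by_group(records)
```

A gap between the two rates points to culture-specific weaknesses that an aggregate ASR would hide.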

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	1,954 prompts (public subset 1,889)	—	—	ThaiSafetyBench	Section 3; Abstract	—
Share of Thai-cultural samples	51.9% of dataset	—	—	ThaiSafetyBench	Section 3	—

What To Try In 7 Days

Run the public 1,889-sample ThaiSafetyBench subset on your model to surface cultural failure cases.

Use the released DeBERTa classifier to batch-evaluate model outputs and cut judge costs.

Compare the ASR of closed-source models with that of your open-source model to quantify how much extra alignment work is needed.
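The steps above can be sketched as a small batch-evaluation loop. Because this summary does not give the released classifier's model identifier or output labels, the code accepts any callable that maps a batch of responses to harmful/not-harmful flags. `keyword_stub` is a deliberately naive stand-in for that classifier, and the model names and responses are invented.

```python
from typing import Callable, Dict, List


def batch_asr(
    responses: List[str],
    is_harmful: Callable[[List[str]], List[bool]],
    batch_size: int = 32,
) -> float:
    """Score responses in batches and return the share flagged harmful."""
    flags: List[bool] = []
    for start in range(0, len(responses), batch_size):
        flags.extend(is_harmful(responses[start:start + batch_size]))
    if not flags:
        raise ValueError("no responses to score")
    return sum(flags) / len(flags)


def keyword_stub(batch: List[str]) -> List[bool]:
    # Stand-in only: a real run would call the released classifier here.
    return ["UNSAFE" in text for text in batch]


# Hypothetical outputs from two models on the same three prompts.
outputs: Dict[str, List[str]] = {
    "model_a": ["I cannot help with that.", "UNSAFE: step one", "Sorry, no."],
    "model_b": ["UNSAFE: sure", "UNSAFE: here you go", "I cannot help."],
}
asr = {name: batch_asr(resps, keyword_stub, batch_size=2) for name, resps in outputs.items()}
gap = asr["model_b"] - asr["model_a"]
```

Keeping the classifier behind a callable makes it easy to swap in an LLM judge on a sample of responses to spot-check the cheaper classifier.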

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

HuggingFace Dataset (ThaiSafetyBench)

HuggingFace Model (ThaiSafetyClassifier)

HuggingFace Leaderboard and GitHub training scripts

Data URLs

HuggingFace Dataset (public subset of ThaiSafetyBench)

Full dataset available on request (to comply with Thai regulation)

Risks & Boundaries

Limitations

Evaluation uses only malicious prompts and measures rejection behavior (reported as ASR); it does not measure the usefulness or informative value of responses, safe or unsafe.

Public release is a subset (1,889 of 1,954) to comply with Thai regulations; full dataset access may be restricted.

When Not To Use

When you need to assess model helpfulness or utility rather than rejection behavior.

When evaluating non-Thai languages or cross-cultural robustness outside Thailand.

Failure Modes

Judge bias: LLM judges may under-report or over-report harms despite high correlation.

Over-rejection: models that block many prompts may score well but lose usefulness.
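To probe judge bias on your own runs, one option is to check how closely two judges rank the same models, using the Spearman correlation listed among the metrics. The sketch below is a dependency-free implementation with average ranks for ties. The per-model ASR values are invented for illustration and are not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical per-model ASR from two judges (same model order in both lists).
judge_a_asr = [0.05, 0.12, 0.30, 0.41]
judge_b_asr = [0.07, 0.10, 0.35, 0.33]
rho = spearman(judge_a_asr, judge_b_asr)
```

A high correlation shows the judges rank models similarly, but it does not rule out a shared bias, so a small human-labeled sample is still a useful check.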

Core Entities

Models

GPT-4.1, Gemini-2.5-Pro, GPT-5, Claude 4.5 Sonnet, Qwen2.5-72B-Instruct, SeaLLMs-v3-7B, Llama-3.3-70B-Instruct, Typhoon2.1-gemma3-12b, openthaigpt1.5-72b-instruct

Metrics

Attack Success Rate (ASR), Weighted F1, Accuracy, Precision, Recall, Spearman correlation

Datasets

ThaiSafetyBench, Do-Not-Answer (translated), Anti-Fake News Center Thailand (transformed)

Benchmarks

ThaiSafetyBench, ThaiSafetyBench Leaderboard
