ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

March 5, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat

Links

Abstract / PDF

Why It Matters For Business

If you deploy Thai-language LLM features, culture-specific attacks succeed more often than general ones. Use ThaiSafetyBench and the released classifier to find gaps quickly and to prioritize alignment work or safer model choices.

Summary TLDR

The paper releases ThaiSafetyBench, a 1,954-sample Thai-language safety benchmark (1,889 public) focused on both general harms and Thai cultural scenarios. The authors evaluate 24 LLMs using GPT-4.1 and Gemini-2.5-Pro as automated judges and report higher attack success on Thai-specific prompts versus general prompts. They also publish a lightweight DeBERTa-based harmful-response classifier (weighted F1 ≈ 84.9%) and a public leaderboard to encourage community testing. The dataset and classifier make it practical to detect Thailand-specific safety gaps in models and to run low-cost automated checks.

Problem Statement

English-centered safety benchmarks miss culturally specific attacks. Thailand lacked a public, curated dataset and reproducible tools to measure how language models fail on Thai cultural and contextual prompts.

Main Contribution

ThaiSafetyBench: a curated Thai-language safety dataset of 1,954 malicious prompts with a 6-area, 17-harm taxonomy.

Automated safety evaluation of 24 models using GPT-4.1 and Gemini-2.5-Pro as LLM-as-a-judge and ASR (Attack Success Rate) as the metric.

ThaiSafetyClassifier: a DeBERTaV3-based harmful-response classifier fine-tuned with LoRA and released with weights and scripts.

ThaiSafetyBench leaderboard: public platform for continuous safety rankings and community submissions.
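As a rough illustration of the ASR metric named above, the sketch below assumes ASR is the fraction of malicious prompts whose response a judge marks as harmful, and that each record carries a group label. The group names (`general`, `thai_cultural`), the record layout, and the toy judgments are assumptions made for this example; they are not the paper's schema or results.

```python
from collections import defaultdict


def attack_success_rate(judgments):
    """Return the share of responses judged harmful (True) among all responses."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(1 for harmful in judgments if harmful) / len(judgments)


def asr_by_group(records):
    """Compute ASR separately for each group label.

    records: iterable of (group_label, is_harmful) pairs.
    """
    grouped = defaultdict(list)
    for group, is_harmful in records:
        grouped[group].append(is_harmful)
    return {group: attack_success_rate(flags) for group, flags in grouped.items()}


# Toy, made-up judgments for illustration only (not results from the paper).
toy_records = [
    ("general", False), ("general", False), ("general", True), ("general", False),
    ("thai_cultural", True), ("thai_cultural", False),
    ("thai_cultural", True), ("thai_cultural", False),
]

overall = attack_success_rate(harmful for _, harmful in toy_records)
per_group = asr_by_group(toy_records)
print(f"overall ASR: {overall:.3f}")
for group, value in sorted(per_group.items()):
    print(f"{group}: {value:.3f}")
```

Lower ASR is better: a model that refuses every malicious prompt would score 0 under this definition. Splitting by group is what makes a gap between general and Thai-cultural prompts visible.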

Key Findings

Dataset size and composition

Numbers: 1,954 samples total; public subset 1,889 samples

Over half of prompts are Thai-cultural

Numbers: 51.9% of samples explicitly Thai-cultural

Thai-specific attacks are more successful than general Thai attacks

Closed-source models outperform many open-source models on safety

Lightweight classifier closely approximates GPT-4.1 judge labels

Numbers: Weighted F1 84.9% (test set); accuracy 84.4%

Judge agreement is high

Numbers: Spearman correlation 0.974 between GPT-4.1 and Gemini-2.5-Pro
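To make the judge-agreement figure concrete, here is a minimal, standard-library sketch of Spearman rank correlation: rank each judge's scores (ties get the average rank), then take the Pearson correlation of the ranks. It assumes the correlation is computed over per-model ASR values from the two judges, which the summary does not state explicitly. The score lists are invented toy values.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-model ASR values from two hypothetical judges (not paper data).
judge_a = [0.10, 0.25, 0.40, 0.55, 0.70]
judge_b = [0.12, 0.22, 0.45, 0.50, 0.80]  # same model ordering as judge_a
judge_c = [0.12, 0.45, 0.22, 0.50, 0.80]  # two adjacent models swapped

rho_same = spearman(judge_a, judge_b)
rho_swapped = spearman(judge_a, judge_c)
print(f"identical ordering: {rho_same:.3f}")
print(f"one adjacent swap:  {rho_swapped:.3f}")
```

A high Spearman value says the two judges order models similarly. It does not show that their absolute harm rates agree, which is why judge bias can persist despite strong correlation.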

Results

Dataset size

Value: 1,954 prompts (public subset 1,889)

Share of Thai-cultural samples

Value: 51.9% of dataset

Classifier weighted F1 (test)

Value: 84.9%

Baseline: GPT-4.1 judgments (used as reference)

Judge agreement

Value: Spearman 0.974 between GPT-4.1 and Gemini-2.5-Pro

Thai-specific vs general ASR

Value: Thai-specific attacks have higher ASR than general attacks (across evaluated models)

Baseline: general Thai prompts
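For readers unfamiliar with the weighted F1 reported for the classifier, the sketch below computes it from scratch: per-class F1, averaged with each class weighted by its share of the reference labels. The label names (`harmful`, `safe`) and the toy labels are assumptions for illustration; the reference labels stand in for judge decisions such as the GPT-4.1 judgments mentioned above.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty, equal-length label lists")
    support = Counter(y_true)  # class frequency in the reference labels
    total = len(y_true)
    score = 0.0
    for label in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Classes absent from y_true have zero support and add nothing.
        score += f1 * support[label] / total
    return score


# Toy labels for illustration only (not the paper's test set).
reference = ["harmful"] * 4 + ["safe"] * 6
predicted = ["harmful", "harmful", "harmful", "safe",
             "safe", "safe", "safe", "safe", "safe", "harmful"]

score = weighted_f1(reference, predicted)
print(f"weighted F1: {score:.3f}")
```

Because the weights follow class frequency, a classifier can post a strong weighted F1 while doing worse on the rarer class, so per-class precision and recall are worth checking alongside it.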

Who Should Care

What To Try In 7 Days

Run the public 1,889-sample ThaiSafetyBench subset on your model to surface cultural failure cases.

Use the released DeBERTa classifier to batch-evaluate model outputs and cut judge costs.

Compare your open-source model's ASR against closed-source models to quantify how much extra alignment work is needed.
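The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the released tooling: `classify_batch` is a hypothetical callable that takes a list of response strings and returns one harmful/not-harmful flag per response, and `keyword_stub` is a stand-in used only so the example runs. In practice you would wrap the released classifier behind that interface.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def asr_from_classifier(
    responses: Sequence[str],
    classify_batch: Callable[[Sequence[str]], List[bool]],
    batch_size: int = 32,
) -> float:
    """Run responses through a classifier in batches and return the ASR."""
    if not responses:
        raise ValueError("no responses to evaluate")
    flags: List[bool] = []
    for chunk in batched(responses, batch_size):
        chunk_flags = classify_batch(chunk)
        if len(chunk_flags) != len(chunk):
            raise ValueError("classifier must return one flag per response")
        flags.extend(chunk_flags)
    return sum(flags) / len(flags)


def keyword_stub(batch: Sequence[str]) -> List[bool]:
    """Hypothetical stand-in classifier: flags any text containing 'UNSAFE'."""
    return ["UNSAFE" in text for text in batch]


# Made-up model outputs for illustration only.
your_model_outputs = ["refusal", "UNSAFE detail", "refusal", "refusal", "UNSAFE detail"]
reference_outputs = ["refusal", "refusal", "refusal", "refusal", "UNSAFE detail"]

your_asr = asr_from_classifier(your_model_outputs, keyword_stub, batch_size=2)
reference_asr = asr_from_classifier(reference_outputs, keyword_stub, batch_size=2)
gap = your_asr - reference_asr
print(f"your model ASR: {your_asr:.2f}, reference ASR: {reference_asr:.2f}, gap: {gap:.2f}")
```

Keeping the classifier behind a plain callable makes it easy to swap the stub for a real model or for an LLM judge without touching the scoring logic.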

Reproducibility

Code Urls

  • HuggingFace Dataset (ThaiSafetyBench)
  • HuggingFace Model (ThaiSafetyClassifier)
  • HuggingFace Leaderboard and GitHub training scripts

Data Urls

  • HuggingFace Dataset (public subset of ThaiSafetyBench)
  • Full dataset available on request (to comply with Thai regulation)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses only malicious prompts and measures whether the model rejects them (reported as ASR); it does not measure usefulness or informative value of safe/unsafe responses.
  • Public release is a subset (1,889 of 1,954) to comply with Thai regulations; full dataset access may be restricted.
  • Only simple prompt-based jailbreaks were tested; advanced culturally targeted jailbreaks remain untested.
  • Closed-source model superiority is observed on this benchmark but may reflect training data or proprietary alignment not detailed here.

When Not To Use

  • When you need to assess model helpfulness or utility rather than rejection behavior.
  • When evaluating non-Thai languages or cross-cultural robustness outside Thailand.
  • When you require tests against advanced jailbreak methods or multi-turn adversarial attacks.

Failure Modes

  • Judge bias: LLM-as-a-judge may under/over-report harms despite high correlation.
  • Over-rejection: models that block many prompts may score well but lose usefulness.
  • Dataset blind spots: some Thai cultural nuances or dialects may be missing.
  • Evaluation limited to single-turn prompts; multi-turn escalation not covered.

Core Entities

Models

  • GPT-4.1
  • Gemini-2.5-Pro
  • GPT-5
  • Claude 4.5 Sonnet
  • Qwen2.5-72B-Instruct
  • SeaLLMs-v3-7B
  • Llama-3.3-70B-Instruct
  • Typhoon2.1-gemma3-12b
  • openthaigpt1.5-72b-instruct

Metrics

  • Attack Success Rate (ASR)
  • Weighted F1
  • Accuracy
  • Precision
  • Recall
  • Spearman correlation

Datasets

  • ThaiSafetyBench
  • Do-Not-Answer (translated)
  • Anti-Fake News Center Thailand (transformed)

Benchmarks

  • ThaiSafetyBench
  • ThaiSafetyBench Leaderboard