A living, structured review of 144 open LLM safety datasets and gaps to close

Overview

Decision SnapshotReady For Pilot

This is a useful, up-to-date catalog and analysis for teams auditing or expanding safety testing. It compiles evidence across 144 datasets but does not itself rate dataset quality; users must still pick datasets appropriate to their product and validate on real user data.

Citations4

Evidence Strength0.80

Confidence0.82

Risk Signals8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Model safety claims are often evaluated on a narrow, inconsistent set of datasets (sometimes proprietary), so businesses should adopt a broader, open suite of safety tests to make reliable, comparable claims.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Founder

Summary TLDR

This paper systematically reviews 144 open text datasets (published June 2018–Dec 2024) relevant to evaluating or improving LLM safety. It catalogs dataset purpose, format, creation method, language, licensing, and publication source and publishes a living catalogue at SafetyPrompts.com. Key findings: most datasets are English-only (78.5%), evaluation-focused (77.8%), and many recent datasets are synthetic or templated; major gaps are non-English and naturalistic user-data evaluations. The authors show model releases and benchmarks use only a small, idiosyncratic subset of available safety datasets and call for standardised, broader evaluations.

Problem Statement

Many safety datasets exist, but they are fragmented and uneven: practitioners struggle to find the right datasets, current model evaluations use a narrow subset (often proprietary), and critical gaps remain—especially non-English coverage and naturalistic user data.

Main Contribution

A systematic catalog and structured review of 144 open LLM safety text datasets (cutoff Dec 17, 2024).

A public, continuously updated catalogue (SafetyPrompts.com) and reproducible spreadsheet with metadata and code.

Key Findings

Total datasets reviewed: 144 open text datasets.

Numbersn=144 datasets (published Jun 2018–Dec 2024)

Practical UseUse the SafetyPrompts catalogue to explore all 144 datasets instead of ad-hoc search.

Evidence RefAbstract; §2.2

Most datasets are English-only.

Numbers113/144 = 78.5% English-only

Practical UseExpect low coverage for non-English safety testing; prioritize building or sourcing multilingual evaluations.

Evidence Ref§3.6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Datasets reviewed	144	—	—	—	Total number of open LLM safety datasets included in the review	Abstract; §2.2
English-only datasets	113 (78.5%)	—	—	—	Share of datasets that are exclusively English	§3.6

What To Try In 7 Days

Browse SafetyPrompts.com and pick 5 open datasets matching your product personas (user, adversary, vulnerable).

Add at least one multilingual and one naturalistic user-prompt dataset to your safety checks.

Run your model on a small common set (TruthfulQA, SimpleSafetyTests, DoNotAnswer, XSTest) and publish the results.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusYes

LicenseUnknown

Code URLs

https://github.com/paul-rottger/safetyprompts-paper

Data URLs

https://safetyprompts.com https://github.com/paul-rottger/safetyprompts-paper

Risks & Boundaries

Limitations

The review only covers open text datasets published before Dec 17, 2024 and excludes multimodal and code-specific datasets.

The paper catalogues dataset metadata but does not provide a unified quality score for each dataset.

When Not To Use

Do not use this review as a substitute for task-specific dataset quality checks or for proprietary dataset discovery.

Do not assume dataset suitability for training safety models without manual validation.

Failure Modes

Relying on templated or synthetic tests may overestimate safety under real user interactions.

Evaluating only on the few popular datasets can give a false sense of safety due to narrow coverage.

Core Entities

Models

GPT-3.5Llama-70b-chatGemma 2GPT-4oClaude

Datasets

TruthfulQABBQAnthropicRedTeamRealToxicityPromptsToxiGenWorldValuesBenchSimpleSafetyTestsDecodingTrustHarmBenchXSTestWildChatDoNotAnswerSafetyKit

Benchmarks

HELMTrustLLMLLM Safety LeaderboardLMSYS Chatbot ArenaEvaluation HarnessRewardBench

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Total datasets reviewed: 144 open text datasets.

Most datasets are English-only.

Results

What To Try In 7 Days

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Datasets

Benchmarks

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

Model judges reward ethics-based refusals; human users penalize them

Key finding

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

Key finding

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Key finding

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Key finding