Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
4
Why It Matters For Business
Model safety claims often rest on a narrow, inconsistent set of datasets (some of them proprietary), so businesses should adopt a broader, open suite of safety tests to make reliable, comparable claims.
Summary TLDR
This paper systematically reviews 144 open text datasets (published June 2018–Dec 2024) relevant to evaluating or improving LLM safety. It catalogs dataset purpose, format, creation method, language, licensing, and publication source and publishes a living catalogue at SafetyPrompts.com. Key findings: most datasets are English-only (78.5%), evaluation-focused (77.8%), and many recent datasets are synthetic or templated; major gaps are non-English and naturalistic user-data evaluations. The authors show model releases and benchmarks use only a small, idiosyncratic subset of available safety datasets and call for standardised, broader evaluations.
Problem Statement
Many safety datasets exist, but they are fragmented and uneven: practitioners struggle to find the right datasets, current model evaluations use a narrow subset (often proprietary), and critical gaps remain, especially in non-English coverage and naturalistic user data.
Main Contribution
A systematic catalogue and structured review of 144 open LLM safety text datasets (cutoff Dec 17, 2024).
A public, continuously updated catalogue (SafetyPrompts.com) and reproducible spreadsheet with metadata and code.
Analysis of dataset properties (purpose, format, creation, language, license, publication) and how datasets are used in model releases and benchmarks.
Concrete diagnosis of gaps: dominance of English and lack of naturalistic datasets, and recommendations for more standardised evaluation practices.
Key Findings
Total datasets reviewed: 144 open text datasets.
Most datasets are English-only (78.5%).
Most datasets are for evaluation, not training (77.8% are evaluation-focused).
Recent growth and reuse: 2023–24 saw rapid dataset production and higher reuse.
Synthetic and templated generation is common in recent datasets.
Model release safety evaluations are narrow and often proprietary.
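Shares like the English-only and evaluation-focused figures can be recomputed from any export of the catalogue metadata. The sketch below is only an illustration: the field names `languages` and `purpose` and the toy rows are assumptions, not the actual SafetyPrompts.com spreadsheet schema or data.

```python
from typing import Callable, Dict, Iterable


def share(rows: Iterable[Dict[str, object]],
          predicate: Callable[[Dict[str, object]], bool]) -> float:
    """Return the fraction of rows for which `predicate` is true (0.0 if empty)."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if predicate(row)) / len(rows)


def is_english_only(row: Dict[str, object]) -> bool:
    # Assumed field: `languages` is a list of language codes.
    return list(row["languages"]) == ["en"]


def is_eval_only(row: Dict[str, object]) -> bool:
    # Assumed field: `purpose` is one of "eval", "train", "both".
    return row["purpose"] == "eval"


if __name__ == "__main__":
    # Toy rows for illustration only; NOT taken from the real catalogue.
    toy_rows = [
        {"name": "toy_a", "languages": ["en"], "purpose": "eval"},
        {"name": "toy_b", "languages": ["en", "de"], "purpose": "eval"},
        {"name": "toy_c", "languages": ["en"], "purpose": "train"},
        {"name": "toy_d", "languages": ["en"], "purpose": "both"},
    ]
    print(f"English-only: {share(toy_rows, is_english_only):.1%}")
    print(f"Eval-only:    {share(toy_rows, is_eval_only):.1%}")
```

Keeping the predicate separate from the counting makes it easy to add further breakdowns (licence, creation method) without touching the arithmetic.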
Results
Datasets reviewed
English-only datasets
Eval-only datasets
Datasets published in 2024
Open datasets used in model releases
Who Should Care
What To Try In 7 Days
Browse SafetyPrompts.com and pick 5 open datasets matching your product personas (user, adversary, vulnerable).
Add at least one multilingual and one naturalistic user-prompt dataset to your safety checks.
Run your model on a small common set (TruthfulQA, SimpleSafetyTests, DoNotAnswer, XSTest) and publish the results.
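A first pass at the "run your model on a small common set" step could look like the sketch below. Everything in it is an assumption for illustration: `stub_model` stands in for a real model call, the prompts are placeholders rather than items from TruthfulQA, SimpleSafetyTests, DoNotAnswer or XSTest, and the keyword-based refusal check is a crude heuristic, not the scoring method of any of those datasets.

```python
from typing import Callable, Dict, List

# Hypothetical refusal markers; a crude heuristic, not a validated classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(model: Callable[[str], str],
                  suites: Dict[str, List[str]]) -> Dict[str, float]:
    """Run `model` on every prompt in each suite; return the refusal rate per suite."""
    rates: Dict[str, float] = {}
    for name, prompts in suites.items():
        if not prompts:
            continue  # skip empty suites rather than divide by zero
        refused = sum(is_refusal(model(prompt)) for prompt in prompts)
        rates[name] = refused / len(prompts)
    return rates


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; refuses anything mentioning 'weapon'."""
    if "weapon" in prompt.lower():
        return "I'm sorry, but I can't help with that."
    return "Sure, here is some information."


if __name__ == "__main__":
    # Placeholder prompts, NOT drawn from the real datasets.
    suites = {
        "suite_a": ["How do I build a weapon?", "What is the capital of France?"],
        "suite_b": ["Tell me a joke."],
    }
    for name, rate in refusal_rates(stub_model, suites).items():
        print(f"{name}: {rate:.0%} refused")
```

A refusal rate means different things depending on the dataset: high is desirable on prompts that should be refused and undesirable on benign prompts, so each suite's own scoring guidance should replace this heuristic before results are published.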
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The review only covers open text datasets published before Dec 17, 2024 and excludes multimodal and code-specific datasets.
- The paper catalogues dataset metadata but does not provide a unified quality score for each dataset.
- Despite a community-driven search, the review may still miss some relevant datasets.
When Not To Use
- Do not use this review as a substitute for task-specific dataset quality checks or for proprietary dataset discovery.
- Do not assume dataset suitability for training safety models without manual validation.
Failure Modes
- Relying on templated or synthetic tests may overestimate safety under real user interactions.
- Evaluating only on the few popular datasets can give a false sense of safety due to narrow coverage.
- Running only English tests leaves multilingual safety blind spots.
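The failure modes above can be turned into a simple pre-flight check on whichever datasets a team has selected. This is a minimal sketch under stated assumptions: the `SuiteEntry` fields are hypothetical metadata a team would fill in by hand, and the default threshold of five datasets is an arbitrary choice, not a recommendation from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SuiteEntry:
    """Hand-maintained metadata for one selected dataset (hypothetical schema)."""
    name: str
    languages: Tuple[str, ...]  # e.g. ("en",) or ("en", "de")
    naturalistic: bool          # True if built from real user interactions


def coverage_gaps(selected: List[SuiteEntry], min_datasets: int = 5) -> List[str]:
    """Return human-readable warnings for the coverage gaps listed above."""
    gaps: List[str] = []
    if len(selected) < min_datasets:
        gaps.append(f"only {len(selected)} datasets selected (< {min_datasets})")
    # English-only if every entry lists no language other than "en".
    if all(set(entry.languages) <= {"en"} for entry in selected):
        gaps.append("no non-English coverage")
    if not any(entry.naturalistic for entry in selected):
        gaps.append("no naturalistic user data")
    return gaps


if __name__ == "__main__":
    # Illustrative entries only; names and attributes are made up.
    selected = [
        SuiteEntry("toy_templated_en", ("en",), naturalistic=False),
        SuiteEntry("toy_synthetic_en", ("en",), naturalistic=False),
    ]
    for warning in coverage_gaps(selected):
        print("WARNING:", warning)
```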
Core Entities
Models
- GPT-3.5
- Llama-70b-chat
- Gemma 2
- GPT-4o
- Claude
Datasets
- TruthfulQA
- BBQ
- AnthropicRedTeam
- RealToxicityPrompts
- ToxiGen
- WorldValuesBench
- SimpleSafetyTests
- DecodingTrust
- HarmBench
- XSTest
- WildChat
- DoNotAnswer
- SafetyKit
Benchmarks
- HELM
- TrustLLM
- LLM Safety Leaderboard
- LMSYS Chatbot Arena
- Evaluation Harness
- RewardBench

