Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
16
Why It Matters For Business
A simple, repeatable Chinese safety benchmark and a 100k prompt library let product and security teams run systematic red-teaming and compare model choices quickly.
Summary TLDR
The paper builds a Chinese-focused safety benchmark that tests 14 safety dimensions (8 typical scenarios + 6 instruction-attacks), evaluates 15 LLMs using an LLM-based automatic judge, and releases SAFETYPROMPTS — a 100k augmented prompt/response library. Key findings: instruction attacks reveal more failures than typical prompts, ChatGPT ranks highest especially on instruction attacks, and automated evaluation relies on an LLM evaluator (InstructGPT) with simple verbalization rules. Use the benchmark for red-teaming and comparative checks, but not as a final safety certification because the evaluator and single-turn tests have known blind spots.
Problem Statement
Large Chinese LLMs can produce insulting, biased, illegal, or harmful outputs. There was no large, public Chinese safety benchmark or prompt library for red-teaming and comparison. The paper fills that gap with a taxonomy, test prompts, automatic LLM-based scoring, a leaderboard, and a 100k augmented prompt dataset.
Main Contribution
A structured safety taxonomy: 8 typical scenarios and 6 instruction-attack types for Chinese LLMs.
A manually written test prompt set (∼8.9k prompts) split into public and private test sets.
An automatic safety evaluator that uses an LLM (InstructGPT) to label responses as safe/unsafe.
A public leaderboard that ranks 15 evaluated LLMs across 14 safety dimensions.
SAFETYPROMPTS: a released library of 100k augmented prompts and model responses for red-teaming and tuning.
Key Findings
Instruction attacks are consistently harder for models than typical safety prompts.
ChatGPT ranks highest on overall safety and especially on instruction attacks.
The authors released a 100k augmented prompt and response library named SAFETYPROMPTS.
Evaluation is automated with an LLM evaluator (InstructGPT) and a simple verbalizer rule.
Manually collected test prompts total about 8,881 examples across 14 scenarios.
Even ChatGPT produced unsafe responses in the augmented dataset at low rates.
Results
Instruction-attack vs typical safety score
Goal Hijacking gap
Size of supplemented prompts
ChatGPT unsafe response rate (on augmented set)
Who Should Care
What To Try In 7 Days
Run your model against SAFETYPROMPTS to find easy-to-trigger failures.
Add the six instruction-attack types (goal hijacking, role-play, etc.) to your QA checks.
Use an LLM evaluator to triage candidate failures, then verify high-risk cases with humans.
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation relies on an LLM judge (InstructGPT) and a simple verbalizer, which can introduce bias and false labels.
- Tests are mostly single-turn; multi-turn conversations are not covered.
- Sensitive Topics prompts were partially withheld, reducing full transparency.
- Decoding choices and randomness can change model outputs and scores.
When Not To Use
- Do not use benchmark scores as the only evidence of safety for deployment decisions.
- Avoid relying solely on the LLM evaluator for high-risk content without human review.
- Do not treat single-turn results as representative of multi-turn interactions.
Failure Modes
- LLM evaluator misses subtle unsafe content (false negatives).
- Adversarially crafted multi-turn prompts bypass single-turn tests.
- Augmentation process can produce highly toxic examples that bias tuning if used blindly.
Core Entities
Models
- ChatGPT
- GPT-4
- InstructGPT (text-davinci-003)
- GPT-3.5-turbo
- ChatGLM
- MiniChat
- OPD
- EVA
Metrics
- Per-scenario safety score (proportion of safe responses)
- Macro-average typical-scenarios score (Ā)
- Macro-average instruction-attacks score (Ḃ)
- Overall safety score (S)
Datasets
- Manually-written safety prompts (∼8.9k, Table 1)
- SAFETYPROMPTS (100k augmented prompts and responses)
Benchmarks
- Chinese LLM Safety Assessment (8 scenarios + 6 instruction attacks)
- Leaderboard at http://coai.cs.tsinghua.edu.cn/leaderboard/

