Overview
The benchmark and 100k prompt library are practical for red-teaming and comparative evaluation, but the LLM-based automatic judge and the single-turn focus limit how far its scores can be trusted as a production safety certificate.
Citations: 16
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/6
Findings with evidence refs: 6/6
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
A simple, repeatable Chinese safety benchmark and a 100k prompt library let product and security teams run systematic red-teaming and compare model choices quickly.
Who Should Care
Summary TLDR
The paper builds a Chinese-focused safety benchmark that tests 14 safety dimensions (8 typical scenarios and 6 instruction-attack types), evaluates 15 LLMs with an LLM-based automatic judge, and releases SAFETYPROMPTS, a 100k augmented prompt/response library. Key findings: instruction attacks expose more failures than typical prompts; ChatGPT ranks highest, especially on instruction attacks; and the automated evaluation depends on an LLM evaluator (InstructGPT) with simple verbalization rules. Use the benchmark for red-teaming and comparative checks, but not as a final safety certification, because the evaluator and the single-turn tests have known blind spots.
Problem Statement
Chinese LLMs can produce insulting, biased, illegal, or harmful outputs. Before this work there was no large, public Chinese safety benchmark or prompt library for red-teaming and model comparison. The paper fills that gap with a taxonomy, test prompts, automatic LLM-based scoring, a leaderboard, and a 100k augmented prompt dataset.
Main Contribution
A structured safety taxonomy: 8 typical scenarios and 6 instruction-attack types for Chinese LLMs.
A manually written test prompt set (∼8.9k prompts) split into public and private test sets.
Key Findings
Instruction attacks are consistently harder for models than typical safety prompts.
ChatGPT ranks highest on overall safety and especially on instruction attacks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Instruction-attack vs typical safety score | Instruction-attack scores are lower than typical scenario scores for every evaluated model | — | — | Evaluated on benchmark prompts (Table 1) | Section 4, observation 5 | Figure 4 |
| Goal Hijacking gap | ChatGPT leads by >20 points | second-best model | >20 points | Goal Hijacking subset (Table 1) | Section 4, observation 6 | Figure 4 |
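The two rows above compare per-category safety scores and the lead of the top model over the runner-up. The sketch below shows one way such numbers could be derived from judged responses. It assumes a score is the percentage of responses the judge labels safe, which this summary does not spell out, and the function names and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def safety_scores(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each category to the percentage of responses judged safe.

    `records` holds (category, judged_safe) pairs. Treating the score as
    "percent judged safe" is an assumption of this sketch.
    """
    totals: Dict[str, int] = defaultdict(int)
    safe: Dict[str, int] = defaultdict(int)
    for category, judged_safe in records:
        totals[category] += 1
        safe[category] += int(bool(judged_safe))
    return {c: 100.0 * safe[c] / totals[c] for c in totals}


def lead_over_runner_up(scores_by_model: Dict[str, float]) -> Tuple[str, float]:
    """Return the top-scoring model and its lead, in points, over the second-best."""
    if len(scores_by_model) < 2:
        raise ValueError("need at least two models to compute a lead")
    ranked = sorted(scores_by_model.items(), key=lambda kv: kv[1], reverse=True)
    (best, top_score), (_, second_score) = ranked[0], ranked[1]
    return best, top_score - second_score
```

Calling `lead_over_runner_up` on the scores for a single category, such as Goal Hijacking, yields the kind of "Delta" figure reported in the table; computing it per category rather than overall keeps a large gap in one attack type from being averaged away.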
What To Try In 7 Days
Run your model against SAFETYPROMPTS to find easy-to-trigger failures.
Add the six instruction-attack types (goal hijacking, role-play, etc.) to your QA checks.
Use an LLM evaluator to triage candidate failures, then verify high-risk cases with humans.
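The three steps above can be sketched as a small triage loop. This is a minimal illustration, not the paper's pipeline: `model` and `judge` are hypothetical callables you would wrap around your own LLM and evaluator, the category names are placeholders, and the routing rule (send every high-risk case to a human regardless of the judge's verdict) is one possible design choice.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple

# Placeholder category labels; use whatever labels your prompt set carries.
HIGH_RISK: Set[str] = {"goal_hijacking", "role_play"}


@dataclass
class Case:
    category: str
    prompt: str
    response: str
    judged_safe: bool


def triage(
    prompts: Iterable[Tuple[str, str]],
    model: Callable[[str], str],
    judge: Callable[[str, str], bool],
    high_risk: Set[str] = HIGH_RISK,
) -> Tuple[List[Case], List[Case]]:
    """Run each (category, prompt) through the model and the judge.

    Returns (candidate_failures, human_queue):
      - candidate_failures: cases the judge labeled unsafe
      - human_queue: all cases in high-risk categories, whatever the verdict,
        because an LLM judge can miss unsafe content
    """
    candidate_failures: List[Case] = []
    human_queue: List[Case] = []
    for category, prompt in prompts:
        response = model(prompt)
        case = Case(category, prompt, response, judge(prompt, response))
        if not case.judged_safe:
            candidate_failures.append(case)
        if category in high_risk:
            human_queue.append(case)
    return candidate_failures, human_queue
```

Keeping the two lists separate lets the cheap automatic pass cover every prompt while the human budget goes to the categories where a wrong label is most costly.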
Risks & Boundaries
Limitations
Evaluation relies on an LLM judge (InstructGPT) and a simple verbalizer, which can introduce bias and false labels.
Tests are mostly single-turn; multi-turn conversations are not covered.
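To make the first limitation concrete, here is a minimal sketch of what a simple verbalizer might look like: it maps the judge's free-text verdict to a label by inspecting only the first word. The paper's actual rules are not given in this summary, so the keyword lists and the `verbalize` function are assumptions made purely for illustration.

```python
from typing import Optional

# Hypothetical verdict keywords; the real verbalizer rules are not stated here.
SAFE_WORDS = {"yes", "safe"}
UNSAFE_WORDS = {"no", "unsafe"}


def verbalize(judge_output: str) -> Optional[bool]:
    """Map a judge's text verdict to True (safe), False (unsafe), or None.

    Only the first word is inspected, so any qualification that follows
    the verdict word is discarded.
    """
    words = judge_output.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!?\"'")
    if first in UNSAFE_WORDS:
        return False
    if first in SAFE_WORDS:
        return True
    return None  # unparseable verdict: better sent to a human than guessed
```

A verdict such as "Yes, but the reply also hints at how to bypass the filter" would be labeled safe by this rule, which is the kind of false label the limitation warns about. Returning `None` for anything unparseable, instead of defaulting to safe, at least keeps ambiguous verdicts visible.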
When Not To Use
Do not use benchmark scores as the only evidence of safety for deployment decisions.
Avoid relying solely on the LLM evaluator for high-risk content without human review.
Failure Modes
LLM evaluator misses subtle unsafe content (false negatives).
Adversarially crafted multi-turn prompts bypass single-turn tests.

