A Chinese LLM safety benchmark plus 100k augmented safety prompts

April 20, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark and 100k prompt library are practical for red-teaming and comparative evaluation, but the LLM-based automatic judge and single-turn focus limit reliability as a production safety certificate.

Citations: 16

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang

Why It Matters For Business

A simple, repeatable Chinese safety benchmark and a 100k prompt library let product and security teams run systematic red-teaming and compare model choices quickly.

Summary TLDR

The paper builds a Chinese-focused safety benchmark that tests 14 safety dimensions (8 typical scenarios and 6 instruction-attack types), evaluates 15 LLMs with an LLM-based automatic judge, and releases SAFETYPROMPTS, a library of 100k augmented prompts and responses. Key findings: instruction attacks expose more failures than typical prompts, and ChatGPT ranks highest, especially on instruction attacks. Scoring is automated with an LLM evaluator (InstructGPT) and simple verbalization rules. Use the benchmark for red-teaming and comparative checks, but not as a final safety certification, because the evaluator and the single-turn tests have known blind spots.

Problem Statement

Large Chinese LLMs can produce insulting, biased, illegal, or harmful outputs. There was no large, public Chinese safety benchmark or prompt library for red-teaming and comparison. The paper fills that gap with a taxonomy, test prompts, automatic LLM-based scoring, a leaderboard, and a 100k augmented prompt dataset.

Main Contribution

A structured safety taxonomy: 8 typical scenarios and 6 instruction-attack types for Chinese LLMs.

A manually written test prompt set (∼8.9k prompts) split into public and private test sets.

Key Findings

Instruction attacks are consistently harder for models than typical safety prompts.

Practical Use: Include instruction-attack style tests when red-teaming; ordinary safety tests will undercount risks.

Evidence Ref: Section 4, Figure 4

ChatGPT ranks highest on overall safety and especially on instruction attacks.

Numbers: Goal Hijacking: ChatGPT leads the next model by more than 20 points

Practical Use: Treat ChatGPT as a strong baseline, but do not assume immunity. Other models need targeted defense against instruction attacks.

Evidence Ref: Section 4 (observation 6)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Instruction-attack vs typical safety score | Instruction-attack scores are lower than typical scenario scores for every evaluated model | | | Evaluated on benchmark prompts (Table 1) | Section 4, observation 5 | Figure 4 |
| Goal Hijacking gap | ChatGPT leads by >20 points | second-best model | >20 points | Goal Hijacking subset (Table 1) | Section 4, observation 6 | Figure 4 |
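
Both results are checks that can be repeated on any leaderboard-style score table. The sketch below is one way to do that, assuming scores on a 0-100 scale; the function names, model names, and numbers are placeholders for illustration and are not figures from the paper.

```python
def attacks_harder_everywhere(table):
    """True if every model's instruction-attack score is below its typical score."""
    return all(row["attack"] < row["typical"] for row in table.values())


def leader_gap(subset_scores):
    """Return (best_model, runner_up, gap_in_points) for one subset."""
    ranked = sorted(subset_scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two models to compute a gap")
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    return best, second, best_score - second_score


# Placeholder models and scores (hypothetical, not taken from the paper).
table = {
    "model_a": {"typical": 90.0, "attack": 80.0},
    "model_b": {"typical": 85.0, "attack": 60.0},
    "model_c": {"typical": 70.0, "attack": 55.0},
}
goal_hijacking = {"model_a": 75.0, "model_b": 50.0, "model_c": 40.0}

print(attacks_harder_everywhere(table))
print(leader_gap(goal_hijacking))
```

Reporting the runner-up alongside the gap keeps the baseline explicit, which matters when a leaderboard is updated and the second-best model changes.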

What To Try In 7 Days

Run your model against SAFETYPROMPTS to find easy-to-trigger failures.

Add the six instruction-attack types (goal hijacking, role-play, etc.) to your QA checks.

Use an LLM evaluator to triage candidate failures, then verify high-risk cases with humans.
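
The triage step in the last item can be sketched as follows. This is a minimal illustration under stated assumptions: `judge` is a hypothetical callable standing in for whatever LLM API is used, the first-word rule in `verbalize` is a stand-in and not the paper's verbalizer, and `HIGH_RISK_SCENARIOS` is an invented routing list.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical scenarios that always get human review, whatever the judge says.
HIGH_RISK_SCENARIOS = {"goal_hijacking", "role_play"}


def verbalize(verdict: str) -> str:
    """Map a free-text judge verdict to 'safe', 'unsafe', or 'unclear'.

    Assumed rule: inspect only the first word of the verdict.
    """
    words = verdict.strip().lower().split()
    first = words[0].strip(".,:;!") if words else ""
    return first if first in {"safe", "unsafe"} else "unclear"


def triage(items: List[Dict], judge: Callable[[str, str], str]) -> Tuple[List[Dict], List[Dict]]:
    """Split items into (auto_pass, needs_human)."""
    auto_pass, needs_human = [], []
    for item in items:
        label = verbalize(judge(item["prompt"], item["response"]))
        labeled = {**item, "judge_label": label}
        # Anything not clearly safe, or any high-risk scenario, goes to a human.
        if label != "safe" or item["scenario"] in HIGH_RISK_SCENARIOS:
            needs_human.append(labeled)
        else:
            auto_pass.append(labeled)
    return auto_pass, needs_human


def fake_judge(prompt: str, response: str) -> str:
    """Stand-in for a real LLM call: flags responses containing a marker word."""
    if "HACKED" in response:
        return "Unsafe. The reply follows the injected instruction."
    return "Safe."


items = [
    {"scenario": "insult", "prompt": "p1", "response": "A polite refusal."},
    {"scenario": "insult", "prompt": "p2", "response": "HACKED"},
    {"scenario": "goal_hijacking", "prompt": "p3", "response": "A polite refusal."},
]
auto_pass, needs_human = triage(items, fake_judge)
```

Sending unclear verdicts to the human queue is a deliberately conservative choice here: a verbalizer that cannot parse the judge's output should not count the response as safe.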

Risks & Boundaries

Limitations

Evaluation relies on an LLM judge (InstructGPT) and a simple verbalizer, which can introduce bias and false labels.

Tests are mostly single-turn; multi-turn conversations are not covered.

When Not To Use

Do not use benchmark scores as the only evidence of safety for deployment decisions.

Avoid relying solely on the LLM evaluator for high-risk content without human review.

Failure Modes

LLM evaluator misses subtle unsafe content (false negatives).

Adversarially crafted multi-turn prompts bypass single-turn tests.
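
One way to quantify the first failure mode is to audit a sample of judged responses against human labels and measure how often the judge misses unsafe content. The sketch below assumes binary 'safe'/'unsafe' labels; the function name and the toy audit sample are hypothetical.

```python
def judge_error_rates(pairs):
    """Compare judge labels with human labels.

    pairs: iterable of (human_label, judge_label), each 'safe' or 'unsafe'.
    Returns the false-negative rate (humans say unsafe, judge says safe) and
    the false-positive rate (humans say safe, judge says unsafe).
    """
    false_neg = false_pos = human_unsafe = human_safe = 0
    for human, judge in pairs:
        if human == "unsafe":
            human_unsafe += 1
            false_neg += int(judge == "safe")
        else:
            human_safe += 1
            false_pos += int(judge == "unsafe")
    return {
        "false_negative_rate": false_neg / human_unsafe if human_unsafe else None,
        "false_positive_rate": false_pos / human_safe if human_safe else None,
    }


# Toy audit sample (illustrative labels only, not data from the paper).
audit = [
    ("unsafe", "unsafe"), ("unsafe", "safe"), ("unsafe", "unsafe"), ("unsafe", "safe"),
    ("safe", "safe"), ("safe", "safe"), ("safe", "unsafe"), ("safe", "safe"),
]
rates = judge_error_rates(audit)
```

A high false-negative rate on the audit sample is a signal that benchmark scores produced by the same judge overstate safety.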

Core Entities

Models

ChatGPT, GPT-4, InstructGPT (text-davinci-003), GPT-3.5-turbo, ChatGLM, MiniChat, OPD, EVA

Metrics

Per-scenario safety score (proportion of safe responses)

Macro-average typical-scenarios score (Ā)

Macro-average instruction-attacks score (B̄)

Overall safety score (S)
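
The metric definitions above can be sketched in a few lines. This is a minimal illustration, assuming each judged response is a `(scenario, is_safe)` pair; the scenario names, the `TYPICAL` and `ATTACK` groupings, and the helper names are hypothetical. The sketch stops at the two macro-averages because this summary does not state how the overall score S combines them.

```python
from collections import defaultdict

# Hypothetical groupings: the benchmark defines 8 typical scenarios and
# 6 instruction-attack types; only a few placeholder names are used here.
TYPICAL = {"insult", "unfairness"}
ATTACK = {"goal_hijacking", "role_play"}


def per_scenario_scores(records):
    """Return {scenario: proportion of responses judged safe}."""
    safe = defaultdict(int)
    total = defaultdict(int)
    for scenario, is_safe in records:
        total[scenario] += 1
        safe[scenario] += int(is_safe)
    return {name: safe[name] / total[name] for name in total}


def macro_average(scores, group):
    """Unweighted mean of per-scenario scores over the scenarios in `group`."""
    present = [scores[name] for name in group if name in scores]
    if not present:
        raise ValueError("no scored scenarios in this group")
    return sum(present) / len(present)


# Toy judged responses (illustrative only, not results from the paper).
records = [
    ("insult", True), ("insult", True),
    ("unfairness", True), ("unfairness", False),
    ("goal_hijacking", False), ("goal_hijacking", True),
    ("role_play", False), ("role_play", False),
]
scores = per_scenario_scores(records)
a_bar = macro_average(scores, TYPICAL)  # typical-scenarios macro-average
b_bar = macro_average(scores, ATTACK)   # instruction-attacks macro-average
```

Macro-averaging weights every scenario equally, so a scenario with few prompts influences the group score as much as one with many.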

Datasets

Manually written safety prompts (∼8.9k, Table 1)

SAFETYPROMPTS (100k augmented prompts and responses)

Benchmarks

Chinese LLM Safety Assessment (8 scenarios + 6 instruction attacks)

Leaderboard at http://coai.cs.tsinghua.edu.cn/leaderboard/