CVALUES: a Chinese benchmark that measures LLMs on safety (rejecting harms) and responsibility (giving helpful, caring guidance).

July 19, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark provides actionable tests and open data, but results are limited to the tested models, specific prompt pools, and the Chinese cultural context; run both automatic and human checks before rollout.

Citations: 13

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Safety tuning reduces obvious harms, but models still fail to give responsible, empathetic, or legally careful answers; firms should test both rejection (safety) and guidance (responsibility) before deployment.

Who Should Care

Summary TLDR

CVALUES is a Chinese-focused benchmark that tests large language models on two value levels: Level‑1 safety (don’t produce harmful content) and Level‑2 responsibility (provide caring, societally aware guidance). The authors collected 2,100 adversarial and expert prompts, ran both human annotation and a multi‑choice automatic test (4,312 items), and evaluated popular Chinese and multilingual LLMs. Results: most instruction‑tuned models reject unsafe content well, but they often fail the higher bar of responsibility. The paper releases datasets and code for follow-up evaluation and model hardening.

Problem Statement

Existing Chinese LLM benchmarks measure knowledge and reasoning but not whether models follow human values. There is no comprehensive Chinese benchmark that tests both safety (avoid harm) and responsibility (give positive, socially aware guidance). This gap makes it hard to find value alignment failures before model release.

Main Contribution

CVALUES: first Chinese benchmark that explicitly tests two ascending value levels — safety and responsibility.

A mixed data collection: 1,300 adversarial safety prompts from crowdworkers and 800 responsibility prompts from domain experts.

Key Findings

Instruction‑tuned Chinese LLMs score high on human‑annotated safety.

Numbers: ChatGPT 96.9; Chinese‑Alpaca‑Plus‑7B 95.3; ChatGLM‑6B 95 (Table 2)

Practical Use: If you instruction‑tune on safety data (or use RLHF), expect solid performance rejecting overtly harmful prompts; use this as a baseline for release gating.

Evidence Ref: Table 2 (human safety scores)
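
The release-gating idea above can be sketched as a simple threshold check. This is a minimal illustration and not part of the CVALUES code release: the names `passes_safety_gate` and `gate_report` and the 90.0 threshold are assumptions made for the example. The scores are the human safety scores this summary reports from Table 2.

```python
# Human safety scores as quoted in this summary (Table 2).
SAFETY_SCORES = {
    "ChatGPT": 96.9,
    "Chinese-Alpaca-Plus-7B": 95.3,
    "ChatGLM-6B": 95.0,
    "ChatPLUG-13B": 94.7,
    "Chinese-LLaMA-13B": 53.0,
}

# Hypothetical gate: this threshold is an assumption, not a value from the paper.
SAFETY_THRESHOLD = 90.0


def passes_safety_gate(score: float, threshold: float = SAFETY_THRESHOLD) -> bool:
    """Return True when a safety score meets or exceeds the release threshold."""
    return score >= threshold


def gate_report(scores: dict) -> dict:
    """Map each model name to whether it clears the safety gate."""
    return {model: passes_safety_gate(score) for model, score in scores.items()}


if __name__ == "__main__":
    for model, ok in gate_report(SAFETY_SCORES).items():
        print(f"{model}: {'pass' if ok else 'fail'}")
```

A gate like this only covers Level-1 safety; it says nothing about responsibility, which needs its own review.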

Responsibility (helpful, empathetic, societally aware answers) is weaker than safety.

Numbers: ChatPLUG‑13B mean responsibility 6.5/10; Law 5.2; Social Science 2.2 (Table 3)

Practical Use: Don't rely on rejection alone. Add domain expert review and targeted instruction/data for responsibility before deploying in sensitive areas (law, social science).

Evidence Ref: Table 3 (human responsibility scores)
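
One way to turn these domain scores into a review queue is to list the domains that fall below the overall mean, weakest first. The sketch below uses only the ChatPLUG‑13B figures quoted in this summary; the function name `domains_below_mean` is an assumption for illustration, and the full Table 3 covers more domains than appear here.

```python
# Responsibility scores (1-10) for ChatPLUG-13B, limited to the domains
# quoted in this summary; Table 3 in the paper lists more.
RESPONSIBILITY_SCORES = {
    "Environmental Science": 8.7,
    "Law": 5.2,
    "Social Science": 2.2,
}

# Mean responsibility score reported for ChatPLUG-13B.
OVERALL_MEAN = 6.5


def domains_below_mean(scores: dict, mean: float) -> list:
    """Return the domains scoring below `mean`, ordered from weakest to strongest."""
    weak = [domain for domain, score in scores.items() if score < mean]
    return sorted(weak, key=lambda domain: scores[domain])


if __name__ == "__main__":
    for domain in domains_below_mean(RESPONSIBILITY_SCORES, OVERALL_MEAN):
        print(domain, RESPONSIBILITY_SCORES[domain])
```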

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Human safety score (proportion safe) | ChatGPT 96.9; Chinese-Alpaca-Plus-7B 95.3; ChatGLM-6B 95; ChatPLUG-13B 94.7; Chinese-LLaMA-13B 53 | - | - | Human eval on 1,300 safety prompts (Section 2.2; Table 6; Table 2) | Table 2: human evaluation safety scores | Table 2
Human responsibility score (1-10) | ChatPLUG-13B mean 6.5; Environmental Science 8.7; Law 5.2; Social Science 2.2 | - | - | Expert annotation for 800 responsibility prompts (Section 2.2; Table 7; Table 3) | Table 3: human responsibility scores by domain | Table 3

What To Try In 7 Days

Run the CVALUES multi-choice suite to catch obvious comprehension failures quickly.

Collect 100 domain‑specific responsibility prompts from your product teams and run human review on model outputs.

Add targeted supervision data (expert responses) for the weakest domains (e.g., law, social science) and re-evaluate.
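
The first step, scoring a multi-choice run, can be sketched as a per-group accuracy count. This is a minimal sketch under stated assumptions: the record keys `group`, `gold`, and `pred` and the toy records are invented for illustration and do not reflect the actual CVALUES file format, which should be checked against the released data.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from records with hypothetical keys
    'group' (e.g. value level or domain), 'gold' (reference choice),
    and 'pred' (the model's choice)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["group"]] += 1
        if record["pred"] == record["gold"]:
            correct[record["group"]] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records for illustration only; these are not CVALUES items.
TOY_RECORDS = [
    {"group": "safety", "gold": "A", "pred": "A"},
    {"group": "safety", "gold": "B", "pred": "A"},
    {"group": "responsibility", "gold": "B", "pred": "B"},
    {"group": "responsibility", "gold": "A", "pred": "A"},
]

if __name__ == "__main__":
    for group, acc in accuracy_by_group(TOY_RECORDS).items():
        print(f"{group}: {acc:.2f}")
```

Grouping by domain instead of value level gives a quick view of where to focus the human review in the second and third steps.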

Risks & Boundaries

Limitations

Responsibility labels were gathered for ChatPLUG‑13B only; cross‑model human responsibility data is limited.

Automatic multi‑choice prompts test comprehension, not real generation quality.

When Not To Use

As the sole check for model safety or responsibility—do not rely only on multi‑choice accuracy.

For languages or cultures outside Chinese without revalidation.

Failure Modes

High multi‑choice accuracy while model still generates unsafe content in free‑form outputs.

Models default to over‑helpfulness and provide actionable guidance for illegal or unsafe requests.
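
The first failure mode suggests comparing the two signals directly rather than trusting either alone. The sketch below flags a model whose multi-choice accuracy exceeds its human-judged free-form safe rate by more than a chosen margin. The function name `flag_metric_gap`, the 0.10 margin, and the example values are all assumptions for illustration, not figures from the paper.

```python
def flag_metric_gap(mc_accuracy: float, freeform_safe_rate: float,
                    max_gap: float = 0.10) -> bool:
    """Return True when multi-choice accuracy exceeds the free-form safe rate
    by more than `max_gap`. Both inputs are fractions between 0 and 1."""
    return (mc_accuracy - freeform_safe_rate) > max_gap


if __name__ == "__main__":
    # Hypothetical values: strong multi-choice result, weaker free-form behavior.
    print(flag_metric_gap(0.95, 0.70))
    # Hypothetical values: the two signals roughly agree.
    print(flag_metric_gap(0.90, 0.88))
```

A flagged model is a candidate for more free-form human review, since its comprehension score overstates how it behaves when generating.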

Core Entities

Models

ChatGPT, ChatGLM-6B, BELLE-7B-2M, ChatPLUG-3.7B, ChatPLUG-13B, MOSS, Chinese-LLaMA-13B, Chinese-Alpaca-Plus-7B, Chinese-Alpaca-Plus-13B, Ziya-LLaMA-13B

Metrics

Human safety score (proportion safe), human responsibility score (1-10), accuracy

Datasets

CVALUES, CVALUES-COMPARISON, 100PoisonMpts

Benchmarks

CVALUES (safety & responsibility)