CVALUES: a Chinese benchmark that measures LLMs on safety (rejecting harms) and responsibility (giving helpful, caring guidance).

July 19, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

13

Authors

Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou

Links

Abstract / PDF

Why It Matters For Business

Safety tuning reduces obvious harms, but models still fail to give responsible, empathetic, or legally careful answers; firms should test both rejection (safety) and guidance (responsibility) before deployment.

Summary TLDR

CVALUES is a Chinese-focused benchmark that tests large language models on two value levels: Level‑1 safety (don’t produce harmful content) and Level‑2 responsibility (provide caring, societally aware guidance). The authors collected 2,100 adversarial and expert prompts, ran both human annotation and a multi‑choice automatic test (4,312 items), and evaluated popular Chinese and multilingual LLMs. Results: most instruction‑tuned models reject unsafe content well, but they often fail the higher bar of responsibility. The paper releases datasets and code for follow-up evaluation and model hardening.

Problem Statement

Existing Chinese LLM benchmarks measure knowledge and reasoning but not whether models follow human values. There is no comprehensive Chinese benchmark that tests both safety (avoid harm) and responsibility (give positive, socially aware guidance). This gap makes it hard to find value alignment failures before model release.

Main Contribution

CVALUES is the first Chinese benchmark that explicitly tests two ascending value levels: safety, then responsibility.

A mixed data collection: 1,300 adversarial safety prompts from crowdworkers and 800 responsibility prompts from domain experts.

Two evaluation pipelines: human annotation of raw responses and multi‑choice automatic evaluation using 4,312 constructed QA items.

Public release of the benchmark, code, and a 145k paired comparison set (CVALUES-COMPARISON) to support automatic training/evaluation.
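
The multi‑choice pipeline above reduces to an accuracy count over labeled items. The following is a minimal sketch of that scoring step, not the authors' released code: the item schema (`prompt`, `choices`, `label`), the function `multi_choice_accuracy`, and the stand-in predictor `always_first` are all hypothetical names chosen for illustration, and the toy items are not taken from the benchmark.

```python
from typing import Callable, Iterable

# Hypothetical item schema (an assumption, not the released data format):
#   {"prompt": str, "choices": [str, str], "label": int}
# where "label" is the index of the safer / more responsible choice.


def multi_choice_accuracy(
    items: Iterable[dict], predict: Callable[[str, list], int]
) -> float:
    """Return the accuracy (in percent) of `predict` over multi-choice items."""
    total = 0
    correct = 0
    for item in items:
        total += 1
        if predict(item["prompt"], item["choices"]) == item["label"]:
            correct += 1
    if total == 0:
        raise ValueError("no items to score")
    return 100.0 * correct / total


# Toy data for illustration only; not taken from the benchmark.
toy_items = [
    {"prompt": "q1", "choices": ["a", "b"], "label": 0},
    {"prompt": "q2", "choices": ["a", "b"], "label": 1},
    {"prompt": "q3", "choices": ["a", "b"], "label": 0},
    {"prompt": "q4", "choices": ["a", "b"], "label": 1},
]


def always_first(prompt: str, choices: list) -> int:
    """Stand-in for a model call: always picks the first choice."""
    return 0


print(multi_choice_accuracy(toy_items, always_first))  # 50.0
```

In practice `always_first` would be replaced by a function that sends the prompt and both candidate responses to the model under test and parses which option it selected.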

Key Findings

Instruction‑tuned Chinese LLMs score high on human‑annotated safety.

Numbers: ChatGPT 96.9; Chinese‑Alpaca‑Plus‑7B 95.3; ChatGLM‑6B 95 (Table 2)

Responsibility (helpful, empathetic, societally aware answers) is weaker than safety.

Numbers: ChatPLUG‑13B mean responsibility 6.5/10; Law 5.2; Social Science 2.2 (Table 3)

Automatic multi‑choice tests and human evaluation measure different capabilities; models may score well on one and poorly on the other.

Numbers: Ziya‑LLaMA‑13B‑v1.1 automatic Avg* 91.1 vs. human safety score 77.8 (Tables 4 and 2)

Results

Human safety score (proportion safe)

Value: ChatGPT 96.9; Chinese-Alpaca-Plus-7B 95.3; ChatGLM-6B 95; ChatPLUG-13B 94.7; Chinese-LLaMA-13B 53

Human responsibility score (1-10)

Value: ChatPLUG-13B mean 6.5; Environmental Science 8.7; Law 5.2; Social Science 2.2

Accuracy (multi-choice automatic evaluation)

Value: ChatGPT Level-1* 93.6 / Level-2* 92.8; ChatGLM-6B Level-1* 86.5 / Level-2* 74.6; MOSS Avg 45.5
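
The two human metrics above are simple aggregates over annotations: a proportion of responses judged safe, and a mean of 1-10 expert scores. A rough sketch of how such aggregates could be computed is shown below. The helper names (`safety_score`, `responsibility_by_domain`), the input shapes, and the toy annotations are assumptions for illustration; they do not reproduce the paper's annotation files.

```python
from collections import defaultdict
from statistics import mean


def safety_score(labels):
    """Percent of responses annotated as safe. `labels` is a list of booleans."""
    if not labels:
        raise ValueError("no labels")
    return 100.0 * sum(labels) / len(labels)


def responsibility_by_domain(records):
    """Mean 1-10 expert score per domain. `records` is a list of (domain, score)."""
    buckets = defaultdict(list)
    for domain, score in records:
        buckets[domain].append(score)
    return {domain: mean(scores) for domain, scores in buckets.items()}


# Toy annotations for illustration only; not taken from the paper.
toy_labels = [True, True, True, False]
toy_records = [("law", 4), ("law", 6), ("environment", 9)]

print(safety_score(toy_labels))               # 75.0
print(responsibility_by_domain(toy_records))  # {'law': 5, 'environment': 9}
```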

Who Should Care

What To Try In 7 Days

Run the CVALUES multi-choice suite to catch obvious comprehension failures quickly.

Collect 100 domain‑specific responsibility prompts from your product teams and run human review on model outputs.

Add targeted supervision data (expert responses) for the weakest domains (e.g., law, social science) and re-evaluate.
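
One way to pick the weakest domains for the last step is to rank per-domain responsibility means against a cutoff. The sketch below uses the ChatPLUG‑13B figures quoted in this summary; using the overall mean (6.5) as the cutoff is an assumption made here for illustration, and `weakest_domains` is a hypothetical helper, not part of the released code.

```python
# Per-domain responsibility means for ChatPLUG-13B as quoted in this summary (Table 3).
reported_means = {
    "Environmental Science": 8.7,
    "Law": 5.2,
    "Social Science": 2.2,
}

# Overall mean quoted in this summary; using it as the cutoff is an assumption.
OVERALL_MEAN = 6.5


def weakest_domains(domain_means, threshold):
    """Return the domains scoring below `threshold`, lowest score first."""
    below = [(score, name) for name, score in domain_means.items() if score < threshold]
    return [name for score, name in sorted(below)]


print(weakest_domains(reported_means, OVERALL_MEAN))  # ['Social Science', 'Law']
```

With your own review data, `reported_means` would be replaced by the per-domain means from your human scoring, and the cutoff chosen to match your release bar.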

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Responsibility labels were gathered for ChatPLUG‑13B only; cross‑model human responsibility data is limited.
  • Automatic multi‑choice prompts test comprehension, not real generation quality.
  • Expert scoring is subjective and depends on the selected experts and domains.

When Not To Use

  • As the sole check for model safety or responsibility; multi‑choice accuracy alone is not sufficient.
  • For languages or cultures outside Chinese without revalidation.
  • As a replacement for legal compliance review in regulated products.

Failure Modes

  • High multi‑choice accuracy while model still generates unsafe content in free‑form outputs.
  • Models default to over‑helpfulness and provide actionable guidance for illegal or unsafe requests.
  • False rejections: helpful prompts are refused due to conservative safety tuning.

Core Entities

Models

  • ChatGPT
  • ChatGLM-6B
  • BELLE-7B-2M
  • ChatPLUG-3.7B
  • ChatPLUG-13B
  • MOSS
  • Chinese-LLaMA-13B
  • Chinese-Alpaca-Plus-7B
  • Chinese-Alpaca-Plus-13B
  • Ziya-LLaMA-13B

Metrics

  • human safety score (proportion safe)
  • human responsibility score (1-10)
  • Accuracy

Datasets

  • CVALUES
  • CVALUES-COMPARISON
  • 100PoisonMpts

Benchmarks

  • CVALUES (safety & responsibility)