MoralBench: a public benchmark that scores LLMs on moral statements using human-rated questionnaires and vignette pairs

June 6, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.3

Citation Count

7

Authors

Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang

Links

Abstract / PDF

Why It Matters For Business

MoralBench gives a repeatable way to compare LLMs on human moral judgments; use it to screen models before deploying them in ethically sensitive features.

Summary TLDR

The authors introduce MoralBench, a public benchmark that measures how closely LLMs match human moral judgments. They adapt two psychology tools, the MFQ-30 questionnaire and the MFV vignettes, into MFQ-30-LLM and MFV-LLM. Evaluation uses two modes: binary Agree/Disagree answers mapped to human average scores, and pairwise comparative choices scored by which statement humans rated higher. They test five models (Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4). Results vary by model and by task format: LLaMA-2 tops the MFQ-30 binary task (58.5 total), GPT-4 leads the MFV binary task (52.8), and GPT-3.5 scores highest on both comparative tasks (12.4 on MFQ-30, 14.2 on MFV). Key caveats: English-only, binary answers co

Problem Statement

There is no standard, systematic way to measure whether LLMs reflect human moral judgments. The paper builds a benchmark to quantify LLM alignment with human moral norms using established psychology tools.

Main Contribution

Design and release of MoralBench, a benchmark for LLM moral identity built from the MFQ-30 questionnaire and the MFV vignettes.

Two evaluation modes: binary Agree/Disagree mapped to human average scores, and pairwise comparative choices judged by higher human ratings.

A reproducible evaluation across five LLMs with basic analysis of where models agree or disagree with human judgments.
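The two evaluation modes can be sketched roughly as follows. This is a minimal illustration, not the authors' scoring code. The mapping from an Agree/Disagree answer to a score is an assumption (Agree earns the human mean rating, Disagree earns the remainder of the scale), and the function names, the default scale maximum of 5, and the sample ratings are all hypothetical.

```python
def binary_score(answer: str, human_mean: float, scale_max: float = 5.0) -> float:
    """Score one Agree/Disagree answer against the human mean rating.

    Assumption: "agree" earns the human mean and "disagree" earns what is
    left of the scale. The benchmark's actual mapping may differ.
    """
    normalized = answer.strip().lower()
    if normalized == "agree":
        return human_mean
    if normalized == "disagree":
        return scale_max - human_mean
    raise ValueError(f"expected 'Agree' or 'Disagree', got {answer!r}")


def comparative_score(choice: str, human_mean_a: float, human_mean_b: float) -> int:
    """Return 1 if the model picked the statement humans rated higher, else 0."""
    if choice not in ("A", "B"):
        raise ValueError(f"expected 'A' or 'B', got {choice!r}")
    human_pick = "A" if human_mean_a >= human_mean_b else "B"
    return int(choice == human_pick)


# Hypothetical ratings, for illustration only.
print(binary_score("Agree", human_mean=4.0))
print(binary_score("Disagree", human_mean=4.0))
print(comparative_score("B", human_mean_a=2.0, human_mean_b=3.5))
```

Summing `binary_score` over all statements would give a total of the kind reported for the binary tasks, and summing `comparative_score` over all pairs would give a comparative total, again under the assumptions stated above.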

Key Findings

On the binary MFQ-30-LLM test, LLaMA-2 achieved the highest total moral score.

Numbers: Total = 58.5 (Table 1, MFQ-30-LLM)

On the binary MFV-LLM test, GPT-4 achieved the highest total moral score.

Numbers: Total = 52.8 (Table 1, MFV-LLM)

In the pairwise comparative tasks, GPT-3.5 scored highest on both datasets.

Numbers: MFV comparative total = 14.2; MFQ comparative total = 12.4 (Table 2)

Performance varies by task format: high binary scores do not guarantee high comparative performance.

Numbers: Models with high binary totals (e.g., LLaMA-2) are not always top in comparative totals (Table 1 vs. Table 2)

Results

MFQ-30-LLM total (binary)

Value: LLaMA-2 58.5; GPT-4 56.6; GPT-3.5 54.7; Zephyr 54.2; Gemma-1.1 49.9

MFV-LLM total (binary)

Value: GPT-4 52.8; LLaMA-2 52.6; GPT-3.5 50.3; Zephyr 48.1; Gemma-1.1 44.4

Comparative total (MFQ-30)

Value: GPT-3.5 12.4; GPT-4 9.8; Gemma-1.1 9.6; Zephyr 8.2; LLaMA-2 8.0

Comparative total (MFV)

Value: GPT-3.5 14.2; GPT-4 13.8; LLaMA-2 13.2; Gemma-1.1 10.8; Zephyr 10.4
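To make the format dependence easier to see, the four sets of totals reported above can be turned into per-task ranks. The snippet below is a small illustrative sketch; the dictionary layout and the `ranks` helper are not part of MoralBench, and only the numbers come from the results listed here.

```python
# Totals as reported in the Results section of this summary.
totals = {
    "MFQ binary": {"LLaMA-2": 58.5, "GPT-4": 56.6, "GPT-3.5": 54.7,
                   "Zephyr": 54.2, "Gemma-1.1": 49.9},
    "MFV binary": {"GPT-4": 52.8, "LLaMA-2": 52.6, "GPT-3.5": 50.3,
                   "Zephyr": 48.1, "Gemma-1.1": 44.4},
    "MFQ comparative": {"GPT-3.5": 12.4, "GPT-4": 9.8, "Gemma-1.1": 9.6,
                        "Zephyr": 8.2, "LLaMA-2": 8.0},
    "MFV comparative": {"GPT-3.5": 14.2, "GPT-4": 13.8, "LLaMA-2": 13.2,
                        "Gemma-1.1": 10.8, "Zephyr": 10.4},
}


def ranks(scores: dict) -> dict:
    """Map each model to its rank (1 = highest total) within one task."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


rank_table = {task: ranks(scores) for task, scores in totals.items()}

# Print one row per model with its rank in each task.
for model in totals["MFQ binary"]:
    row = ", ".join(f"{task}: {rank_table[task][model]}" for task in totals)
    print(f"{model:10s} {row}")
```

Read this way, LLaMA-2 ranks first on the MFQ-30 binary task but last on the MFQ-30 comparative task, while GPT-3.5 moves from third on both binary tasks to first on both comparative tasks.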

Who Should Care

What To Try In 7 Days

Run MoralBench (MFQ-30-LLM and MFV-LLM) on your candidate LLMs to get a baseline for moral alignment.

Compare binary vs pairwise scores to detect superficial keyword matching versus deeper choice consistency.

If you rely on non-English users, do not use MoralBench as-is; plan to extend or translate the dataset first.
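A first run on a candidate model might look something like the sketch below. It does not use the MoralBench code itself: the prompt wording, the `parse_answer` and `collect_answers` helpers, the placeholder statements, and the stub model are all hypothetical, and the stub should be replaced by a function that calls your own model.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical prompt wording; the benchmark's own prompts may differ.
PROMPT = (
    "Statement: {statement}\n"
    "Do you agree or disagree with this statement? "
    "Answer with exactly one word: Agree or Disagree."
)


def parse_answer(reply: str) -> Optional[str]:
    """Return 'agree' or 'disagree' from the first word of a reply, else None."""
    tokens = reply.strip().lower().split()
    if not tokens:
        return None
    first = tokens[0].strip(".,:;!\"'")
    return first if first in ("agree", "disagree") else None


def collect_answers(model: Callable[[str], str],
                    statements: Iterable[str]) -> Dict[str, Optional[str]]:
    """Ask the model about each statement and keep the parsed answer."""
    answers = {}
    for statement in statements:
        reply = model(PROMPT.format(statement=statement))
        answers[statement] = parse_answer(reply)
    return answers


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always agrees."""
    return "Agree."


answers = collect_answers(
    stub_model, ["Placeholder statement one.", "Placeholder statement two."]
)
unparsed = [s for s, a in answers.items() if a is None]
print(answers)
print(f"{len(unparsed)} replies could not be parsed")
```

Tracking unparsed replies separately is a deliberate choice here: a model that refuses or hedges should not be silently counted as agreeing or disagreeing.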

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • English-only benchmark; multilingual performance unknown
  • Binary Agree/Disagree reduces nuance of human Likert responses
  • Scores anchor to average human responses; any bias in those humans transfers to model scores
  • No experiments on downstream real-world tasks or high-stakes deployment

When Not To Use

  • When you need multilingual moral evaluation without translation
  • When you require full moral reasoning explanations rather than choice alignment
  • For high-stakes decisions without human oversight

Failure Modes

  • Models may exploit keywords to get high binary scores without true understanding
  • High performance on one format (binary) may not transfer to comparative judgments
  • Results reflect the specific human sample and questionnaire design used

Core Entities

Models

  • Zephyr
  • LLaMA-2
  • Gemma-1.1
  • GPT-3.5
  • GPT-4

Metrics

  • total moral score
  • binary agreement mapped to human mean
  • Accuracy

Datasets

  • MFQ-30-LLM
  • MFV-LLM
  • MoralBench

Benchmarks

  • MoralBench