MoralBench: a public benchmark that scores LLMs on moral statements using human-rated questionnaires and vignette pairs

June 6, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is a useful first step with public data and code, but it is English-only and uses simplified binary and pairwise formats that limit real-world readiness.

Citations: 7

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 30%

Novelty: 50%

Authors

Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang

Why It Matters For Business

MoralBench gives a repeatable way to compare LLMs on human moral judgments; use it to screen models before deploying them in ethically sensitive features.

Summary TLDR

The authors introduce MoralBench, a public benchmark that measures how closely LLMs match human moral judgments. They adapt two psychology instruments, the 30-item Moral Foundations Questionnaire (MFQ-30) and the Moral Foundations Vignettes (MFV), into MFQ-30-LLM and MFV-LLM. Evaluation uses two modes: binary Agree/Disagree answers mapped to human average scores, and pairwise comparative choices scored by which statement humans rated higher. They test five models (Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4). Results vary by model and by task format: LLaMA-2 tops the MFQ-30 binary task (58.5 total), GPT-4 leads MFV binary (52.8), and GPT-3.5 scores highest in comparative tasks (e.g., 14.2 on MFV comparative). Key caveats: English-only, binary answers co

Problem Statement

There is no standard, systematic way to measure whether LLMs reflect human moral judgments. The paper builds a benchmark to quantify LLM alignment with human moral norms using established psychology tools.

Main Contribution

Design and release of MoralBench, a benchmark for LLM moral identity built from the MFQ-30 questionnaire and the MFV vignettes.

Two evaluation modes: binary Agree/Disagree mapped to human average scores, and pairwise comparative choices judged by higher human ratings.
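
The two evaluation modes can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that in binary mode "Agree" earns the human mean rating and "Disagree" earns the scale maximum minus that mean, and that in comparative mode a model earns one point for picking the statement humans rated higher. The function names, the default 0 to 5 scale, and the tie handling are hypothetical and chosen for illustration only.

```python
from typing import Literal

BinaryAnswer = Literal["agree", "disagree"]
PairChoice = Literal["a", "b"]


def binary_score(answer: BinaryAnswer, human_mean: float, scale_max: float = 5.0) -> float:
    """Assumed rule: 'agree' earns the human mean, 'disagree' earns the remainder of the scale."""
    if not 0.0 <= human_mean <= scale_max:
        raise ValueError("human_mean must lie within the rating scale")
    if answer == "agree":
        return human_mean
    if answer == "disagree":
        return scale_max - human_mean
    raise ValueError(f"unexpected answer: {answer!r}")


def comparative_score(choice: PairChoice, human_mean_a: float, human_mean_b: float) -> float:
    """Assumed rule: one point when the model picks the statement humans rated higher."""
    if choice not in ("a", "b"):
        raise ValueError(f"unexpected choice: {choice!r}")
    if human_mean_a == human_mean_b:
        return 0.0  # assumption: a tie in human ratings earns no point
    preferred = "a" if human_mean_a > human_mean_b else "b"
    return 1.0 if choice == preferred else 0.0


def total_score(item_scores: list[float]) -> float:
    """Sum per-item scores into a single total for one model."""
    return sum(item_scores)


if __name__ == "__main__":
    # Illustrative answers and human means; these are not benchmark data.
    answers = [("agree", 4.2), ("disagree", 1.0), ("agree", 0.8)]
    print(total_score([binary_score(a, mean) for a, mean in answers]))
    print(comparative_score("a", human_mean_a=3.5, human_mean_b=2.0))
```

Under these assumptions, a model that agrees with highly rated statements and disagrees with low-rated ones accumulates a higher total, which is the intuition behind "mapped to human average scores". Check the released code for the exact rule before relying on absolute numbers.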

Key Findings

On the binary MFQ-30-LLM test, LLaMA-2 achieved the highest total moral score.

Numbers: Total = 58.5 (Table 1, MFQ-30-LLM)

Practical Use: If you need an off-the-shelf model that most closely matches human MFQ responses in binary format, test LLaMA-2 first.

Evidence Ref: Table 1

On the binary MFV-LLM test, GPT-4 achieved the highest total moral score.

Numbers: Total = 52.8 (Table 1, MFV-LLM)

Practical Use: For vignette-style moral scenarios in binary evaluation, GPT-4 showed the strongest alignment with the human-rated vignettes here.

Evidence Ref: Table 1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MFQ-30-LLM total (binary) | LLaMA-2 58.5; GPT-4 56.6; GPT-3.5 54.7; Zephyr 54.2; Gemma-1.1 49.9 | not reported | not reported | MFQ-30-LLM (binary) | Table 1 MFQ-30-LLM totals | Table 1
MFV-LLM total (binary) | GPT-4 52.8; LLaMA-2 52.6; GPT-3.5 50.3; Zephyr 48.1; Gemma-1.1 44.4 | not reported | not reported | MFV-LLM (binary) | Table 1 MFV-LLM totals | Table 1
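
Since the results list no explicit deltas, a short script can derive each model's gap to the leader from the reported totals. The sketch below copies the Table 1 totals from the rows above; the helper name `rank_with_gap` is hypothetical, and the gap is a simple difference in total score, not a statistic reported by the authors.

```python
# Totals copied from the Table 1 rows reported in the Results section.
TABLE_1 = {
    "MFQ-30-LLM (binary)": {
        "LLaMA-2": 58.5, "GPT-4": 56.6, "GPT-3.5": 54.7, "Zephyr": 54.2, "Gemma-1.1": 49.9,
    },
    "MFV-LLM (binary)": {
        "GPT-4": 52.8, "LLaMA-2": 52.6, "GPT-3.5": 50.3, "Zephyr": 48.1, "Gemma-1.1": 44.4,
    },
}


def rank_with_gap(totals: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (model, total, gap_to_leader) tuples sorted from highest to lowest total."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    best = ordered[0][1]
    # Round to one decimal to match the precision of the reported totals.
    return [(model, total, round(best - total, 1)) for model, total in ordered]


if __name__ == "__main__":
    for dataset, totals in TABLE_1.items():
        print(dataset)
        for model, total, gap in rank_with_gap(totals):
            print(f"  {model:<10} {total:5.1f}  (gap to leader: {gap:.1f})")
```

The leader's margin differs sharply between the two datasets (1.9 points on MFQ-30-LLM versus 0.2 on MFV-LLM), so small gaps should be read with caution given that no variance estimates appear here.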

What To Try In 7 Days

Run MoralBench (MFQ-30-LLM and MFV-LLM) on your candidate LLMs to get baseline moral alignment.

Compare binary vs pairwise scores to detect superficial keyword matching versus deeper choice consistency.

If you rely on non-English users, do not use MoralBench as-is; plan to extend or translate the dataset first.
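
One way to compare formats is to check whether two score tables rank the models in the same order. The sketch below computes a Spearman rank correlation in plain Python. Per-model comparative totals are not listed on this page, so the demo uses the two binary totals from Table 1 as stand-in inputs; substitute your own binary and pairwise totals. The function names are hypothetical, and the formula used assumes no tied scores.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def spearman_rho(scores_x: dict[str, float], scores_y: dict[str, float]) -> float:
    """Spearman rank correlation over the models present in both tables (no ties assumed)."""
    models = sorted(set(scores_x) & set(scores_y))
    n = len(models)
    if n < 2:
        raise ValueError("need at least two shared models")
    rank_x = ranks({m: scores_x[m] for m in models})
    rank_y = ranks({m: scores_y[m] for m in models})
    squared_diffs = sum((rank_x[m] - rank_y[m]) ** 2 for m in models)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


if __name__ == "__main__":
    # Stand-in inputs: binary totals from Table 1, not binary-vs-pairwise scores.
    mfq_binary = {"LLaMA-2": 58.5, "GPT-4": 56.6, "GPT-3.5": 54.7, "Zephyr": 54.2, "Gemma-1.1": 49.9}
    mfv_binary = {"GPT-4": 52.8, "LLaMA-2": 52.6, "GPT-3.5": 50.3, "Zephyr": 48.1, "Gemma-1.1": 44.4}
    print(f"rank agreement: {spearman_rho(mfq_binary, mfv_binary):.2f}")
```

A value near 1 means both formats order the models the same way; a low or negative value on your own binary and pairwise totals would be a signal to inspect individual items for keyword-driven answers.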

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

English-only benchmark; multilingual performance unknown

Binary Agree/Disagree reduces nuance of human Likert responses

When Not To Use

When you need multilingual moral evaluation without translation

When you require full moral reasoning explanations rather than choice alignment

Failure Modes

Models may exploit keywords to get high binary scores without true understanding

High performance on one format (binary) may not transfer to comparative judgments

Core Entities

Models

Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4

Metrics

total moral score, binary agreement mapped to human mean, Accuracy

Datasets

MFQ-30-LLM, MFV-LLM, MoralBench

Benchmarks

MoralBench