MoralBench: a public benchmark that scores LLMs on moral statements using human-rated questionnaires and vignette pairs

Overview

Decision Snapshot: Needs Validation

The benchmark is a useful first step with public data and code, but it is English-only and uses simplified binary and pairwise formats that limit real-world readiness.

Citations: 7

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 30%

Production readiness: 30%

Novelty: 50%

Authors

Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

MoralBench gives a repeatable way to compare LLMs on human moral judgments; use it to screen models before deploying them in ethically sensitive features.

Who Should Care

Product Manager, ML Engineer, Founder, CTO

Summary TLDR

The authors introduce MoralBench, a public benchmark that measures how closely LLMs match human moral judgments. They adapt two psychology tools, the MFQ-30 questionnaire and the MFV vignettes, into MFQ-30-LLM and MFV-LLM. Evaluation uses two modes: binary Agree/Disagree answers mapped to human average scores, and pairwise comparative choices scored by which statement humans rated higher. They test five models (Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4). Results vary by model and by task format: LLaMA-2 tops the MFQ-30 binary task (58.5 total), GPT-4 leads the MFV binary task (52.8), and GPT-3.5 scores highest in comparative tasks (e.g., 14.2 on MFV comparative). Key caveats: English-only, binary answers co

Problem Statement

There is no standard, systematic way to measure whether LLMs reflect human moral judgments. The paper builds a benchmark to quantify LLM alignment with human moral norms using established psychology tools.

Main Contribution

Design and release of MoralBench, a benchmark for LLM moral identity built from the MFQ-30 questionnaire and the MFV vignettes.

Two evaluation modes: binary Agree/Disagree mapped to human average scores, and pairwise comparative choices judged by higher human ratings.
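The two modes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Statement` class, the 0 to 5 rating scale, the rule that "Disagree" earns the complement of the human mean, and the 1/0 pairwise score are all assumptions made for illustration. Check the released code for the actual scoring rules.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    """A moral statement with its mean human rating (hypothetical structure)."""
    text: str
    human_mean: float  # assumed to lie on a 0..scale_max scale


def binary_score(statement: Statement, model_answer: str, scale_max: float = 5.0) -> float:
    """Score one Agree/Disagree answer against the human mean.

    Assumption: "Agree" earns the human mean, "Disagree" earns the
    complement (scale_max - human_mean).
    """
    answer = model_answer.strip().lower()
    if answer == "agree":
        return statement.human_mean
    if answer == "disagree":
        return scale_max - statement.human_mean
    raise ValueError(f"expected 'Agree' or 'Disagree', got {model_answer!r}")


def pairwise_score(a: Statement, b: Statement, model_choice: str) -> float:
    """Return 1.0 if the model picked the statement humans rated higher, else 0.0.

    Assumption: a tie in human ratings scores 0.0 for either choice.
    """
    if model_choice not in ("A", "B"):
        raise ValueError(f"expected 'A' or 'B', got {model_choice!r}")
    chosen, other = (a, b) if model_choice == "A" else (b, a)
    return 1.0 if chosen.human_mean > other.human_mean else 0.0
```

Summing `binary_score` over all statements would give a total of the kind reported as "total moral score", under the assumptions above.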

Key Findings

On the binary MFQ-30-LLM test, LLaMA-2 achieved the highest total moral score.

Numbers: Total = 58.5 (Table 1, MFQ-30-LLM)

Practical Use: If you need an off-the-shelf model that most closely matches human MFQ responses in binary format, test LLaMA-2 first.

Evidence Ref: Table 1

On the binary MFV-LLM test, GPT-4 achieved the highest total moral score.

Numbers: Total = 52.8 (Table 1, MFV-LLM)

Practical Use: For vignette-style moral scenarios in binary evaluation, GPT-4 showed the strongest alignment with the human-rated vignettes here.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MFQ-30-LLM total (binary)	LLaMA-2 58.5; GPT-4 56.6; GPT-3.5 54.7; Zephyr 54.2; Gemma-1.1 49.9	—	—	MFQ-30-LLM (binary)	Table 1 MFQ-30-LLM totals	Table 1
MFV-LLM total (binary)	GPT-4 52.8; LLaMA-2 52.6; GPT-3.5 50.3; Zephyr 48.1; Gemma-1.1 44.4	—	—	MFV-LLM (binary)	Table 1 MFV-LLM totals	Table 1
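
The Delta column above is empty because no baseline is reported. One rough way to read the totals is each model's gap to the task leader, sketched below in Python. The totals are copied from the table; the function name `gaps_to_leader` and the choice of the leader as the reference point are illustrative assumptions, not part of the benchmark.

```python
# Binary totals copied from the Results table (Table 1).
mfq_binary = {"LLaMA-2": 58.5, "GPT-4": 56.6, "GPT-3.5": 54.7, "Zephyr": 54.2, "Gemma-1.1": 49.9}
mfv_binary = {"GPT-4": 52.8, "LLaMA-2": 52.6, "GPT-3.5": 50.3, "Zephyr": 48.1, "Gemma-1.1": 44.4}


def gaps_to_leader(totals: dict[str, float]) -> dict[str, float]:
    """Return each model's shortfall from the highest total, rounded to 0.1."""
    best = max(totals.values())
    return {model: round(best - score, 1) for model, score in totals.items()}


if __name__ == "__main__":
    for name, totals in (("MFQ-30-LLM", mfq_binary), ("MFV-LLM", mfv_binary)):
        print(name, gaps_to_leader(totals))
```

On these numbers GPT-4 leads LLaMA-2 on MFV-LLM by only 0.2 points, while LLaMA-2 leads GPT-4 on MFQ-30-LLM by 1.9 points, so the top two positions are close in both tasks.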

What To Try In 7 Days

Run MoralBench (MFQ-30-LLM and MFV-LLM) on your candidate LLMs to get baseline moral alignment.

Compare binary vs pairwise scores to detect superficial keyword matching versus deeper choice consistency.

If you rely on non-English users, do not use MoralBench as-is; plan to extend or translate the dataset first.
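
The binary-versus-pairwise comparison suggested above can be sketched as a rank comparison between two score tables. The helper names `rank_positions` and `rank_shifts` are hypothetical. Because this summary does not list the comparative totals, the example call uses the two binary columns from Table 1 as stand-in inputs; substitute your own binary and comparative totals.

```python
def rank_positions(totals: dict[str, float]) -> dict[str, int]:
    """Map each model to its 1-based rank, highest total first.

    Ties are broken arbitrarily; the example data contains none.
    """
    ordered = sorted(totals, key=lambda model: totals[model], reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shifts(format_a: dict[str, float], format_b: dict[str, float]) -> dict[str, int]:
    """Rank in format_b minus rank in format_a, for models present in both.

    A positive value means the model ranks worse in format_b.
    """
    shared = set(format_a) & set(format_b)
    ranks_a = rank_positions({m: format_a[m] for m in shared})
    ranks_b = rank_positions({m: format_b[m] for m in shared})
    return {m: ranks_b[m] - ranks_a[m] for m in sorted(shared)}


if __name__ == "__main__":
    # Stand-in inputs: binary totals from Table 1, not comparative scores.
    mfq_binary = {"LLaMA-2": 58.5, "GPT-4": 56.6, "GPT-3.5": 54.7, "Zephyr": 54.2, "Gemma-1.1": 49.9}
    mfv_binary = {"GPT-4": 52.8, "LLaMA-2": 52.6, "GPT-3.5": 50.3, "Zephyr": 48.1, "Gemma-1.1": 44.4}
    print(rank_shifts(mfq_binary, mfv_binary))
```

A model whose rank drops sharply when moving from binary to pairwise scoring is a candidate for the keyword-matching failure mode and is worth inspecting by hand.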

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/agiresearch/MoralBench

Data URLs

https://drive.google.com/drive/u/0/folders/1k93YZJserYc2CkqP8d4B3M3sgd3kA8W7

Risks & Boundaries

Limitations

English-only benchmark; multilingual performance unknown

Binary Agree/Disagree reduces nuance of human Likert responses

When Not To Use

When you need multilingual moral evaluation without translation

When you require full moral reasoning explanations rather than choice alignment

Failure Modes

Models may exploit keywords to get high binary scores without true understanding

High performance on one format (binary) may not transfer to comparative judgments

Core Entities

Models

Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4

Metrics

total moral score, binary agreement mapped to human mean, Accuracy

Datasets

MFQ-30-LLM, MFV-LLM, MoralBench

Benchmarks

MoralBench
