Overview
The benchmark is a useful first step with public data and code, but it is English-only and uses simplified binary and pairwise formats that limit real-world readiness.
Citations: 7
Evidence Strength: 0.60
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 30%
Production readiness: 30%
Novelty: 50%
Why It Matters For Business
MoralBench gives a repeatable way to compare LLMs on human moral judgments; use it to screen models before deploying them in ethically sensitive features.
Summary TLDR
The authors introduce MoralBench, a public benchmark that measures how closely LLMs match human moral judgments. They adapt two psychology tools, the Moral Foundations Questionnaire (MFQ-30) and the Moral Foundations Vignettes (MFV), into MFQ-30-LLM and MFV-LLM. Evaluation uses two modes: binary Agree/Disagree answers mapped to human average scores, and pairwise comparative choices scored by which statement humans rated higher. They test five models (Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4). Results vary by model and by task format: LLaMA-2 tops the MFQ-30 binary task (58.5 total), GPT-4 leads MFV binary (52.8), and GPT-3.5 scores highest in comparative tasks (e.g., 14.2 on MFV comparative). Key caveats: English-only, binary answers co
Problem Statement
There is no standard, systematic way to measure whether LLMs reflect human moral judgments. The paper builds a benchmark to quantify LLM alignment with human moral norms using established psychology tools.
Main Contribution
Design and release of MoralBench, a benchmark for LLM moral identity built from the MFQ-30 questionnaire and the MFV vignettes.
Two evaluation modes: binary Agree/Disagree mapped to human average scores, and pairwise comparative choices judged by higher human ratings.
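The two evaluation modes can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the names `BinaryItem`, `score_binary`, and `score_pairwise` are hypothetical, human ratings are assumed to be rescaled to the range [0, 1], and both the rule that a Disagree answer earns the complement of the human mean and the half-credit rule for ties are assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class BinaryItem:
    """One statement plus the mean human rating (assumed rescaled to [0, 1])."""
    statement: str
    human_mean: float


def score_binary(item: BinaryItem, model_agrees: bool) -> float:
    # Assumption: Agree earns the human mean, Disagree earns its complement,
    # so a model is rewarded for siding with the human majority.
    return item.human_mean if model_agrees else 1.0 - item.human_mean


def score_pairwise(human_a: float, human_b: float, model_picks_a: bool) -> float:
    # The model earns a point when it picks the statement humans rated higher.
    if human_a == human_b:
        return 0.5  # Assumption: ties earn half credit either way.
    human_prefers_a = human_a > human_b
    return 1.0 if model_picks_a == human_prefers_a else 0.0


item = BinaryItem("Hypothetical example statement.", human_mean=0.75)
print(score_binary(item, model_agrees=True))
print(score_pairwise(4.2, 1.3, model_picks_a=False))
```

Summing these per-item scores over a questionnaire would give a total comparable in spirit to the totals reported in the paper, though the actual scale and aggregation may differ.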
Key Findings
On the binary MFQ-30-LLM test, LLaMA-2 achieved the highest total moral score.
On the binary MFV-LLM test, GPT-4 achieved the highest total moral score.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MFQ-30-LLM total (binary) | LLaMA-2 58.5; GPT-4 56.6; GPT-3.5 54.7; Zephyr 54.2; Gemma-1.1 49.9 | — | — | MFQ-30-LLM (binary) | Table 1 MFQ-30-LLM totals | Table 1 |
| MFV-LLM total (binary) | GPT-4 52.8; LLaMA-2 52.6; GPT-3.5 50.3; Zephyr 48.1; Gemma-1.1 44.4 | — | — | MFV-LLM (binary) | Table 1 MFV-LLM totals | Table 1 |
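The Delta column in the table is empty, so gaps between models have to be derived from the totals. The sketch below computes each model's gap to the row leader using only the Table 1 totals quoted in the table; the helper name `gaps_to_leader` is hypothetical, and the gaps say nothing about statistical significance.

```python
# Totals copied from the Results table (Table 1, binary mode).
MFQ30_BINARY = {"LLaMA-2": 58.5, "GPT-4": 56.6, "GPT-3.5": 54.7,
                "Zephyr": 54.2, "Gemma-1.1": 49.9}
MFV_BINARY = {"GPT-4": 52.8, "LLaMA-2": 52.6, "GPT-3.5": 50.3,
              "Zephyr": 48.1, "Gemma-1.1": 44.4}


def gaps_to_leader(scores: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Return the top-scoring model and every model's gap to it (<= 0)."""
    leader = max(scores, key=scores.get)
    top = scores[leader]
    # Round to one decimal to match the precision of the reported totals.
    gaps = {model: round(score - top, 1) for model, score in scores.items()}
    return leader, gaps


for name, table in [("MFQ-30-LLM", MFQ30_BINARY), ("MFV-LLM", MFV_BINARY)]:
    leader, gaps = gaps_to_leader(table)
    print(name, "leader:", leader, gaps)
```

On these numbers the top two models are separated by 1.9 points on MFQ-30-LLM and by only 0.2 points on MFV-LLM, which is worth keeping in mind before treating the leader in either row as clearly ahead.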
What To Try In 7 Days
Run MoralBench (MFQ-30-LLM and MFV-LLM) on your candidate LLMs to get baseline moral alignment.
Compare binary vs pairwise scores to detect superficial keyword matching versus deeper choice consistency.
If you rely on non-English users, do not use MoralBench as-is; plan to extend or translate the dataset first.
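One way to act on the binary-versus-pairwise comparison is to check how far each model's rank moves between the two formats. The sketch below is only an illustration: the function names `rank_models` and `rank_shift` are hypothetical, and the scores for `model_a`, `model_b`, and `model_c` are made-up placeholders, not results from the paper.

```python
def rank_models(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shift(binary: dict[str, float], pairwise: dict[str, float]) -> dict[str, int]:
    """Pairwise rank minus binary rank; positive means worse on pairwise."""
    binary_rank = rank_models(binary)
    pairwise_rank = rank_models(pairwise)
    return {m: pairwise_rank[m] - binary_rank[m] for m in binary if m in pairwise}


# Placeholder scores for illustration only (not from the paper).
binary_scores = {"model_a": 55.0, "model_b": 50.0, "model_c": 45.0}
pairwise_scores = {"model_a": 10.0, "model_b": 14.0, "model_c": 12.0}

print(rank_shift(binary_scores, pairwise_scores))
```

A model that ranks well on binary items but drops sharply on pairwise ones is a candidate for the keyword-matching failure mode and deserves a closer manual look.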
Risks & Boundaries
Limitations
English-only benchmark; multilingual performance unknown
Binary Agree/Disagree reduces nuance of human Likert responses
When Not To Use
When you need multilingual moral evaluation without translation
When you require full moral reasoning explanations rather than choice alignment
Failure Modes
Models may exploit keywords to get high binary scores without true understanding
High performance on one format (binary) may not transfer to comparative judgments

