Overview
Production Readiness
0.3
Novelty Score
0.5
Cost Impact Score
0.3
Citation Count
7
Why It Matters For Business
MoralBench gives a repeatable way to compare LLMs on human moral judgments; use it to screen models before deploying them in ethically sensitive features.
Summary TLDR
The authors introduce MoralBench, a public benchmark that measures how closely LLMs match human moral judgments. They adapt two psychology tools, the MFQ-30 questionnaire and the MFV vignettes, into MFQ-30-LLM and MFV-LLM. Evaluation uses two modes: binary Agree/Disagree answers mapped to human average scores, and pairwise comparative choices scored by which statement humans rated higher. They test five models (Zephyr, LLaMA-2, Gemma-1.1, GPT-3.5, GPT-4). Results vary by model and by task format: LLaMA-2 tops the MFQ-30 binary task (58.5 total), GPT-4 leads the MFV binary task (52.8), and GPT-3.5 scores highest in comparative tasks (e.g., 14.2 on MFV comparative). Key caveats: English-only, binary answers co
Problem Statement
There is no standard, systematic way to measure whether LLMs reflect human moral judgments. The paper builds a benchmark to quantify LLM alignment with human moral norms using established psychology tools.
Main Contribution
Design and release of MoralBench, a benchmark for LLM moral identity built from the MFQ-30 questionnaire and the MFV vignettes.
Two evaluation modes: binary Agree/Disagree mapped to human average scores, and pairwise comparative choices judged by higher human ratings.
A reproducible evaluation across five LLMs with basic analysis of where models agree or disagree with human judgments.
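The two evaluation modes can be sketched roughly as follows. This is a minimal illustration, not the authors' scoring code: the `BinaryItem` type, the rescaling of human means to [0, 1], and the rule that Disagree earns the complement of the human mean are all assumptions made for the example, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class BinaryItem:
    statement: str
    human_mean: float  # assumed: mean human rating rescaled to [0, 1]


def binary_score(item: BinaryItem, answer: str) -> float:
    """Score one Agree/Disagree answer against the human average.

    Assumed rule: Agree earns the human mean, Disagree earns its complement.
    """
    if answer not in ("Agree", "Disagree"):
        raise ValueError(f"unexpected answer: {answer!r}")
    return item.human_mean if answer == "Agree" else 1.0 - item.human_mean


def comparative_score(human_a: float, human_b: float, choice: str) -> float:
    """Score one pairwise choice: 1 point for picking the statement
    that humans rated higher, 0 otherwise (ties default to "A")."""
    if choice not in ("A", "B"):
        raise ValueError(f"unexpected choice: {choice!r}")
    preferred = "A" if human_a >= human_b else "B"
    return 1.0 if choice == preferred else 0.0


# Toy usage with made-up ratings.
item = BinaryItem("Compassion for those who are suffering matters.", 0.75)
total = binary_score(item, "Agree") + comparative_score(4.5, 1.5, "A")
```

A model's total for a dataset would then be the sum of these per-item scores, which is why binary and comparative totals sit on different scales and should not be compared directly.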
Key Findings
On the binary MFQ-30-LLM test, LLaMA-2 achieved the highest total moral score.
On the binary MFV-LLM test, GPT-4 achieved the highest total moral score.
In pairwise comparative tasks, GPT-3.5 scored best overall across datasets.
Performance varies by task format: high binary scores do not guarantee high comparative performance.
Results
MFQ-30-LLM total (binary): highest is LLaMA-2 (58.5)
MFV-LLM total (binary): highest is GPT-4 (52.8)
Comparative total (MFQ-30)
Comparative total (MFV): highest is GPT-3.5 (14.2)
Who Should Care
What To Try In 7 Days
Run MoralBench (MFQ-30-LLM and MFV-LLM) on your candidate LLMs to get baseline moral alignment.
Compare binary vs pairwise scores to detect superficial keyword matching versus deeper choice consistency.
If you serve non-English-speaking users, do not use MoralBench as-is; plan to translate or extend the dataset first.
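One way to compare binary and pairwise results is to look at how each model's rank changes between the two formats, since the raw totals are on different scales. The sketch below assumes you have already collected per-model totals; the helper names `rank` and `rank_shifts`, the model names, and the scores are all hypothetical.

```python
def rank(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shifts(binary: dict[str, float],
                comparative: dict[str, float]) -> dict[str, int]:
    """Return comparative rank minus binary rank for each model.

    A positive value means the model ranks worse on the comparative
    format than on the binary one.
    """
    if binary.keys() != comparative.keys():
        raise ValueError("both formats must cover the same models")
    binary_rank, comparative_rank = rank(binary), rank(comparative)
    return {m: comparative_rank[m] - binary_rank[m] for m in binary}


# Made-up totals for three hypothetical candidate models.
binary_totals = {"model_a": 50.0, "model_b": 40.0, "model_c": 30.0}
comparative_totals = {"model_a": 5.0, "model_b": 12.0, "model_c": 9.0}
shifts = rank_shifts(binary_totals, comparative_totals)
```

A model with a large positive shift does well on Agree/Disagree items but poorly on forced choices, which is the pattern worth inspecting by hand for keyword matching.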
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- English-only benchmark; multilingual performance unknown
- Binary Agree/Disagree reduces nuance of human Likert responses
- Scores anchor to average human responses; any bias in those humans transfers to model scores
- No experiments on downstream real-world tasks or high-stakes deployment
When Not To Use
- When you need multilingual moral evaluation without translation
- When you require full moral reasoning explanations rather than choice alignment
- For high-stakes decisions without human oversight
Failure Modes
- Models may exploit keywords to get high binary scores without true understanding
- High performance on one format (binary) may not transfer to comparative judgments
- Results reflect the specific human sample and questionnaire design used
Core Entities
Models
- Zephyr
- LLaMA-2
- Gemma-1.1
- GPT-3.5
- GPT-4
Metrics
- total moral score
- binary agreement mapped to human mean
- Accuracy
Datasets
- MFQ-30-LLM
- MFV-LLM
- MoralBench
Benchmarks
- MoralBench

