Overview
Production Readiness
0.35
Novelty Score
0.65
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
LLMs change both their moral choices and their explanations depending on the ethical framing of the prompt. Choose prompt frames deliberately (Fairness, Altruism, and Virtue performed best in this study) and add explanation-decision consistency checks before using LLMs in policy, legal, or clinical workflows.
Summary TLDR
The authors test 14 leading LLMs across 27 trolley-style dilemmas, each framed by ten ethical philosophies, producing 3,780 responses (binary decisions + justifications). They measure intervention rate, explanation-answer consistency, alignment with aggregated human votes, and sensitivity to irrelevant cues (e.g., kinship, species, bribery). Key findings: reasoning prompts make models more decisive and produce longer explanations but do not always improve alignment with human consensus; Fairness, Altruism, and Virtue prompts form a practical 'sweet spot' that balances action, low contradiction, and closer human alignment; Familial and Lawful frames often produce off-target biases (strong kin
Problem Statement
LLMs increasingly mediate sensitive decisions. This paper asks: how do leading models behave on moral dilemmas, and how do different ethical prompts change decisions and explanations? The aim is to measure decisiveness, explanation fidelity, public alignment, and sensitivity to irrelevant factors using trolley-style scenarios framed by ten moral philosophies.
Main Contribution
Large-scale cross-provider evaluation of 14 LLMs on 27 trolley dilemmas with 10 ethical frames (3,780 queries).
Introduce and report metrics: intervention rate, explanation-answer consistency, KL divergence to aggregated human votes, and contextual bias sensitivity.
Compare reasoning-enabled variants to non-reasoning variants and identify a prompting 'sweet spot' (Fairness, Altruism, Virtue) that balances action, coherence, and human alignment.
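The query count follows from the full cross of models, dilemmas, and frames. A minimal sketch of that arithmetic, using integer indices as stand-ins for the actual models, scenarios, and frames (the function and constant names are illustrative, not taken from the paper):

```python
from itertools import product

# Counts as stated in the summary: 14 models, 27 dilemmas, 10 ethical frames.
N_MODELS, N_DILEMMAS, N_FRAMES = 14, 27, 10


def build_grid():
    """Enumerate every (model, dilemma, frame) combination as index triples."""
    return list(product(range(N_MODELS), range(N_DILEMMAS), range(N_FRAMES)))


if __name__ == "__main__":
    print(len(build_grid()))  # 14 * 27 * 10 = 3780
```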
Key Findings
Reasoning prompts increase decisiveness but do not ensure human alignment.
Fairness, Altruism, and Virtue frames form a practical sweet spot.
Some frames produce strong off-target biases (kinship, bribery, species).
Deontology yields the largest explanation-action conflict.
Aggregate model outputs deviate from human majorities.
Results
Utilitarian frame average Yes rate
Deontology explanation-action conflict rate
Default framing human-alignment (KL divergence)
Who Should Care
What To Try In 7 Days
Run your key prompts through fairness, altruism, and virtue frames and compare Yes rates and explanation conflicts to spot risky shifts.
Add a simple explanation-action consistency check: flag responses where the text justification contradicts the binary decision.
Benchmark your chosen model against a small, domain-relevant human vote set before deploying moral or normative advice.
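The first two steps above could be prototyped roughly as follows. This is a sketch under stated assumptions: each response is a dict with hypothetical keys `frame`, `decision` ("yes" or "no"), and `explanation`, and the stance of an explanation is guessed with a crude keyword heuristic. The cue patterns are invented for illustration and are not the paper's method; a production check would likely need a stronger classifier or human review.

```python
import re
from collections import defaultdict

# Hypothetical cue patterns for trolley-style answers (an assumption).
ACT = re.compile(r"\b(?:would|should|will) (?:pull|intervene|divert)\b", re.I)
REFRAIN = re.compile(r"\b(?:would|should|will) not (?:pull|intervene|divert)\b", re.I)


def stance_from_text(explanation):
    """Return 'yes', 'no', or None when the text is silent or ambiguous."""
    acts = bool(ACT.search(explanation))
    refrains = bool(REFRAIN.search(explanation))
    if acts == refrains:  # neither cue found, or both found
        return None
    return "yes" if acts else "no"


def yes_rate_by_frame(responses):
    """Fraction of 'yes' decisions per ethical frame."""
    counts = defaultdict(lambda: [0, 0])  # frame -> [yes_count, total]
    for r in responses:
        counts[r["frame"]][0] += int(r["decision"] == "yes")
        counts[r["frame"]][1] += 1
    return {frame: yes / total for frame, (yes, total) in counts.items()}


def flag_conflicts(responses):
    """Indices of responses whose explanation contradicts the binary decision."""
    flagged = []
    for i, r in enumerate(responses):
        stance = stance_from_text(r["explanation"])
        if stance is not None and stance != r["decision"]:
            flagged.append(i)
    return flagged
```

Responses where the heuristic returns `None` are not flagged, so the conflict count from this sketch is a lower bound; routing those ambiguous cases to manual review is one option.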
Reproducibility
Data Urls
- Absurd Trolley Problems dataset (public aggregated human votes), as used in the paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- English-only evaluation; no multilingual checks.
- Most models are proprietary; findings may reflect provider policies as much as model internals.
- Safety filters were disabled where possible, which may not reflect production settings.
- Benchmarks use trolley-style dilemmas, which are simplified and may not generalize to complex real-world ethics.
When Not To Use
- Do not use these prompt framings as final safety controls in high-stakes systems without human oversight.
- Avoid relying solely on Yes/No outputs; explanations can contradict decisions.
- Do not assume reasoning-enabled variants are safer or better aligned without domain validation.
Failure Modes
- Reasoning amplifies confident but misaligned answers (overcommitment to abstract principles).
- Rule-based prompts (Deontology, Lawful) can produce high explanation-action conflict.
- Familial and Lawful frames cause large off-target spillovers (kinship and legal rigidity).
- Models can be brittle to small, morally irrelevant changes (species, bribery), producing inconsistent outcomes.
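One simple way to probe the brittleness described in the last item is to pair each baseline scenario with a variant that differs only in a morally irrelevant cue and count how often the decision flips. The sketch below assumes such pairs have already been collected; `flip_rate` is a hypothetical helper and is not the paper's spillover index, whose definition is not given in this summary.

```python
def flip_rate(pairs):
    """Share of (baseline_decision, perturbed_decision) pairs that disagree.

    Each pair holds the model's decision on a baseline dilemma and on the
    same dilemma with one irrelevant cue changed (for example, species).
    """
    if not pairs:
        raise ValueError("need at least one decision pair")
    flips = sum(1 for baseline, perturbed in pairs if baseline != perturbed)
    return flips / len(pairs)
```

A rate near zero suggests the decision is stable under the perturbation; a high rate marks a cue worth auditing before deployment.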
Core Entities
Models
- o4-mini (OpenAI)
- o3 (OpenAI)
- o3-mini (OpenAI)
- GPT-4o (OpenAI)
- Opus 4 (Anthropic)
- Sonnet 4 (Anthropic)
- Sonnet 3.7 (Anthropic)
- Gemini 2.5 Pro (Google DeepMind)
- Grok-3 (xAI)
- Grok-3 Mini (xAI)
- DeepSeek R1 (DeepSeek)
- DeepSeek V3 (DeepSeek)
- Qwen 3 (Alibaba Cloud)
- Qwen 3 (non-reasoning variant)
Metrics
- Intervention Rate (Yes Rate)
- Explanation-Answer Consistency (Conflict Rate)
- KL Divergence to Human Votes
- Contextual Bias Sensitivity (spillover index)
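For a binary Yes/No dilemma, the KL divergence metric can be sketched as below. The direction used here, D_KL(human || model), and the clipping constant `eps` are assumptions; the summary does not state which direction or smoothing the paper uses.

```python
import math


def binary_kl(p_yes, q_yes, eps=1e-6):
    """D_KL(P || Q) for two Bernoulli distributions given their Yes rates.

    p_yes: human Yes share for a dilemma; q_yes: model Yes rate.
    Rates are clipped to [eps, 1 - eps] to avoid log(0).
    """
    p = min(max(p_yes, eps), 1 - eps)
    q = min(max(q_yes, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def mean_kl(human_rates, model_rates):
    """Average the per-dilemma divergence over paired Yes rates."""
    if len(human_rates) != len(model_rates) or not human_rates:
        raise ValueError("need two non-empty lists of equal length")
    pairs = zip(human_rates, model_rates)
    return sum(binary_kl(p, q) for p, q in pairs) / len(human_rates)
```

Lower values mean the model's Yes rates sit closer to the human vote shares; a value of zero means they match exactly.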
Datasets
- Absurd Trolley Problems dataset (public aggregated human votes)
Benchmarks
- 27 trolley-style scenarios × 10 ethical frames

