LLMs show predictable moral shifts under different ethical prompts; fairness, altruism, and virtue prompts hit a practical 'sweet spot'

August 10, 2025 · 8 min

Overview

Production Readiness

0.35

Novelty Score

0.65

Cost Impact Score

0.4

Citation Count

2

Authors

Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu, Jiaojiao Jiang, Yuekang Li

Why It Matters For Business

LLMs change their moral choices and explanations depending on ethical prompts; pick prompt frames (fairness/altruism/virtue) and add consistency checks before using LLMs in policy, legal, or clinical workflows.

Summary TLDR

The authors test 14 leading LLMs across 27 trolley-style dilemmas, each framed by ten ethical philosophies, producing 3,780 responses (binary decisions + justifications). They measure intervention rate, explanation-answer consistency, alignment with aggregated human votes, and sensitivity to irrelevant cues (e.g., kinship, species, bribery). Key findings: reasoning prompts make models more decisive and produce longer explanations but do not always improve alignment with human consensus; Fairness, Altruism, and Virtue prompts form a practical 'sweet spot' that balances action, low contradiction, and closer human alignment; Familial and Lawful frames often produce off-target biases (strong kin

Problem Statement

LLMs increasingly mediate sensitive decisions. This paper asks: how do leading models behave on moral dilemmas, and how do different ethical prompts change decisions and explanations? The aim is to measure decisiveness, explanation fidelity, public alignment, and sensitivity to irrelevant factors using trolley-style scenarios framed by ten moral philosophies.

Main Contribution

Large-scale cross-provider evaluation of 14 LLMs on 27 trolley dilemmas with 10 ethical frames (3,780 queries).
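The query count follows from the full cross product of models, dilemmas, and frames. A minimal sketch of that grid is below; the identifiers are placeholders, since the summary does not reproduce the actual dilemma texts or frame prompts.

```python
from itertools import product

N_MODELS = 14    # models evaluated, per the summary
N_DILEMMAS = 27  # trolley-style scenarios
N_FRAMES = 10    # ethical frames

# Placeholder identifiers (assumptions); the real names and prompts are not shown here.
models = [f"model_{i}" for i in range(N_MODELS)]
dilemmas = [f"dilemma_{i}" for i in range(N_DILEMMAS)]
frames = [f"frame_{i}" for i in range(N_FRAMES)]

# One query per (model, dilemma, frame) combination.
grid = list(product(models, dilemmas, frames))
print(len(grid))  # 14 * 27 * 10 = 3780
```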

Introduce and report metrics: intervention rate, explanation-answer consistency, KL divergence to aggregated human votes, and contextual bias sensitivity.
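For the KL metric, one plausible reading is a per-dilemma divergence between the human Yes/No vote split and the model Yes/No split, averaged over dilemmas. The sketch below assumes that reading. The direction (human relative to model), the natural-log base, the clipping constant, and the function names are all assumptions, not details confirmed by the summary.

```python
import math

def bernoulli_kl(p: float, q: float, eps: float = 1e-6) -> float:
    """KL(P || Q) in nats between two yes/no distributions, given their Yes probabilities."""
    # Clip away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def mean_kl(human_yes, model_yes):
    """Average per-dilemma KL between human and model Yes fractions."""
    pairs = list(zip(human_yes, model_yes))
    if not pairs:
        raise ValueError("need at least one dilemma")
    return sum(bernoulli_kl(h, m) for h, m in pairs) / len(pairs)

# Hypothetical Yes fractions for three dilemmas (illustrative only, not data from the paper).
human = [0.80, 0.25, 0.60]
model = [0.95, 0.50, 0.60]
score = mean_kl(human, model)
```

Identical distributions give a score of 0, and the score grows as the model's Yes fraction moves away from the human vote split.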

Compare reasoning-enabled variants to non-reasoning variants and identify a prompting 'sweet spot' (Fairness, Altruism, Virtue) that balances action, coherence, and human alignment.

Key Findings

Reasoning prompts increase decisiveness but do not ensure human alignment.

Numbers: Reasoning variants raise Yes rates (e.g., +7 pp for Qwen/Gemini), but the best public match is ~59%

Fairness, Altruism, and Virtue frames form a practical sweet spot.

Numbers: Fairness: 67% Yes, 6% conflict, KL = 0.68; Altruism: 76% Yes, 6% conflict, KL = 0.72; Virtue: 80% Yes, 5% conflict, KL = 0.73

Some frames produce strong off-target biases (kinship, bribery, species).

Numbers: Familial frame: 31% Yes, 75% bribery acceptance; Familial and Lawful spillover index up to 0.75

Deontology yields the largest explanation-action conflict.

Numbers: Deontology conflict rate = 14% (highest across frames)

Aggregate model outputs deviate from human majorities.

Numbers: Default framing average: 66% Yes, 8% conflict, KL = 0.76; top models match the human majority ≈59% of the time

Results

Utilitarian frame average Yes rate

Value: 82%

Deontology explanation-action conflict rate

Value: 14%

Default framing human-alignment (KL divergence)

Value: 0.76

Who Should Care

What To Try In 7 Days

Run your key prompts through fairness, altruism, and virtue frames and compare Yes rates and explanation conflicts to spot risky shifts.

Add a simple explanation-action consistency check: flag responses where the text justification contradicts the binary decision.

Benchmark your chosen model against a small, domain-relevant human vote set before deploying moral or normative advice.
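The first two steps above can be sketched as a small per-frame report. This is a minimal illustration, assuming each response already carries a stance label read off its justification (by a human reviewer or a separate classifier). The `Response` fields and `frame_report` name are hypothetical, and the sample records are made up.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Response:
    frame: str                # ethical frame used in the prompt
    decision: bool            # binary answer: True means "Yes, intervene"
    explanation_stance: bool  # stance implied by the justification (assumed to be labeled upstream)

def frame_report(responses):
    """Return the Yes rate and explanation-action conflict rate for each frame."""
    by_frame = defaultdict(list)
    for r in responses:
        by_frame[r.frame].append(r)
    report = {}
    for frame, rs in by_frame.items():
        n = len(rs)
        report[frame] = {
            "yes_rate": sum(r.decision for r in rs) / n,
            # A conflict is a justification that argues against the stated decision.
            "conflict_rate": sum(r.decision != r.explanation_stance for r in rs) / n,
        }
    return report

# Made-up sample records, for illustration only.
sample = [
    Response("fairness", True, True),
    Response("fairness", False, False),
    Response("deontology", True, False),
    Response("deontology", True, True),
]
report = frame_report(sample)
```

Large gaps in Yes rate between frames, or a conflict rate well above the others, mark the prompts worth reviewing by hand.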

Reproducibility

Data URLs

  • Absurd Trolley Problems dataset (public aggregated human votes) as used in paper

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • English-only evaluation; no multilingual checks.
  • Most models are proprietary; findings may reflect provider policies as much as model internals.
  • Safety filters were disabled where possible, which may not reflect production settings.
  • Benchmarks use trolley-style dilemmas, which are simplified and may not generalize to complex real-world ethics.

When Not To Use

  • Do not use these prompt framings as final safety controls in high-stakes systems without human oversight.
  • Avoid relying solely on Yes/No outputs; explanations can contradict decisions.
  • Do not assume reasoning-enabled variants are safer or better aligned without domain validation.

Failure Modes

  • Reasoning amplifies confident but misaligned answers (overcommitment to abstract principles).
  • Rule-based prompts (Deontology, Lawful) can produce high explanation-action conflict.
  • Familial and Lawful frames cause large off-target spillovers (kinship and legal rigidity).
  • Models can be brittle to small, morally irrelevant changes (species, bribery), producing inconsistent outcomes.
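One simple way to probe the brittleness described in the last item is a paired test: ask each dilemma once as written and once with a morally irrelevant detail changed, then count how often the decision flips. The sketch below is an assumption-laden stand-in, not the paper's spillover index, whose definition is not given in this summary. The `flip_rate` name and the sample pairs are hypothetical.

```python
def flip_rate(pairs):
    """Fraction of (base, perturbed) decision pairs where the answer changed."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(base != perturbed for base, perturbed in pairs) / len(pairs)

# Made-up decisions: (answer to original dilemma, answer after an irrelevant change).
pairs = [(True, True), (True, False), (False, False), (False, True)]
rate = flip_rate(pairs)
print(rate)  # 0.5
```

A flip rate near zero suggests the model ignores the irrelevant cue; a high rate suggests the cue is driving the outcome.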

Core Entities

Models

  • o4-mini (OpenAI)
  • o3 (OpenAI)
  • o3-mini (OpenAI)
  • GPT-4o (OpenAI)
  • Opus 4 (Anthropic)
  • Sonnet 4 (Anthropic)
  • Sonnet 3.7 (Anthropic)
  • Gemini 2.5 Pro (Google DeepMind)
  • Grok-3 (xAI)
  • Grok-3 Mini (xAI)
  • DeepSeek R1 (DeepSeek)
  • DeepSeek V3 (DeepSeek)
  • Qwen 3 (Alibaba Cloud)
  • Qwen 3 (non-reasoning variant)

Metrics

  • Intervention Rate (Yes Rate)
  • Explanation-Answer Consistency (Conflict Rate)
  • KL Divergence to Human Votes
  • Contextual Bias Sensitivity (spillover index)

Datasets

  • Absurd Trolley Problems dataset (public aggregated human votes)

Benchmarks

  • 27 trolley-style scenarios × 10 ethical frames