Argues for hybrid moral alignment: combine explicit moral principles with learning to get safer, adaptable agents

Overview

Decision Snapshot: Needs Validation

The manifesto synthesises literature and small-scale case studies. Concepts are actionable but require more empirical testing and engineering to be production‑ready.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 1/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/2

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 35%

Novelty: 60%

Authors

Elizaveta Tennant, Stephen Hailes, Mirco Musolesi

Why It Matters For Business

Hybrid moral alignment helps build AI that is both controllable (auditable rules) and adaptable (learned behavior), reducing legal, reputational and safety risks in agentic products.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The paper surveys methods for embedding moral values into AI agents and argues that purely rule-based (top-down) or purely learned (bottom-up) systems both have major weaknesses. It proposes hybrid solutions that encode explicit moral principles (rules, prompts or reward formulas) while letting agents learn via reinforcement learning (RL) or fine-tune LLMs. The authors present four case studies (safety-constrained RL, Constitutional AI, intrinsic moral rewards in social dilemmas, and LLM fine-tuning with intrinsic rewards), formalise several intrinsic reward functions, and recommend evaluation metrics (collective payoff, equality, minimum payoff, cooperation). Practical risks include reward‑

Problem Statement

Hard-coded rules lack flexibility and coverage. Pure learning from data lacks guarantees, is sample-inefficient, and is vulnerable to reward-hacking or data poisoning. We need methods that are both controllable and adaptable for agentic AI.

Main Contribution

Systematizes moral-alignment approaches on a continuum from top-down rules to bottom-up learning and highlights the gap between extremes.

Argues for hybrid methods that combine explicit moral principles with learning to gain interpretability and adaptability.

Key Findings

Most existing approaches lie at two extremes: fully top-down rules or fully bottom-up learned preferences.

Practical Use: Prefer hybrid designs that combine explicit principles with learning, rather than relying only on rules or only on data-driven methods.

Evidence Ref: Sections 1-2, Table 1

Hybrid intrinsic-reward methods can drive cooperative policies in social-dilemma simulations.

Numbers: Tennant et al. (2023): moral agents learned cooperative policies given sufficient exploration

Practical Use: In multi-agent simulations, add intrinsic moral rewards (e.g., sum-of-payoffs or equality terms) to encourage cooperation.

Evidence Ref: Sections 3.4.3-3.4.4; Tennant et al. (2023)
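To make the intrinsic-reward idea concrete, here is a minimal sketch of two independent tabular Q-learners playing an iterated Prisoner's Dilemma. It is an illustration under stated assumptions, not the authors' implementation: the payoff values (5, 3, 1, 0), the hyperparameters, the function names, and the exact form of the equality reward are all hypothetical choices made for this example.

```python
import random

# Conventional Prisoner's Dilemma payoffs (assumed values, not from the paper).
C, D = 0, 1
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}


def selfish(mine, theirs):
    """Extrinsic baseline: the agent's own game payoff."""
    return float(mine)


def utilitarian(mine, theirs):
    """Sum-of-payoffs intrinsic reward."""
    return float(mine + theirs)


def equality(mine, theirs):
    """Equality-style reward (assumed form): 1 for equal payoffs, 0 when one side gets everything."""
    total = mine + theirs
    return 1.0 if total == 0 else 1.0 - abs(mine - theirs) / total


def run(reward_fn, steps=5000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Train two independent Q-learners; return the share of cooperative moves."""
    rng = random.Random(seed)
    # State = previous joint action, seen from each agent's own point of view.
    states = [(a, b) for a in (C, D) for b in (C, D)]
    q = [{s: [0.0, 0.0] for s in states} for _ in range(2)]
    state = [(C, C), (C, C)]
    coop_moves = 0
    for _ in range(steps):
        acts = []
        for i in range(2):
            values = q[i][state[i]]
            if rng.random() < eps or values[C] == values[D]:
                acts.append(rng.choice((C, D)))  # explore, or break ties randomly
            else:
                acts.append(C if values[C] > values[D] else D)
        pay = PAYOFF[(acts[0], acts[1])]
        nxt = [(acts[0], acts[1]), (acts[1], acts[0])]
        for i in range(2):
            r = reward_fn(pay[i], pay[1 - i])
            target = r + gamma * max(q[i][nxt[i]])
            q[i][state[i]][acts[i]] += alpha * (target - q[i][state[i]][acts[i]])
        state = nxt
        coop_moves += acts.count(C)
    return coop_moves / (2 * steps)
```

Calling `run(selfish)`, `run(utilitarian)` and `run(equality)` with the same seed gives a rough way to compare cooperation levels across reward definitions; results from a toy setup like this should not be read as reproducing the reported findings.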

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Policy cooperation	Hybrid intrinsic-reward agents learned fully cooperative policies in social-dilemma simulations given sufficient exploration	Selfish / extrinsic-reward agents	—	Iterated 2x2 social-dilemma games (IPD, IVD, ISH) as in Tennant et al. (2023)	Tennant et al. (2023) report moral agents can learn cooperative policies; equality reward learned least efficiently	Section 4.2; Tennant et al. (2023)
Harmlessness of LLM outputs	Improved using Constitutional AI feedback	Vanilla pre-trained LLM	—	Constitutional AI experiments reported by Bai et al. (2022)	Bai et al. (2022) used constitution-based critic models to fine-tune reward models that reduce harmful outputs	Section 3.3

What To Try In 7 Days

Prototype an iterated Prisoner's Dilemma with simple intrinsic rewards (sum-of-payoffs, equality) to observe cooperation dynamics.

Run a safety-constrained RL experiment (action shielding) on a toy control task to see how constraints affect learning.

Apply constitution-style prompts and a small reward model to a chatbot to test harmlessness improvements and overfitting to tokens.
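For the safety-constrained RL suggestion above, the following is a minimal sketch of action shielding on a toy task. Everything in it is an assumption made for illustration: the one-dimensional corridor, the hazard at cell 0, the reward values, and the names `shield` and `train` are hypothetical and do not come from the paper.

```python
import random

N = 6                 # corridor cells 0..N; cell 0 is the hazard, cell N the goal
ACTIONS = (-1, +1)    # move left or right


def step(pos, action):
    """Toy dynamics: move one cell, clipped to the corridor."""
    return max(0, min(N, pos + action))


def is_unsafe(pos):
    return pos == 0


def shield(pos, proposed):
    """Explicit rule: veto an action whose successor is unsafe, else pass it through."""
    if not is_unsafe(step(pos, proposed)):
        return proposed
    safe = [a for a in ACTIONS if not is_unsafe(step(pos, a))]
    return safe[0] if safe else proposed


def train(use_shield, episodes=200, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Tabular Q-learning; returns (q_table, hazard visits during training)."""
    rng = random.Random(seed)
    q = {(p, a): 0.0 for p in range(N + 1) for a in ACTIONS}
    violations = 0
    for _ in range(episodes):
        pos = 1  # start right next to the hazard
        for _ in range(50):
            if rng.random() < eps:
                proposed = rng.choice(ACTIONS)
            else:
                proposed = max(ACTIONS, key=lambda a: q[(pos, a)])
            # The shield sits between the learner and the environment.
            action = shield(pos, proposed) if use_shield else proposed
            nxt = step(pos, action)
            if is_unsafe(nxt):
                reward, done = -10.0, True
                violations += 1
            elif nxt == N:
                reward, done = 1.0, True
            else:
                reward, done = 0.0, False
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Update the action that was actually executed.
            q[(pos, action)] += alpha * (reward + gamma * best_next - q[(pos, action)])
            pos = nxt
            if done:
                break
    return q, violations
```

Comparing `train(True)` with `train(False)` shows the trade-off in miniature: the shielded learner never enters the hazard during training, while the unshielded learner has to discover the penalty by experiencing it. Updating the executed action rather than the proposed one is one of several possible design choices.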

Agent Features

Memory

state-history based short-term memory (previous actions as state)

Planning

RL, on-policy fine-tuning (PPO)

Tool Use

LLM as decision-maker (tokenized actions)

Frameworks

RL, Inverse RL, RLHF, Constitutional AI

Is Agentic

Yes

Architectures

model-free RL agents (Q-learning, DQN), policy-based RL (PPO), LLM agents fine-tuned with RL

Collaboration

decentralised multi-agent learning, partner selection mechanisms

Optimization Features

Training Optimization

multi-objective reward design, curriculum / interim rewards to mitigate sample inefficiency

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Translating rich moral theories to scalar rewards oversimplifies values and contexts.

Few existing hybrid implementations; evidence mainly from controlled simulations and limited LLM experiments.

When Not To Use

High-stakes deployments without human oversight or auditing.

Open-ended tasks where no clear scalar payoff or tokenized actions exist.

Failure Modes

Reward‑hacking: agent optimises unexpected proxies and behaves unsafely.

Data/environment poisoning: adversarial training data corrupts learned values.

Core Entities

Models

Q-learning, DQN, PPO, LLM (general), RLHF, Constitutional AI

Metrics

collective payoff (cumulative return), Gini coefficient (equality), minimum payoff (Rawlsian min), cooperation rate
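The four metrics listed above can be computed from per-agent returns and an action log. The sketch below is one plausible reading, not the paper's evaluation code: the function names are hypothetical, the Gini formula is the standard mean-absolute-difference form, and returns are assumed to be non-negative.

```python
def collective_payoff(returns):
    """Sum of all agents' cumulative returns."""
    return float(sum(returns))


def gini(returns):
    """Gini coefficient: 0 means perfectly equal returns (assumes non-negative values)."""
    n = len(returns)
    total = sum(returns)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in returns for y in returns)
    return diff_sum / (2 * n * total)


def minimum_payoff(returns):
    """Rawlsian minimum: the return of the worst-off agent."""
    return min(returns)


def cooperation_rate(actions, cooperate="C"):
    """Share of logged actions that are cooperative."""
    return sum(1 for a in actions if a == cooperate) / len(actions)
```

For example, with returns `[0, 5]` the collective payoff is 5.0, the Gini coefficient is 0.5 and the minimum payoff is 0. Equality is sometimes reported as one minus the Gini coefficient, so it is worth stating which convention a given result uses.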
