Overview
The manifesto synthesises literature and small-scale case studies. Concepts are actionable but require more empirical testing and engineering to be production‑ready.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 1/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/2
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 45%
Production readiness: 35%
Novelty: 60%
Why It Matters For Business
Hybrid moral alignment helps build AI that is both controllable (auditable rules) and adaptable (learned behavior), reducing legal, reputational and safety risks in agentic products.
Summary TLDR
The paper surveys methods for embedding moral values into AI agents and argues that purely rule-based (top-down) and purely learned (bottom-up) systems each have major weaknesses. It proposes hybrid solutions that encode explicit moral principles (as rules, prompts or reward formulas) while still letting agents learn, either through reinforcement learning (RL) or through LLM fine-tuning. The authors present four case studies (safety-constrained RL, Constitutional AI, intrinsic moral rewards in social dilemmas, and LLM fine-tuning with intrinsic rewards), formalise several intrinsic reward functions, and recommend evaluation metrics (collective payoff, equality, minimum payoff, cooperation). Practical risks include reward‑
Problem Statement
Hard-coded rules lack flexibility and coverage. Pure learning from data lacks guarantees, is sample-inefficient, and is vulnerable to reward-hacking or data poisoning. We need methods that are both controllable and adaptable for agentic AI.
Main Contribution
Systematises moral-alignment approaches along a continuum from top-down rules to bottom-up learning and highlights the gap between the two extremes.
Argues for hybrid methods that combine explicit moral principles with learning to gain interpretability and adaptability.
Key Findings
Most existing approaches lie at two extremes: fully top-down rules or fully bottom-up learned preferences.
Hybrid intrinsic-reward methods can drive cooperative policies in social-dilemma simulations.
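To make the idea of an intrinsic moral reward concrete, here is a minimal sketch for a one-shot Prisoner's Dilemma. The payoff values, the function names and the exact form of the equality reward are assumptions chosen for illustration; they are not taken from the paper or from Tennant et al. (2023).

```python
# Illustrative Prisoner's Dilemma payoffs (assumed values, not from the paper).
# Keys are (my_action, other_action); values are (my_payoff, other_payoff).
C, D = 0, 1
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}


def extrinsic_reward(my_action, other_action):
    """Selfish baseline: the agent is rewarded with its own game payoff."""
    return PAYOFFS[(my_action, other_action)][0]


def sum_of_payoffs_reward(my_action, other_action):
    """Intrinsic reward equal to the collective payoff of both players."""
    mine, theirs = PAYOFFS[(my_action, other_action)]
    return mine + theirs


def equality_reward(my_action, other_action):
    """Intrinsic reward that is 1 for equal payoffs and falls as the gap grows.

    Assumed form: 1 - |mine - theirs| / (mine + theirs).
    """
    mine, theirs = PAYOFFS[(my_action, other_action)]
    total = mine + theirs
    if total == 0:
        return 1.0  # both payoffs are zero, so they are trivially equal
    return 1.0 - abs(mine - theirs) / total


def best_response(reward_fn, other_action):
    """Action that maximises reward_fn against a fixed opponent action."""
    return max((C, D), key=lambda a: reward_fn(a, other_action))
```

Under these assumed payoffs, the selfish agent's best response to a cooperator is to defect, while the sum-of-payoffs agent prefers to cooperate against either opponent action. The equality agent mirrors its opponent, which hints at why it may need more exploration before settling on mutual cooperation.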
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Policy cooperation | Hybrid intrinsic-reward agents learned fully cooperative policies in social-dilemma simulations given sufficient exploration | Selfish / extrinsic-reward agents | — | Iterated 2x2 social-dilemma games (IPD, IVD, ISH) as in Tennant et al. (2023) | Tennant et al. (2023) report that moral agents can learn cooperative policies; agents with the equality reward learned least efficiently | Section 4.2; Tennant et al. (2023) |
| Harmlessness of LLM outputs | Improved using Constitutional AI feedback | Vanilla pre-trained LLM | — | Constitutional AI experiments reported by Bai et al. (2022) | Bai et al. (2022) used constitution-based critic models to fine-tune reward models that reduce harmful outputs | Section 3.3 |
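
The evaluation metrics named in the summary (collective payoff, equality, minimum payoff, cooperation) could be computed from a logged two-player episode roughly as follows. This is a sketch under assumed definitions: the episode format, the `evaluate` function name and the normalised-gap form of equality are hypothetical, and the paper may define the metrics differently.

```python
def evaluate(episode):
    """Compute social-outcome metrics for a two-player episode.

    episode: list of rounds, each ((action_a, action_b), (payoff_a, payoff_b)),
             where actions are the strings "C" (cooperate) or "D" (defect).
    """
    if not episode:
        raise ValueError("episode must contain at least one round")

    total_a = sum(payoffs[0] for _, payoffs in episode)
    total_b = sum(payoffs[1] for _, payoffs in episode)
    collective = total_a + total_b

    # Assumed definition: 1 minus the payoff gap normalised by the total.
    if collective == 0:
        equality = 1.0
    else:
        equality = 1.0 - abs(total_a - total_b) / collective

    # Fraction of all individual moves that were cooperative.
    coop_moves = sum(actions.count("C") for actions, _ in episode)
    cooperation = coop_moves / (2 * len(episode))

    return {
        "collective_payoff": collective,
        "equality": equality,
        "minimum_payoff": min(total_a, total_b),
        "cooperation": cooperation,
    }


# Example: one round of mutual cooperation, then player B defects.
example = [(("C", "C"), (3, 3)), (("C", "D"), (0, 5))]
metrics = evaluate(example)
```

Reporting these four numbers for both the hybrid agent and the selfish baseline would also supply the explicit deltas that the table above currently lacks.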
What To Try In 7 Days
Prototype an iterated Prisoner's Dilemma with simple intrinsic rewards (sum-of-payoffs, equality) to observe cooperation dynamics.
Run a safety-constrained RL experiment (action shielding) on a toy control task to see how constraints affect learning.
Apply constitution-style prompts and a small reward model to a chatbot to test harmlessness improvements and overfitting to tokens.
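
For the safety-constrained RL item, action shielding can be sketched on a toy corridor task. Everything here is an assumption made for illustration (the corridor layout, the hazard cell, the fallback rule, and the random stand-in for a learned policy); it is not the setup used in the paper's case study.

```python
import random

UNSAFE = {0}        # hypothetical hazard cell at the left end of the corridor
GOAL = 5            # reaching this cell ends the episode
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Deterministic corridor dynamics, clamped to the range [0, GOAL]."""
    return max(0, min(GOAL, state + action))


def safe_actions(state):
    """Actions whose successor state is not in the unsafe set."""
    return [a for a in ACTIONS if step(state, a) not in UNSAFE]


def shielded(state, proposed):
    """Pass the proposed action through if safe, else substitute a safe one."""
    allowed = safe_actions(state)
    if not allowed:
        raise RuntimeError(f"no safe action available in state {state}")
    return proposed if proposed in allowed else allowed[0]


def run_episode(seed, start=1, max_steps=50):
    """Roll out a random policy behind the shield; return the visited states."""
    rng = random.Random(seed)
    state, visited = start, [start]
    for _ in range(max_steps):
        proposed = rng.choice(ACTIONS)  # stand-in for a learned policy
        state = step(state, shielded(state, proposed))
        visited.append(state)
        if state == GOAL:
            break
    return visited
```

Because the shield overrides the policy only when the proposed action would enter the hazard, the agent remains free to explore everywhere else. Comparing learning curves with and without the shield is one way to see how the constraint affects learning.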
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Translating rich moral theories to scalar rewards oversimplifies values and contexts.
Few hybrid implementations exist; the evidence comes mainly from controlled simulations and limited LLM experiments.
When Not To Use
High-stakes deployments without human oversight or auditing.
Open-ended tasks where no clear scalar payoff or tokenized actions exist.
Failure Modes
Reward‑hacking: the agent optimises an unintended proxy for the specified reward and behaves unsafely.
Data/environment poisoning: adversarial training data corrupts learned values.

