Overview
Clear analytic conditions and diagrams make short games actionable; scaling and learning dynamics limit direct production use.
Citations: 1
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/2
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 30%
Production readiness: 35%
Novelty: 60%
Why It Matters For Business
Designing agents with calibrated uncertainty can force them to request human oversight, lowering risk of harmful autonomous actions while trading off autonomy and throughput.
Summary TLDR
The paper extends the off-switch (human oversight) idea to multi-agent settings by modeling the agents and a human as players in a Bayesian game. Agents can either act or request human supervision. The authors derive exact Nash-equilibrium conditions and a theorem (plus corollaries) that specify when a defending agent is incentivized to ask the human. They plot phase diagrams showing 'corrigibility regions' that depend on the agents' uncertainty over payoff structures and on the human's rationality. Main limits: the analysis is theoretical, it scales poorly to many agents or actions, and learning dynamics can erode corrigibility.
Problem Statement
Can we make multiple autonomous agents reliably allow human intervention (corrigibility)? The work asks whether uncertainty over which payoff game is being played, together with a model of human rationality, can induce agents to request supervision in multi-agent interactions.
Main Contribution
Formalized multi-agent corrigibility as a Bayesian game that generalises the single-agent off-switch game.
Derived Nash-equilibrium conditions and plotted phase diagrams that show when agents prefer human supervision.
Key Findings
A defending agent is incentivized to ask the human when two derived inequalities (Theorem 1) hold.
If the human is fully rational (p = 1), any nonzero uncertainty in the defender's belief over payoff pairs makes asking the human optimal.
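The p = 1 finding can be illustrated with a toy single-agent analogue. The sketch below is not the paper's Theorem 1 (inequalities (13)-(14) are not reproduced here). It assumes a hypothetical defender who holds a belief over the payoff `u` of one candidate action and can act (payoff `u`), stand down (payoff 0), or ask a human who makes the correct allow/block call with probability `p`. All function names and payoff values are illustrative assumptions.

```python
# Toy single-agent analogue, NOT the paper's Theorem 1.
# belief: list of (payoff_u, probability) pairs over the action's payoff.

def value_act(belief):
    """Expected payoff of acting without oversight."""
    return sum(prob * u for u, prob in belief)

def value_ask(belief, p):
    """Expected payoff of asking a human who decides correctly with prob. p.

    If u > 0 the correct call is 'allow' (payoff u); if u < 0 it is 'block'
    (payoff 0). A wrong call yields the other outcome.
    """
    return sum(
        prob * (p * max(u, 0.0) + (1.0 - p) * min(u, 0.0))
        for u, prob in belief
    )

def prefers_asking(belief, p, tol=1e-12):
    """True when asking strictly beats both acting and standing down."""
    best_alone = max(value_act(belief), 0.0)
    return value_ask(belief, p) > best_alone + tol

uncertain = [(1.0, 0.9), (-1.0, 0.1)]  # mostly beneficial, some risk
certain = [(1.0, 1.0)]                 # no uncertainty at all
```

Under these assumptions, `prefers_asking(uncertain, 1.0)` holds while `prefers_asking(certain, 1.0)` does not, which mirrors the qualitative claim: with a fully rational human, asking only gains value when the belief puts weight on both signs of the payoff.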
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Theoretical condition | Two inequalities (Theorem 1) that determine when defender asks human | — | — | — | Theorem 1, equations (13)-(14) | Theorem 1 |
| Corrigibility region (qualitative) | Phase diagrams show blue regions where agents prefer supervision; region size grows with agent uncertainty and human r | — | — | 2x2 and sampled 3x3 symmetric games | Figures 1–4 phase diagrams | Figures 1-4 |
What To Try In 7 Days
Model the defender as a Bayesian agent with a belief over the likely payoff games, then evaluate the Theorem 1 inequalities for that belief.
Simulate phase diagrams for your 2x2 (or sampled 3x3) interactions to find a corrigible region.
If human oversight is unreliable, add safeguards (limit reliance on human inputs).
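A phase-diagram sweep of the kind suggested above can be prototyped before the real game is modeled. The sketch below is a stand-in under stated assumptions: it uses a toy single-agent model (payoff +1 with probability `1 - q`, payoff -1 with probability `q`, human correct with probability `p`) rather than the paper's 2x2 or 3x3 games, and the function names are hypothetical. Replacing `ask_is_best` with a check of the Theorem 1 inequalities would give the actual diagram.

```python
# Toy phase diagram over (q, p); the decision rule is an assumed stand-in,
# not the paper's equilibrium condition.

def ask_is_best(q, p, tol=1e-9):
    """True when asking strictly beats acting and standing down."""
    act = (1.0 - q) - q                      # expected payoff of acting
    ask = (1.0 - q) * p - q * (1.0 - p)      # human errs with prob. 1 - p
    return ask > max(act, 0.0) + tol

def phase_grid(steps=11):
    """Rows run from p = 1 (top) to p = 0 (bottom); columns from q = 0 to 1."""
    levels = [i / (steps - 1) for i in range(steps)]
    return [[ask_is_best(q, p) for q in levels] for p in reversed(levels)]

def render(grid):
    """Text rendering: '#' marks the region where asking is preferred."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in grid
    )
```

Calling `print(render(phase_grid()))` draws the region as text. In this toy model the region is `p > max(q, 1 - q)`, so it is empty at `p = 0.5` and widest at `p = 1`.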
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Scales poorly: computing Nash equilibria becomes expensive with many agents/actions.
Assumes a fixed human rationality parameter p and often a fixed human strategy.
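The scaling limitation can be made concrete with a quick count. In a normal-form game with `n_agents` players and `n_actions` actions each, the number of joint pure-action profiles is `n_actions ** n_agents`, and one payoff table stores one value per agent per profile. The helper names below are hypothetical, and the count covers a single payoff game rather than the full Bayesian game.

```python
# Size of one normal-form payoff table; helper names are illustrative.

def joint_profiles(n_agents, n_actions):
    """Number of joint pure-action profiles."""
    return n_actions ** n_agents

def payoff_entries(n_agents, n_actions):
    """One payoff per agent for every joint profile."""
    return n_agents * joint_profiles(n_agents, n_actions)

for n in (2, 5, 10):
    print(n, joint_profiles(n, 3), payoff_entries(n, 3))
```

The exponential growth in the number of profiles is why any equilibrium computation that enumerates them becomes impractical beyond small games.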
When Not To Use
When the overseer is unreliable or adversarial (p ≤ 1/2).
When the action space or number of agents prevents solving equilibria.
Failure Modes
Agents learn and converge to a single game, losing the uncertainty that caused corrigibility.
Mis-specified beliefs lead agents to avoid asking humans and act incorrigibly.
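The first failure mode can be illustrated with a toy Bayesian update. The sketch assumes a single-agent stand-in for the paper's setting: the action pays +1 or -1, `q` is the agent's belief that it is harmful, the human is correct with probability `p`, and every observation is a "benign" signal from a sensor with a hypothetical accuracy. It only shows how a shrinking `q` removes the incentive to ask and does not model the paper's learning dynamics.

```python
# Toy illustration of learning eroding the incentive to ask.
# Model, names, and parameter values are assumptions.

def advantage_of_asking(q, p):
    """Value of asking minus the best option without the human."""
    act = 1.0 - 2.0 * q
    ask = (1.0 - q) * p - q * (1.0 - p)
    return ask - max(act, 0.0)

def update_on_benign_signal(q, accuracy):
    """Bayes update of P(harmful) after one signal that says 'benign'."""
    harmful = q * (1.0 - accuracy)      # signal is wrong
    benign = (1.0 - q) * accuracy       # signal is right
    return harmful / (harmful + benign)

def advantage_trajectory(q0, p, accuracy, n_obs):
    """Advantage of asking before and after each of n_obs benign signals."""
    q, out = q0, []
    for _ in range(n_obs + 1):
        out.append(advantage_of_asking(q, p))
        q = update_on_benign_signal(q, accuracy)
    return out

trajectory = advantage_trajectory(q0=0.5, p=0.9, accuracy=0.8, n_obs=3)
```

With these illustrative numbers the advantage starts positive and turns negative after the second observation: once the belief concentrates, an imperfect human (`p < 1`) is no longer worth consulting in this toy model.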

