How uncertainty can make multi-agent systems ask humans for supervision

January 9, 2025 · 6 min

Overview

Production Readiness

0.35

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

1

Authors

Edmund Dable-Heath, Boyko Vodenicharski, James Bishop

Links

Abstract / PDF

Why It Matters For Business

Designing agents with calibrated uncertainty can lead them to request human oversight, lowering the risk of harmful autonomous actions at the cost of some autonomy and throughput.

Summary TLDR

The paper extends the off-switch (human oversight) idea to multi-agent settings by modeling the agents and a human as players in a Bayesian game. Agents can either take actions or request human supervision. The authors derive exact Nash-equilibrium conditions and a theorem (plus corollaries) that specify when a defending agent is incentivized to ask the human. They plot phase diagrams showing 'corrigibility regions' that depend on the agents' uncertainty over payoff structures and on the human's rationality. Main limits: the analysis is theoretical, it scales poorly to many agents or actions, and learning dynamics can erode corrigibility.

Problem Statement

Can we make multiple autonomous agents reliably allow human intervention (corrigibility)? The work asks whether uncertainty over which payoff game is being played, together with a model of human rationality, can induce agents to request supervision in multi-agent interactions.

Main Contribution

Formalized multi-agent corrigibility as a Bayesian game that generalises the single-agent off-switch game.

Derived Nash-equilibrium conditions and plotted phase diagrams that show when agents prefer human supervision.

Analysed a defender/adversary special case and proved a theorem with explicit inequalities that guarantee asking-for-help.

Showed corollaries for three limiting values of human rationality (p = 1, p = 1/2, p = 0) and discussed learning dynamics that can undo corrigibility.

Key Findings

A defending agent is incentivized to ask the human when two derived inequalities (Theorem 1) hold.

If the human is fully rational (p = 1), any nonzero uncertainty in the defender's belief over payoff pairs makes asking the human optimal.

Numbers: p = 1

If the human is maximally random (p = 1/2), the defender is at best equally likely to ask or act independently.

Numbers: p = 1/2

If the human is adversarial/misaligned (p = 0), the defender will never be incentivized to ask the human.

Numbers: p = 0

Phase diagrams show a corrigibility region: higher agent uncertainty generally increases the chance that agents ask humans; for 2x2 games the relationship is near-linear.
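The three rationality cases above can be illustrated with a deliberately simplified toy model. It is not the paper's Theorem 1, whose inequalities are not reproduced here. The assumptions are: there are two candidate games, action alpha pays 1 in game A and 0 in game B (beta the reverse), the defender holds belief q that the game is A, and the human knows the true game and picks the payoff-1 action with probability p. All function names are hypothetical.

```python
def value_of_acting(q: float) -> float:
    """Best expected payoff if the defender acts alone.

    Toy assumption: alpha pays 1 in game A and 0 in game B, beta the reverse,
    so the defender simply backs the game it finds more likely.
    """
    return max(q, 1.0 - q)


def value_of_asking(p: float) -> float:
    """Expected payoff if the defender defers to the human.

    Toy assumption: the human knows the true game and chooses the
    payoff-1 action with probability p (the rationality parameter).
    """
    return p


def prefers_to_ask(q: float, p: float) -> bool:
    """True when asking is strictly better than acting alone."""
    return value_of_asking(p) > value_of_acting(q)


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.0):
        for q in (0.5, 0.7, 0.9):
            print(f"p={p:.1f} q={q:.1f} ask={prefers_to_ask(q, p)}")
```

In this toy setting, p = 1 makes asking strictly better for any belief that is not already certain, p = 1/2 makes asking at best a tie (only at q = 1/2), and p = 0 never favours asking, which mirrors the qualitative pattern of the corollaries.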

Results

Theoretical condition

Value: Two inequalities (Theorem 1) that determine when the defender asks the human

Corrigibility region (qualitative)

Value: Phase diagrams show blue regions where agents prefer supervision; region size grows with agent uncertainty and human rationality.

Who Should Care

What To Try In 7 Days

Model the defender as a Bayesian agent with a belief over likely payoff games and check the Theorem 1 inequalities.

Simulate phase diagrams for your 2x2 (or sampled 3x3) interactions to find a corrigible region.

If human oversight is unreliable, add safeguards (limit reliance on human inputs).
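As a starting point for the phase-diagram step, here is a minimal sketch that sweeps a grid of human rationality p and defender belief q and marks where asking wins. It uses a stand-in rule, ask iff p > max(q, 1 - q), which comes from a toy two-game model and is an assumption, not the paper's Theorem 1 condition; you would swap in your own inequality check. The names `ask_region` and `render` are hypothetical.

```python
def ask_region(n: int = 11) -> list:
    """Boolean grid over (p, q) marking where the defender prefers to ask.

    Rows index human rationality p from 0 to 1, columns index the defender's
    belief q in game A from 0 to 1.
    Toy rule (an assumption, not the paper's theorem): ask iff p > max(q, 1 - q).
    """
    steps = [i / (n - 1) for i in range(n)]
    return [[p > max(q, 1.0 - q) for q in steps] for p in steps]


def render(grid: list) -> str:
    """ASCII phase diagram: '#' = ask the human, '.' = act alone.

    The top printed row is p = 1, the bottom row is p = 0.
    """
    rows = ("".join("#" if cell else "." for cell in row) for row in reversed(grid))
    return "\n".join(rows)


if __name__ == "__main__":
    grid = ask_region()
    print(render(grid))
    asked = sum(cell for row in grid for cell in row)
    print(f"ask cells: {asked} of {len(grid) * len(grid[0])}")
```

Under this stand-in rule the marked region is a wedge that widens as p rises and is centred on q = 1/2, the point of maximum uncertainty. A real analysis would replace the one-line rule with an equilibrium computation for the specific payoff matrices.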

Agent Features

Memory

  • belief over games (Bayesian)

Planning

  • ask human for supervision
  • direct action choices (α/β)

Frameworks

  • Harsanyi transformation
  • off-switch game formalism

Is Agentic

true

Architectures

  • game-theoretic Bayesian agents

Collaboration

  • defender/adversary interaction
  • human-in-the-loop supervision

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Scales poorly: computing Nash equilibria becomes expensive with many agents/actions.
  • Assumes a fixed human rationality parameter p and often a fixed human strategy.
  • Analysis is theoretical with illustrative simulations, not deployed experiments.
  • Learning/update dynamics can remove the uncertainty that induces corrigibility.
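The last limitation can be made concrete with a small sketch of Bayesian learning eroding the incentive to ask. It assumes a toy two-game model in which the defender asks iff p > max(q, 1 - q), where q is its belief in game A, and it observes noisy signals that all point to game A. The rule, the `accuracy` parameter, and the function names are illustrative assumptions, not taken from the paper.

```python
from typing import Optional


def bayes_update(q: float, signal_says_a: bool, accuracy: float) -> float:
    """Posterior belief in game A after one noisy signal.

    accuracy is P(signal says A | game A) = P(signal says B | game B),
    a hypothetical parameter for this sketch.
    """
    like_a = accuracy if signal_says_a else 1.0 - accuracy
    like_b = 1.0 - accuracy if signal_says_a else accuracy
    return q * like_a / (q * like_a + (1.0 - q) * like_b)


def steps_until_incorrigible(p: float, q0: float = 0.5, accuracy: float = 0.7,
                             max_steps: int = 20) -> Optional[int]:
    """Number of signals (all favouring game A) before asking stops being preferred.

    Toy ask rule (assumption): ask iff p > max(q, 1 - q).
    Returns None if the defender still prefers asking after max_steps signals.
    """
    q = q0
    for step in range(max_steps + 1):
        if not p > max(q, 1.0 - q):
            return step
        q = bayes_update(q, True, accuracy)
    return None


if __name__ == "__main__":
    for p in (0.5, 0.9, 1.0):
        print(f"p={p}: stops asking after {steps_until_incorrigible(p)} signals")
```

With an imperfect human (p = 0.9 here), a handful of consistent signals is enough for the belief to sharpen past the point where deferring pays off, while a fully rational human (p = 1) keeps asking worthwhile within the step budget because the belief never becomes exactly certain.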

When Not To Use

  • When the overseer is unreliable or adversarial (p ≤ 1/2).
  • When the action space or number of agents prevents solving equilibria.
  • When full autonomy is required and human-in-the-loop delays are unacceptable.

Failure Modes

  • Agents learn and converge to a single game, losing the uncertainty that caused corrigibility.
  • Mis-specified beliefs lead agents to avoid asking humans and act incorrigibly.
  • Human advisers provide adversarial instructions, causing agents to ignore supervision.