MI9: a runtime governance layer that monitors and intervenes in agentic AI behavior

August 5, 2025 (7 min read)

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

0

Authors

Charles L. Wang, Trisha Singhal, Ameya Kelkar, Jason Tuo

Links

Abstract / PDF

Why It Matters For Business

MI9 turns opaque agent decisions into actionable runtime controls, reducing undetected risky behaviors while keeping false alarms low, which helps prevent costly operational and compliance incidents.

Summary TLDR

MI9 is a vendor-neutral runtime governance layer for agentic AI. It defines six coordinated components that observe, score, and intervene in live agent behavior: an Agency-Risk Index (ARI), an agent-semantic telemetry schema (ATS), continuous authorization monitoring (CAM), an FSM-based conformance engine, goal-conditioned drift detection, and a four-level graduated containment system. Evaluated on 1,033 synthetic multi-domain scenarios, MI9 reports a 99.81% detection rate and a 0.0121% false positive rate, but the traces are synthetic and the results depend on full instrumentation of agent internals. The authors open-source the prompts and scripts used in their evaluation.

Problem Statement

Agentic AI (models that plan, revise goals, use tools, and spawn subagents) can develop risky behaviors that are visible only during execution. Pre-deployment controls and infrastructure-level monitoring miss cognitive events such as goal changes, tool chaining, and delegation. Organizations need a runtime layer that observes agent semantics and can intervene without breaking ongoing operations.

Main Contribution

A unified runtime governance framework (MI9) integrating telemetry, authorization, conformance, drift detection, and containment.

Agency-Risk Index (ARI) to tier agents by autonomy, adaptability, and continuity.

Agentic Telemetry Schema (ATS) for cognitive, action, and coordination events.

Continuous Authorization Monitoring (CAM) that revokes/adjusts permissions based on agent context and delegation provenance.

Real-time FSM conformance engine for sequence-and-time-based policy enforcement.

Goal-conditioned drift detection to separate legitimate adaptation from suspicious change.

Graduated containment with four levels (monitor → planning restriction → tool restriction → isolation) to preserve continuity.

Synthetic evaluation across 1,033 scenarios and open-sourced prompts/scripts for reproducibility.
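
The four containment levels listed above form an ordered ladder, which can be sketched as a small enum with an escalation rule. This is a minimal illustration, not MI9's implementation: the `severity` score, the 0.5 and 0.9 thresholds, and the function name `next_level` are assumptions, since the source does not say how escalation decisions are made.

```python
from enum import IntEnum


class ContainmentLevel(IntEnum):
    """The four levels named in the text, from least to most restrictive."""
    MONITOR = 1
    PLANNING_RESTRICTION = 2
    TOOL_RESTRICTION = 3
    ISOLATION = 4


def next_level(current: ContainmentLevel, severity: float) -> ContainmentLevel:
    """Pick the containment level to apply after a violation.

    `severity` in [0, 1] and both thresholds are hypothetical values
    chosen for illustration only.
    """
    if severity >= 0.9:
        return ContainmentLevel.ISOLATION  # critical: jump straight to isolation
    if severity >= 0.5:
        # Escalate one step, never past ISOLATION.
        return ContainmentLevel(min(current + 1, ContainmentLevel.ISOLATION))
    return current  # low severity: keep the current level


level = ContainmentLevel.MONITOR
for severity in (0.2, 0.6, 0.7, 0.95):
    level = next_level(level, severity)
    print(f"severity={severity:.2f} -> {level.name}")
```

Escalating one step at a time for mid-severity violations is what lets an agent keep working under restriction instead of being shut down at the first alert, which matches the stated goal of preserving continuity.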

Key Findings

MI9 detects nearly all simulated governance violations on evaluated traces.

Numbers: Detection Rate 99.81% (MI9) vs 93.98% (OT) vs 68.52% (LS)

MI9 achieves very low false alarms while keeping high detection.

Numbers: False Positive Rate 0.0121% (MI9) vs 0.0242% (OT)

MI9 provides stronger actionable intelligence: clearer causes and earlier warnings.

Numbers: Causal chain clarity 0.822, predictive alerting 0.672, proactive intervention 0.578

Evaluation is synthetic and depends on full agent instrumentation.

Numbers: Evaluation over 1,033 synthetic scenarios

Results

Detection Rate

Value: 99.81%

Baseline: OT 93.98%; LS 68.52%

False Positive Rate

Value: 0.0121%

Baseline: OT 0.0242%

Risk Coverage Rate

Value: 94.41%

Baseline: OT 84.44%; LS 60.46%

Causal Chain Clarity

Value: 0.822

Baseline: OT 0.448; LS 0.348

Predictive Alerting

Value: 0.672

Baseline: OT 0.341; LS 0.107

Proactive Intervention

Value: 0.578

Baseline: OT 0.116; LS 0.02

Who Should Care

What To Try In 7 Days

Instrument one agent to emit ATS events (cognitive, action, coordination).

Compute ARI for that agent class and set a governance tier.

Write 2–3 FSM rules for high-risk sequences (e.g., approval before execution) and test on replayed traces.
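
For the ARI step above, a first pass could score an agent class on the three dimensions the source names (autonomy, adaptability, continuity) and map the result to a tier. The sketch below is an assumption-heavy illustration: the unweighted mean, the 0.4 and 0.7 cut-offs, the tier names, and the example agents are all hypothetical, because the source does not give the ARI formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Scores in [0, 1] for the three ARI dimensions named in the source."""
    autonomy: float
    adaptability: float
    continuity: float


def agency_risk_index(profile: AgentProfile) -> float:
    """Unweighted mean of the three dimensions (an assumed aggregation)."""
    return (profile.autonomy + profile.adaptability + profile.continuity) / 3


def governance_tier(ari: float) -> str:
    """Map an ARI score to a tier; the cut-offs are illustrative only."""
    if ari >= 0.7:
        return "high"
    if ari >= 0.4:
        return "medium"
    return "low"


# Two hypothetical agent classes with hand-assigned scores.
support_bot = AgentProfile(autonomy=0.3, adaptability=0.2, continuity=0.1)
trading_agent = AgentProfile(autonomy=0.9, adaptability=0.8, continuity=0.7)

for name, profile in [("support_bot", support_bot), ("trading_agent", trading_agent)]:
    ari = agency_risk_index(profile)
    print(f"{name}: ARI={ari:.2f}, tier={governance_tier(ari)}")
```

With these made-up scores the support bot lands in the low tier and the trading agent in the high tier. In practice the tier would decide how much telemetry and how many conformance rules an agent class receives.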

Agent Features

Memory

  • Continuity scoring (session and persistent memory)
  • Goal-conditioned baselines for drift
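
A goal-conditioned baseline means the same behavior can be normal under one goal and suspicious under another. The sketch below illustrates that idea with total variation distance over action frequencies. The distance measure, the 0.3 threshold, the goal names, and the action names are assumptions made for this example, not the method described by the authors.

```python
from collections import Counter


def distribution(actions: list[str]) -> dict[str, float]:
    """Turn a list of action names into relative frequencies."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two discrete distributions, in [0, 1]."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical per-goal baselines built from earlier, trusted runs.
BASELINES = {
    "summarize_report": distribution(["read_file"] * 8 + ["write_summary"] * 2),
    "rebalance_portfolio": distribution(["get_quotes"] * 5 + ["place_order"] * 5),
}

DRIFT_THRESHOLD = 0.3  # illustrative; would need tuning on real traces


def is_drifting(goal: str, recent_actions: list[str]) -> bool:
    """Compare recent behavior with the baseline for the agent's current goal."""
    distance = total_variation(BASELINES[goal], distribution(recent_actions))
    return distance > DRIFT_THRESHOLD


orders = ["get_quotes"] * 4 + ["place_order"] * 6
print(is_drifting("rebalance_portfolio", orders))  # judged against the trading baseline
print(is_drifting("summarize_report", orders))     # same actions, different goal
```

The first call reports no drift (distance 0.1) and the second reports drift (distance 1.0): placing orders is expected while rebalancing a portfolio but not while summarizing a report.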

Planning

  • Goal-aware authorization
  • Pattern-aware conformance rules
  • Planning intervention containment
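
A pattern-aware conformance rule can be expressed as a small finite-state machine that is stepped once per event. The sketch below encodes an "approval before execution" rule with a freshness limit and replays a recorded trace through it. The event names, the 60-second limit, and the one-approval-per-execute policy are assumptions for illustration, not rules taken from MI9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A minimal stand-in for a telemetry event; field names are hypothetical."""
    kind: str         # e.g. "approval.granted", "action.execute"
    timestamp: float  # seconds since the start of the trace


class ApprovalBeforeExecution:
    """Two-state FSM: an execute is allowed only after a fresh approval."""

    def __init__(self, max_age_s: float = 60.0) -> None:
        self.max_age_s = max_age_s
        self.state = "AWAITING_APPROVAL"
        self.approved_at = 0.0

    def step(self, event: Event) -> bool:
        """Advance the FSM; return True if this event violates the policy."""
        if event.kind == "approval.granted":
            self.state, self.approved_at = "APPROVED", event.timestamp
            return False
        if event.kind == "action.execute":
            fresh = (
                self.state == "APPROVED"
                and event.timestamp - self.approved_at <= self.max_age_s
            )
            self.state = "AWAITING_APPROVAL"  # each approval covers one execute
            return not fresh
        return False  # unrelated events do not move the FSM


def replay(trace: list[Event]) -> list[Event]:
    """Run a recorded trace through a new FSM and collect violating events."""
    fsm = ApprovalBeforeExecution()
    return [event for event in trace if fsm.step(event)]


trace = [
    Event("approval.granted", 0.0),
    Event("action.execute", 10.0),  # within 60 s of approval
    Event("action.execute", 12.0),  # approval already consumed
    Event("approval.granted", 20.0),
    Event("action.execute", 95.0),  # approval older than 60 s
]
for violation in replay(trace):
    print("violation:", violation)
```

The replay flags the executes at 12.0 s and 95.0 s. Keeping the rule as an explicit state machine makes both the ordering constraint and the time constraint easy to test against stored traces before enabling enforcement.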

Tool Use

  • Tool restriction and substitution
  • Delegation chain monitoring
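
Delegation chain monitoring implies that a sub-agent's permissions depend on who delegated to it. One way to sketch this is to compute effective permissions as the intersection of grants along the chain and to re-check them on every call, so an upstream revocation takes effect immediately. The `Agent` class, tool names, and intersection policy below are assumptions for illustration, not MI9's authorization model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    """An agent with its own grants and an optional delegating parent."""
    name: str
    grants: set[str] = field(default_factory=set)
    parent: Optional["Agent"] = None


def delegation_chain(agent: Agent) -> list[Agent]:
    """Return the chain from the root delegator down to `agent`."""
    chain = []
    node: Optional[Agent] = agent
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


def effective_permissions(agent: Agent) -> set[str]:
    """A sub-agent may hold only what every ancestor currently holds."""
    chain = delegation_chain(agent)
    allowed = set(chain[0].grants)
    for link in chain[1:]:
        allowed &= link.grants
    return allowed


def authorize(agent: Agent, tool: str) -> bool:
    """Re-evaluated on every call, so upstream revocations apply at once."""
    return tool in effective_permissions(agent)


root = Agent("planner", {"search", "send_email", "read_db"})
worker = Agent("worker", {"search", "send_email"}, parent=root)
helper = Agent("helper", {"send_email", "read_db"}, parent=worker)

print(authorize(helper, "send_email"))  # granted along the whole chain
print(authorize(helper, "read_db"))     # worker never held read_db
root.grants.discard("send_email")       # revoke at the top of the chain
print(authorize(helper, "send_email"))  # now denied for every descendant
```

The three calls print True, False, and False. Recomputing on each call is the simplest form of continuous authorization; a production system would likely cache the result and invalidate it on revocation events.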

Frameworks

  • Adapter pattern for LangChain, LangSmith, OpenAI SDK, CrewAI
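
The adapter pattern here means each framework gets a thin translator that converts its native records into one shared event shape, so governance logic is written once. The sketch below shows the idea with two invented frameworks. The record layouts, class names, and `ATSEvent` fields are hypothetical and do not correspond to the actual callback APIs of LangChain, LangSmith, the OpenAI SDK, or CrewAI; only the three event categories come from the source.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ATSEvent:
    """Normalized event; the three categories come from the source text."""
    category: str  # "cognitive", "action", or "coordination"
    name: str
    agent_id: str
    payload: dict[str, Any]


class FrameworkAdapter(ABC):
    """Translates one framework's native records into ATSEvent objects."""

    @abstractmethod
    def to_ats(self, raw: dict[str, Any]) -> ATSEvent: ...


class FrameworkAAdapter(FrameworkAdapter):
    """Hypothetical framework A: records shaped {'type', 'agent', 'data'}."""

    _CATEGORY = {"plan": "cognitive", "tool_call": "action", "handoff": "coordination"}

    def to_ats(self, raw: dict[str, Any]) -> ATSEvent:
        return ATSEvent(self._CATEGORY[raw["type"]], raw["type"], raw["agent"], raw["data"])


class FrameworkBAdapter(FrameworkAdapter):
    """Hypothetical framework B: flat records with 'event' like 'action/tool.invoke'."""

    def to_ats(self, raw: dict[str, Any]) -> ATSEvent:
        category, name = raw["event"].split("/", 1)
        extras = {k: v for k, v in raw.items() if k not in ("event", "actor")}
        return ATSEvent(category, name, raw["actor"], extras)


events = [
    FrameworkAAdapter().to_ats(
        {"type": "tool_call", "agent": "a1", "data": {"tool": "search"}}
    ),
    FrameworkBAdapter().to_ats(
        {"event": "coordination/agent.delegate", "actor": "a2", "to": "a3"}
    ),
]
for event in events:
    print(event.category, event.name, event.agent_id)
```

Downstream components such as conformance rules or drift detectors would consume only `ATSEvent`, so adding a framework means writing one new adapter class and nothing else.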

Is Agentic

true

Architectures

  • Framework-layer governance
  • Adapter-based telemetry integration

Collaboration

  • Coordination event capture
  • Multi-agent joint containment

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation uses synthetic LLM-generated traces, not live production logs.
  • Effectiveness requires rich agent instrumentation; opaque APIs create blind spots.
  • Runtime monitoring adds computational and operational overhead that must be tuned for throughput.
  • MI9's governance layer itself increases attack surface and requires adversarial testing.

When Not To Use

  • You cannot instrument agent internals to emit ATS events.
  • Ultra-low-latency systems where any runtime intervention would break correctness.
  • Environments that forbid pausing or isolating agents due to real-time safety constraints.

Failure Modes

  • False negatives when causally related events reorder beyond buffering windows.
  • Blind spots for agents implemented via opaque third-party APIs.
  • Adversaries manipulating declared goals to evade goal-conditioned baselines.
  • Mis-specified FSM patterns causing missed multi-step violations or excessive escalation.
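
The first failure mode above can be made concrete with a reorder buffer: events are held for a fixed window and released in timestamp order, which repairs small delivery delays but not large ones. The sketch below is a generic illustration of that trade-off. The class name, the 5-second window, and the event names are assumptions; the source does not describe how MI9 buffers events.

```python
import heapq


class ReorderBuffer:
    """Releases events in timestamp order once they are older than `window_s`."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._heap: list[tuple[float, str]] = []
        self._watermark = float("-inf")  # newest timestamp released so far

    def push(self, timestamp: float, name: str, now: float) -> list[tuple[float, str, bool]]:
        """Add one event, then release everything older than the window.

        Each released item is (timestamp, name, late). `late` is True when a
        newer event was already released, meaning the reordering could not
        be repaired and sequence rules may have missed this event.
        """
        heapq.heappush(self._heap, (timestamp, name))
        released = []
        while self._heap and self._heap[0][0] <= now - self.window_s:
            ts, event_name = heapq.heappop(self._heap)
            released.append((ts, event_name, ts < self._watermark))
            self._watermark = max(self._watermark, ts)
        return released


buf = ReorderBuffer(window_s=5.0)
out = []
out += buf.push(10.0, "action.execute", now=10.0)
out += buf.push(8.0, "approval.granted", now=11.0)  # 3 s late, inside the window
out += buf.push(20.0, "heartbeat", now=20.0)        # flushes both, in order
out += buf.push(9.0, "goal.revised", now=30.0)      # 21 s late, outside the window
for item in out:
    print(item)
```

The approval that arrived 3 seconds late is released ahead of the execute it authorizes, so a sequence rule sees the correct order. The goal revision that arrived 21 seconds late is flagged as late because the events around it were already released. A larger window reduces such misses at the cost of slower detection.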

Core Entities

Models

  • Gemini 2.5-flash
  • Gemini 2.5-pro

Metrics

  • Detection Rate
  • False Positive Rate
  • Risk Coverage Rate
  • Causal Chain Clarity
  • Predictive Alerting
  • Proactive Intervention
  • Governance Maturity

Datasets

  • Synthetic agent trace corpus (1,033 scenarios)

Context Entities

Models

  • Gemini 2.5-flash (eval settings)
  • Gemini 2.5-pro (generation)

Metrics

  • Detection rate, false positive rate, proactive intervention, causal chain clarity, and predictive alerting at the chosen operating point

Datasets

  • Prompt-conditioned synthetic traces (released prompts/scripts)