Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
MI9 turns opaque agent decisions into actionable runtime controls, reducing undetected risky behaviors while keeping false alarms low, which helps prevent costly operational and compliance incidents.
Summary TLDR
MI9 is a vendor-neutral runtime governance layer for agentic AI. It defines six coordinated components: an Agency-Risk Index (ARI), an agent-semantic telemetry schema (ATS), continuous authorization monitoring (CAM), an FSM-based conformance engine, goal-conditioned drift detection, and a four-level graduated containment system. Together they observe, score, and intervene in live agent behavior. Evaluated on 1,033 synthetic multi-domain scenarios, MI9 reports a 99.81% detection rate with a very low false-positive rate, but the results are synthetic and depend on full instrumentation of agent internals. The authors open-source the prompts and scripts used in their evaluation.
Problem Statement
Agentic AI (models that plan, revise goals, use tools, and spawn subagents) can develop risky behaviors that are visible only during execution. Pre-deployment controls and infrastructure-level monitoring miss cognitive events such as goal changes, tool-chaining, and delegation. Organizations need a runtime layer that observes agent semantics and can intervene without breaking ongoing operations.
Main Contribution
A unified runtime governance framework (MI9) integrating telemetry, authorization, conformance, drift detection, and containment.
Agency-Risk Index (ARI) to tier agents by autonomy, adaptability, and continuity.
Agentic Telemetry Schema (ATS) for cognitive, action, and coordination events.
Continuous Authorization Monitoring (CAM) that revokes or adjusts permissions based on agent context and delegation provenance.
Real-time FSM conformance engine for sequence-and-time-based policy enforcement.
Goal-conditioned drift detection to separate legitimate adaptation from suspicious change.
Graduated containment with four levels (monitor → planning restriction → tool restriction → isolation) to preserve continuity.
Synthetic evaluation across 1,033 scenarios and open-sourced prompts/scripts for reproducibility.
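The four containment levels listed above form an ordered ladder, which can be sketched as an ordered enum. This is a minimal illustration, not MI9's implementation: the names `Containment`, `escalate`, and `level_for`, and the risk thresholds, are all assumptions introduced here.

```python
from enum import IntEnum


class Containment(IntEnum):
    """The four levels named in the summary, ordered from least to most restrictive."""
    MONITOR = 0
    PLANNING_RESTRICTION = 1
    TOOL_RESTRICTION = 2
    ISOLATION = 3


def escalate(level: Containment) -> Containment:
    """Move one step up the ladder, staying at ISOLATION once it is reached."""
    return Containment(min(level + 1, Containment.ISOLATION))


def level_for(risk: float, thresholds=(0.25, 0.5, 0.75)) -> Containment:
    """Map a risk score in [0, 1] to a level.

    The thresholds are hypothetical placeholders; each threshold the score
    reaches raises the level by one step.
    """
    return Containment(sum(risk >= t for t in thresholds))
```

Stepping up one level at a time, rather than jumping straight to isolation, reflects the stated goal of preserving continuity of ongoing operations.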
Key Findings
MI9 detects nearly all simulated governance violations on the evaluated traces (a reported 99.81% detection rate across 1,033 synthetic scenarios).
MI9 keeps the false-positive rate very low while maintaining high detection.
MI9 provides more actionable intelligence, in the form of clearer causal chains and earlier warnings.
Evaluation is synthetic and depends on full agent instrumentation.
Results
Detection Rate
False Positive Rate
Risk Coverage Rate
Causal Chain Clarity
Predictive Alerting
Proactive Intervention
Who Should Care
What To Try In 7 Days
Instrument one agent to emit ATS events (cognitive, action, coordination).
Compute ARI for that agent class and set a governance tier.
Write 2–3 FSM rules for high-risk sequences (e.g., approval before execution) and test on replayed traces.
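The three steps above can be prototyped in a few dozen lines before touching a real agent. The sketch below is only a starting point under stated assumptions: the `Event` fields, the event names (`approval.granted`, `tool.execute`), the unweighted-mean ARI, the tier cutoffs, and the 60-second freshness limit are all hypothetical and are not taken from the MI9 paper.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Event:
    """Hypothetical ATS-style event; the field names are assumptions."""
    kind: str         # "cognitive", "action", or "coordination"
    name: str         # e.g. "approval.granted", "tool.execute"
    timestamp: float  # seconds since the start of the trace


def ari_tier(autonomy: float, adaptability: float, continuity: float) -> str:
    """Toy ARI: unweighted mean of three scores in [0, 1], cut into tiers.

    The weighting and cutoffs are placeholders, not the paper's formula.
    """
    for score in (autonomy, adaptability, continuity):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be in [0, 1]")
    ari = (autonomy + adaptability + continuity) / 3
    if ari < 0.34:
        return "low"
    if ari < 0.67:
        return "medium"
    return "high"


class State(Enum):
    AWAITING_APPROVAL = 1
    APPROVED = 2


def find_violations(trace, max_age=60.0):
    """Two-state FSM rule: approval before execution, with a time bound.

    Every "tool.execute" must follow an "approval.granted" that is at most
    max_age seconds old. Each approval is consumed by a single execution.
    """
    state = State.AWAITING_APPROVAL
    approved_at = 0.0
    violations = []
    for event in trace:
        if event.name == "approval.granted":
            state, approved_at = State.APPROVED, event.timestamp
        elif event.name == "tool.execute":
            fresh = (state is State.APPROVED
                     and event.timestamp - approved_at <= max_age)
            if not fresh:
                violations.append(event)
            state = State.AWAITING_APPROVAL  # approval is used up either way
    return violations
```

Replaying a recorded trace through `find_violations` and inspecting the returned events is a cheap way to check a rule for missed sequences or excessive escalation before enforcing it live.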
Agent Features
Memory
- Continuity scoring (session and persistent memory)
- Goal-conditioned baselines for drift
Planning
- Goal-aware authorization
- Pattern-aware conformance rules
- Planning intervention containment
Tool Use
- Tool restriction and substitution
- Delegation chain monitoring
Frameworks
- Adapter pattern for LangChain, LangSmith, OpenAI SDK, CrewAI
Is Agentic
true
Architectures
- Framework-layer governance
- Adapter-based telemetry integration
Collaboration
- Coordination event capture
- Multi-agent joint containment
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation uses synthetic LLM-generated traces, not live production logs.
- Effectiveness requires rich agent instrumentation; opaque APIs create blind spots.
- Runtime monitoring adds computational and operational overhead that must be tuned for throughput.
- MI9's governance layer itself increases attack surface and requires adversarial testing.
When Not To Use
- You cannot instrument agent internals to emit ATS events.
- Your system is ultra-low-latency and any runtime intervention would break correctness.
- Your environment forbids pausing or isolating agents because of real-time safety constraints.
Failure Modes
- False negatives when causally related events reorder beyond buffering windows.
- Blind spots for agents implemented via opaque third-party APIs.
- Adversaries manipulating declared goals to evade goal-conditioned baselines.
- Mis-specified FSM patterns causing missed multi-step violations or excessive escalation.
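The first failure mode can be illustrated with a simple reorder buffer. This is a generic sketch of windowed event reordering, not MI9's buffering scheme: the function name `reorder`, the `(timestamp, name)` tuple layout, and the window semantics are assumptions made for illustration.

```python
import heapq


def reorder(events, window):
    """Restore timestamp order for events that arrive slightly out of order.

    events: iterable of (timestamp, name) tuples in arrival order.
    window: seconds an event is held back to wait for stragglers.
    Returns (ordered, late). Events in `late` arrived after newer events had
    already been released, so sequence rules downstream never see them in order.
    """
    pending = []                       # min-heap keyed on timestamp
    ordered, late = [], []
    released_up_to = float("-inf")     # timestamp of the last released event
    newest = float("-inf")             # newest timestamp seen so far
    for ts, name in events:
        if ts < released_up_to:
            late.append((ts, name))    # reordered beyond the window
            continue
        heapq.heappush(pending, (ts, name))
        newest = max(newest, ts)
        # Release everything that is at least `window` older than the newest event.
        while pending and pending[0][0] <= newest - window:
            item = heapq.heappop(pending)
            released_up_to = item[0]
            ordered.append(item)
    while pending:                     # flush what is left at end of trace
        ordered.append(heapq.heappop(pending))
    return ordered, late
```

An event that lands in `late` is exactly the kind that a sequence-based rule can miss, so widening the window trades detection latency for fewer such gaps.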
Core Entities
Models
- Gemini 2.5 Flash
- Gemini 2.5 Pro
Metrics
- Detection Rate
- False Positive Rate
- Risk Coverage Rate
- Causal Chain Clarity
- Predictive Alerting
- Proactive Intervention
- Governance Maturity
Datasets
- Synthetic agent trace corpus (1,033 scenarios)
Context Entities
Models
- Gemini 2.5 Flash (eval settings)
- Gemini 2.5 Pro (generation)
Metrics
- Detection/FPR/Intervention/Clarity/Predictive at chosen operating point
Datasets
- Prompt-conditioned synthetic traces (released prompts/scripts)

