Overview
The idea is modular and useful, but the paper provides only automated, model-to-model evaluation: no human validation was performed, and action-selection was not implemented or deployed.
Citations: 0
Evidence Strength: 0.50
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
COMPASS lets firms check legal, ethical, and carbon constraints before an agent acts, lowering regulatory and reputational risk while keeping explainable records of why decisions were blocked or allowed.
Summary TLDR
COMPASS is a modular orchestration layer that intercepts agent actions and routes them to four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics). Each sub-agent uses Retrieval-Augmented Generation (RAG) and an LLM-as-a-judge to produce scores, constraints, and short explanations. Automated tests show that RAG changes judgments (+0.25 in 5 of 10 sovereignty tests and -0.25 in 5 of 10 compliance tests) and raises the semantic grounding of explanations (BERTScore ~75–85%). The system currently implements only evaluation and explainability; action-selection and human validation are left for future work.
Problem Statement
Agentic LLM systems make autonomous choices that can conflict with local law, energy targets, and ethical norms. Existing governance tools treat these dimensions separately or post-hoc. Practitioners need an explainable, real-time layer that checks actions across sovereignty, sustainability, compliance, and ethics before execution.
Main Contribution
Design of COMPASS: an Orchestrator plus four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics) that evaluate requests before action.
Integration of RAG per sub-agent so judgments are grounded in context-specific documents and local rules.
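The routing pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `Judgment`, `Orchestrator`, `make_stub_agent`, and `toy_retrieve` are hypothetical, the stub stands in for a real LLM-as-a-judge call, and the stub's score values (0.5 with documents, 0.25 without) are arbitrary placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Judgment:
    dimension: str
    score: float            # assumed scale: 0.0 to 1.0
    constraints: List[str]
    explanation: str


# A sub-agent maps (request, retrieved documents) to a Judgment.
SubAgent = Callable[[str, List[str]], Judgment]
# A retriever maps (dimension, request) to a list of document identifiers.
Retriever = Callable[[str, str], List[str]]


def make_stub_agent(dimension: str) -> SubAgent:
    """Build a placeholder judge; a real one would prompt an LLM."""
    def agent(request: str, docs: List[str]) -> Judgment:
        score = 0.5 if docs else 0.25   # arbitrary placeholder values
        return Judgment(
            dimension=dimension,
            score=score,
            constraints=[f"cite:{d}" for d in docs],
            explanation=f"{dimension}: judged with {len(docs)} retrieved document(s)",
        )
    return agent


class Orchestrator:
    """Intercepts a request and collects one Judgment per sub-agent."""

    def __init__(self, agents: Dict[str, SubAgent], retrieve: Retriever):
        self.agents = agents
        self.retrieve = retrieve

    def evaluate(self, request: str) -> Dict[str, Judgment]:
        return {
            name: agent(request, self.retrieve(name, request))
            for name, agent in self.agents.items()
        }


DIMENSIONS = ["sovereignty", "carbon", "compliance", "ethics"]


def toy_retrieve(dimension: str, request: str) -> List[str]:
    # Hypothetical corpus: only the compliance agent has a document.
    corpus = {"compliance": ["policy-doc-A"]}
    return corpus.get(dimension, [])


orchestrator = Orchestrator({d: make_stub_agent(d) for d in DIMENSIONS}, toy_retrieve)
report = orchestrator.evaluate("export customer records to another region")
for judgment in report.values():
    print(judgment.dimension, judgment.score, judgment.explanation)
```

Keeping retrieval as an injected function means each sub-agent can be pointed at its own document store without changing the Orchestrator, which matches the per-sub-agent RAG design the paper describes.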
Key Findings
RAG raised Sovereignty scores by 0.25 in 5 of 10 tests (Table 5).
RAG lowered Compliance scores by 0.25 in 5 of 10 tests (Table 7).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Sovereignty ∆ Score (with vs without RAG) | +0.25 (observed per test where noted) | Score without RAG | +0.25 in 5 of 10 SOV tests | Table 5 (SOV-01..SOV-10) | RAG raised several sovereignty scores from 0.25 to 0.5 or 0.5 to 0.75 | Table 5 |
| Compliance ∆ Score (with vs without RAG) | -0.25 (observed per test where noted) | Score without RAG | -0.25 in 5 of 10 COM tests | Table 7 (COM-01..COM-10) | RAG lowered some compliance scores (e.g., 0.50→0.25) | Table 7 |
What To Try In 7 Days
Run a lightweight RAG pipeline for one compliance rule and compare judge outputs with/without RAG.
Attach a small Orchestrator wrapper to an internal chatbot to emit per-request scores and short explanations.
Collect a short set of local regulation and policy documents into a vector DB for immediate RAG tests.
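The with/without-RAG comparison in the first step can be run with a small harness like the one below. It is a sketch under stated assumptions: `score_deltas`, `summarize`, `toy_judge`, and `toy_retrieve` are hypothetical names, the judge is a keyword-matching stand-in for an LLM call, and the two test cases and their scores are invented for illustration and are not results from the paper.

```python
from typing import Callable, Dict, List, Optional

# A judge scores a prompt, optionally using retrieved documents (None = RAG off).
Judge = Callable[[str, Optional[List[str]]], float]


def score_deltas(
    cases: Dict[str, str],
    judge: Judge,
    retrieve: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Return (score with RAG) minus (score without RAG) for each case."""
    deltas = {}
    for case_id, prompt in cases.items():
        without_rag = judge(prompt, None)
        with_rag = judge(prompt, retrieve(prompt))
        deltas[case_id] = with_rag - without_rag
    return deltas


def summarize(deltas: Dict[str, float]) -> Dict[str, float]:
    """Count how many cases changed and compute the mean delta."""
    changed = sum(1 for d in deltas.values() if d != 0)
    mean = sum(deltas.values()) / len(deltas) if deltas else 0.0
    return {"n": len(deltas), "changed": changed, "mean_delta": mean}


def toy_judge(prompt: str, docs: Optional[List[str]]) -> float:
    # Stand-in for an LLM-as-a-judge: lowers the score when a retrieved
    # rule explicitly prohibits the action. Values are illustrative only.
    if docs and any("prohibited" in d for d in docs):
        return 0.25
    return 0.5


def toy_retrieve(prompt: str) -> List[str]:
    # Hypothetical one-rule corpus.
    if "marketing" in prompt:
        return ["Reuse of health data for marketing is prohibited."]
    return []


cases = {
    "TEST-01": "share health data with a marketing partner",
    "TEST-02": "send an appointment reminder to a patient",
}
deltas = score_deltas(cases, toy_judge, toy_retrieve)
print(deltas)
print(summarize(deltas))
```

Swapping `toy_judge` for a real model call and `toy_retrieve` for a vector-DB query gives the same per-case delta report without changing the harness.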
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Evaluation is automated only; no human-in-the-loop validation performed.
Action-selection and enforcement are conceptual and not implemented.
When Not To Use
Where real-time enforcement or automated blocking is required now (framework lacks action execution).
In high-stakes settings until human validation confirms judge reliability.
Failure Modes
Judge hallucinations when RAG is disabled or retrieval fails.
Conflicting sub-agent scores (e.g., sovereignty vs carbon) without a robust resolution policy.
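One way to make the conflicting-scores failure mode visible is to detect large gaps between sub-agent scores and escalate them. The paper does not specify a resolution policy, so the sketch below is purely an assumption: `find_conflicts` and `resolve_conservatively` are hypothetical names, the `max_gap` threshold of 0.5 is arbitrary, higher scores are assumed to mean "more acceptable", and taking the minimum score is just one conservative choice.

```python
from typing import Dict, List, Tuple


def find_conflicts(
    scores: Dict[str, float], max_gap: float = 0.5
) -> List[Tuple[str, str, float]]:
    """Return pairs of dimensions whose scores differ by at least max_gap."""
    names = sorted(scores)
    conflicts = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = abs(scores[a] - scores[b])
            if gap >= max_gap:
                conflicts.append((a, b, gap))
    return conflicts


def resolve_conservatively(scores: Dict[str, float], max_gap: float = 0.5) -> dict:
    """Assumed policy: act on the lowest score and flag conflicts for review."""
    conflicts = find_conflicts(scores, max_gap)
    return {
        "decision_score": min(scores.values()),
        "needs_human_review": bool(conflicts),
        "conflicts": conflicts,
    }


# Illustrative scores only; not taken from the paper.
scores = {"sovereignty": 0.75, "carbon": 0.25, "compliance": 0.5, "ethics": 0.5}
result = resolve_conservatively(scores)
print(result)
```

Routing flagged cases to a human reviewer also addresses the limitation noted above that evaluation is currently automated only.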

