Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
COMPASS lets firms check legal, ethical, and carbon constraints before an agent acts, lowering regulatory and reputational risk while keeping explainable records of why decisions were blocked or allowed.
Summary TLDR
COMPASS is a modular orchestration layer that intercepts agent actions and routes them to four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics). Each sub-agent uses Retrieval-Augmented Generation (RAG) and an LLM-as-a-judge to produce scores, constraints, and short explanations. Automated tests show that adding RAG shifts judgments (notably +0.25 in many sovereignty cases and -0.25 in many compliance cases) and improves the semantic grounding of explanations (BERTScore ~75–85%). The system currently implements only evaluation and explainability; action selection and human validation are left for future work.
Problem Statement
Agentic LLM systems make autonomous choices that can conflict with local law, energy targets, and ethical norms. Existing governance tools treat these dimensions separately or post-hoc. Practitioners need an explainable, real-time layer that checks actions across sovereignty, sustainability, compliance, and ethics before execution.
Main Contribution
Design of COMPASS: an Orchestrator plus four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics) that evaluate requests before action.
Integration of RAG per sub-agent so judgments are grounded in context-specific documents and local rules.
Use of an "LLM-as-a-judge" pipeline to emit numeric scores and short explainable justifications for each dimension.
Automated evaluation (no human raters yet) showing that RAG changes scores and improves the semantic coherence of explanations, as measured by BERTScore.
A composition-based software pattern that forces governance checks via inheritance and modular interfaces.
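The Orchestrator-plus-sub-agents design listed above can be illustrated with a minimal sketch. Every name here (`Judgment`, `SubAgent`, `KeywordAgent`, `Orchestrator.evaluate`) is hypothetical, and the keyword rule is only a stand-in for the RAG and LLM-as-a-judge step; this is not the authors' implementation, which has not been released.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Judgment:
    dimension: str      # e.g. "sovereignty"
    score: float        # numeric score in [0.0, 1.0]
    explanation: str    # short justification


class SubAgent(Protocol):
    """Interface every specialist sub-agent is assumed to satisfy."""
    dimension: str

    def evaluate(self, request: str) -> Judgment: ...


@dataclass
class KeywordAgent:
    """Placeholder sub-agent: a keyword rule stands in for RAG + LLM judge."""
    dimension: str
    red_flag: str

    def evaluate(self, request: str) -> Judgment:
        flagged = self.red_flag in request.lower()
        score = 0.25 if flagged else 1.0
        reason = f"found '{self.red_flag}'" if flagged else "no red flag found"
        return Judgment(self.dimension, score, reason)


class Orchestrator:
    """Holds sub-agents by composition and collects one judgment per dimension."""

    def __init__(self, agents: list[SubAgent]) -> None:
        self.agents = agents

    def evaluate(self, request: str) -> dict[str, Judgment]:
        return {a.dimension: a.evaluate(request) for a in self.agents}


orchestrator = Orchestrator([
    KeywordAgent("sovereignty", "offshore"),
    KeywordAgent("carbon", "gpu cluster"),
    KeywordAgent("compliance", "personal data"),
    KeywordAgent("ethics", "profiling"),
])
report = orchestrator.evaluate("Export personal data to an offshore server")
for judgment in report.values():
    print(judgment)
```

Because the Orchestrator only depends on the `SubAgent` interface, a sub-agent can be swapped for a real judge without touching the orchestration code.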
Key Findings
RAG changed Sovereignty judgments upward in multiple tests.
RAG often lowered Compliance scores for tested cases.
Explanations from the non-augmented and RAG-augmented judges are semantically similar, but the RAG-augmented explanations are more clearly grounded in the retrieved documents.
Results
Sovereignty ∆ Score (with vs without RAG)
Compliance ∆ Score (with vs without RAG)
Explanation semantic similarity (BERTScore)
Who Should Care
What To Try In 7 Days
Run a lightweight RAG pipeline for one compliance rule and compare judge outputs with/without RAG.
Attach a small Orchestrator wrapper to an internal chatbot to emit per-request scores and short explanations.
Collect a short set of local regulation and policy documents into a vector DB for immediate RAG tests.
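A with/without-RAG comparison along the lines of the steps above could start from a sketch like the following. It assumes a toy bag-of-words retriever in place of a real vector DB, and the documents, function names, and prompt wording are all made up for illustration. The judge call itself is left out on purpose: both prompts would be sent to the same judge model and the resulting scores compared.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical policy snippets; in practice these would be local regulations.
DOCUMENTS = [
    "Customer records must be stored on servers located inside the country.",
    "Marketing emails require prior opt-in consent from the recipient.",
    "Batch training jobs should run during low carbon intensity hours.",
]


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (toy vector search)."""
    q = bag_of_words(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_judge_prompt(request: str, use_rag: bool) -> str:
    """Build the judge prompt, with or without retrieved context."""
    context = "\n".join(retrieve(request)) if use_rag else "(no retrieved context)"
    return (
        "Rate the compliance of the request from 0.0 to 1.0 and explain briefly.\n"
        f"Context:\n{context}\n"
        f"Request: {request}\n"
    )


request = "Store customer records on servers in another country"
prompt_plain = build_judge_prompt(request, use_rag=False)
prompt_rag = build_judge_prompt(request, use_rag=True)
print(prompt_rag)
```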
Agent Features
Memory
- Retrieval memory via dynamic document vector stores
Planning
- Decision synthesis (constraint aggregation and weighted scoring)
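One possible reading of "constraint aggregation and weighted scoring" is a weighted mean over the four dimensions combined with hard per-dimension minimums. The sketch below assumes exactly that; the weights, minimums, and function names are hypothetical, and the source does not specify the actual aggregation rule.

```python
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each in [0.0, 1.0]."""
    total_weight = sum(weights[d] for d in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def violated(scores: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Dimensions whose score falls below its hard minimum (constraint check)."""
    return sorted(d for d, s in scores.items() if s < minimums.get(d, 0.0))


# Illustrative values only.
scores = {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.25, "ethics": 1.0}
weights = {"sovereignty": 1.0, "carbon": 1.0, "compliance": 2.0, "ethics": 1.0}
minimums = {"compliance": 0.5}

overall = aggregate(scores, weights)
blocked_by = violated(scores, minimums)
print(overall, blocked_by)
```

Keeping the hard minimums separate from the weighted mean means a strong score on one dimension cannot mask a failing score on another.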
Tool Use
- Retrieval-Augmented Generation (RAG)
- Vector DB queries
Frameworks
- LLM-as-a-judge
- Local LLM instantiation (Mistral-7B config provided)
Is Agentic
true
Architectures
- Multi-agent Orchestration
- Composition-based OOP (Orchestrator + sub-agents)
Collaboration
- Synchronous sub-agent evaluation and conflict resolution
Optimization Features
Token Efficiency
- Conservative generation (max tokens 256) in judge prompts
Infra Optimization
- Reference to CodeCarbon and energy-intensity queries for carbon awareness
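As a rough illustration of carbon awareness, emissions can be estimated as energy consumed multiplied by the grid's carbon intensity. The sketch below assumes both values are already known (for example from an energy tracker and a grid-intensity lookup); the function names, the budget, and the mapping from emissions to a 0.0–1.0 score are hypothetical and not taken from the source.

```python
def estimate_emissions_g(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Estimated emissions in gCO2e = energy (kWh) x grid intensity (gCO2e/kWh)."""
    return energy_kwh * intensity_g_per_kwh


def carbon_score(emissions_g: float, budget_g: float) -> float:
    """Hypothetical mapping: 1.0 at zero emissions, 0.0 at or above the budget."""
    if budget_g <= 0:
        raise ValueError("budget_g must be positive")
    return max(0.0, 1.0 - emissions_g / budget_g)


# Illustrative numbers only.
emissions = estimate_emissions_g(energy_kwh=0.002, intensity_g_per_kwh=400.0)
score = carbon_score(emissions, budget_g=2.0)
print(emissions, score)
```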
System Optimization
- Orchestrator is designed to prevent execution until thresholds are satisfied (runtime gating)
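Runtime gating of this kind can be sketched as a wrapper that evaluates a request before the action runs. Since enforcement is described as conceptual in the source, the following is only one way it might look; `gated`, `ActionBlocked`, `toy_evaluate`, and the threshold values are hypothetical.

```python
from typing import Callable


class ActionBlocked(Exception):
    """Raised when a request fails at least one governance threshold."""


def gated(evaluate: Callable[[str], dict[str, float]],
          thresholds: dict[str, float]):
    """Decorator: run `evaluate` first and only call the action if all pass."""
    def wrap(action: Callable[[str], str]) -> Callable[[str], str]:
        def guarded(request: str) -> str:
            scores = evaluate(request)
            failed = {d: s for d, s in scores.items()
                      if s < thresholds.get(d, 0.0)}
            if failed:
                raise ActionBlocked(f"below threshold: {failed}")
            return action(request)
        return guarded
    return wrap


def toy_evaluate(request: str) -> dict[str, float]:
    # Stand-in for the sub-agent judges; a real system would call them here.
    return {"compliance": 0.25 if "personal data" in request else 1.0}


@gated(toy_evaluate, thresholds={"compliance": 0.5})
def send(request: str) -> str:
    return f"executed: {request}"


print(send("publish weekly report"))
try:
    send("export personal data")
except ActionBlocked as err:
    print("blocked:", err)
```

Raising an exception rather than returning a flag makes it harder for calling code to ignore a failed check by accident.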
Inference Optimization
- Carbon-aware inference monitoring (design concept)
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Evaluation is automated only; no human-in-the-loop validation was performed.
- Action-selection and enforcement are conceptual and not implemented.
- RAG document curation and trustworthiness were not systematically studied.
- LLM-as-judge bias and calibration are acknowledged but not mitigated here.
- No released code or datasets at time of writing.
When Not To Use
- Where real-time enforcement or automated blocking is required now (framework lacks action execution).
- In high-stakes settings until human validation confirms judge reliability.
- If you cannot supply trusted, up-to-date documents for RAG.
Failure Modes
- Judge hallucinations when RAG is disabled or retrieval fails.
- Conflicting sub-agent scores (e.g., sovereignty vs carbon) without a robust resolution policy.
- Poor document curation leading to incorrect grounding or outdated laws.
- Over-reliance on a single LLM judge may inherit its biases.
Core Entities
Models
- Mistral-7B-Instruct-v0.2
Metrics
- BERTScore
- Numeric score per dimension (0.0–1.0)
- ∆ Score (with vs without RAG)
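The ∆ Score metric can be computed per test case as the score with RAG minus the score without RAG, a sign convention that matches the reported upward shift for sovereignty and downward shift for compliance. In this sketch the function name, test ids, and scores are made up for illustration and are not results from the paper.

```python
def delta_scores(with_rag: dict[str, float],
                 without_rag: dict[str, float]) -> dict[str, float]:
    """Per-test delta: score with RAG minus score without RAG."""
    return {tid: with_rag[tid] - without_rag[tid]
            for tid in with_rag if tid in without_rag}


# Illustrative numbers only; the test ids and scores are made up.
without_rag = {"SOV-1": 0.5, "COM-1": 0.75}
with_rag = {"SOV-1": 0.75, "COM-1": 0.5}
deltas = delta_scores(with_rag, without_rag)
print(deltas)
```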
Benchmarks
- Internal test set (SOV/CAR/COM/ETH test ids)

