COMPASS: a multi-agent orchestration that uses RAG and an LLM-as-judge to enforce sovereignty, carbon-awareness, compliance, and ethics in real time

March 11, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The idea is modular and useful, but the paper provides only automated, model-to-model evaluation. It reports no human validation, and action selection was not implemented.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Jean-Sébastien Dessureault, Alain-Thierry Iliho Manzi, Soukaina Alaoui Ismaili, Khadim Lo, Mireille Lalancette, Éric Bélanger

Why It Matters For Business

COMPASS lets firms check legal, ethical, and carbon constraints before an agent acts, lowering regulatory and reputational risk while keeping explainable records of why decisions were blocked or allowed.

Summary TLDR

COMPASS is a modular orchestration layer that intercepts agent actions and routes them to four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics). Each sub-agent uses Retrieval-Augmented Generation (RAG) and an LLM-as-a-judge to produce scores, constraints, and short explanations. In automated tests, RAG shifts judgments (+0.25 in 5 of 10 sovereignty tests and -0.25 in 5 of 10 compliance tests) and raises the semantic grounding of explanations (BERTScore of roughly 75–85%). The system currently implements only evaluation and explainability; action selection and human validation are left for future work.

Problem Statement

Agentic LLM systems make autonomous choices that can conflict with local law, energy targets, and ethical norms. Existing governance tools treat these dimensions separately or post-hoc. Practitioners need an explainable, real-time layer that checks actions across sovereignty, sustainability, compliance, and ethics before execution.

Main Contribution

Design of COMPASS: an Orchestrator plus four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics) that evaluate requests before action.

Integration of RAG per sub-agent so judgments are grounded in context-specific documents and local rules.
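
The composition described above (one Orchestrator, four sub-agents, each pairing retrieval with a judge) can be sketched as follows. This is a minimal illustration, not the authors' code: the class names, the `Judgment` fields, and the stub retriever and judge are all assumptions, and a real system would replace the stubs with a vector store query and an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Judgment:
    dimension: str
    score: float            # 0.0-1.0, matching the per-dimension score range
    constraints: List[str]
    explanation: str


# Hypothetical signatures: a retriever maps a request to passages,
# a judge maps (dimension, request, passages) to a Judgment.
Retriever = Callable[[str], List[str]]
Judge = Callable[[str, str, List[str]], Judgment]


@dataclass
class SubAgent:
    dimension: str
    retrieve: Retriever
    judge: Judge

    def evaluate(self, request: str) -> Judgment:
        passages = self.retrieve(request)           # RAG step
        return self.judge(self.dimension, request, passages)


@dataclass
class Orchestrator:
    sub_agents: List[SubAgent] = field(default_factory=list)

    def evaluate(self, request: str) -> List[Judgment]:
        # Every sub-agent sees the request before any action is taken.
        return [agent.evaluate(request) for agent in self.sub_agents]


def stub_retrieve(request: str) -> List[str]:
    return ["(placeholder policy text)"]            # stands in for a vector DB


def stub_judge(dimension: str, request: str, passages: List[str]) -> Judgment:
    # Stands in for an LLM-as-a-judge call; the score is a fixed placeholder.
    note = f"{dimension}: judged with {len(passages)} retrieved passage(s)"
    return Judgment(dimension, 0.5, [], note)


DIMENSIONS = ("Sovereignty", "Carbon", "Compliance", "Ethics")
orchestrator = Orchestrator(
    [SubAgent(d, stub_retrieve, stub_judge) for d in DIMENSIONS]
)
judgments = orchestrator.evaluate("Store customer records in another region")
for j in judgments:
    print(j.dimension, j.score, j.explanation)
```

Keeping the retriever and judge as injected callables means each sub-agent can point at its own document collection without changing the Orchestrator.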

Key Findings

RAG changed Sovereignty judgments upward in multiple tests.

Numbers: ∆ Score = +0.25 in 5 of 10 SOV tests (SOV-01, SOV-06, SOV-07, SOV-08, SOV-10)

Practical Use: Adding RAG over local documents can shift sovereignty assessments upward; here it raised the score in half of the tested cases.

Evidence Ref: Table 5

RAG often lowered Compliance scores for tested cases.

Numbers: ∆ Score = -0.25 in 5 of 10 COM tests (COM-01, COM-02, COM-05, COM-07, COM-10)

Practical Use: Grounding evaluations in regulations can reveal compliance gaps and reduce false-positive approvals from ungrounded judges.

Evidence Ref: Table 7

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Sovereignty ∆ Score (with vs without RAG) | +0.25 (observed per test where noted) | Score without RAG | +0.25 in 5 of 10 SOV tests | Table 5 (SOV-01..SOV-10) | RAG raised several sovereignty scores from 0.25 to 0.5 or 0.5 to 0.75 | Table 5
Compliance ∆ Score (with vs without RAG) | -0.25 (observed per test where noted) | Score without RAG | -0.25 in 5 of 10 COM tests | Table 7 (COM-01..COM-10) | RAG lowered some compliance scores (e.g., 0.50→0.25) | Table 7
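
The ∆ Score in the table is the score with RAG minus the score without RAG, computed per test. A short sketch of that arithmetic is below. The test ids and score pairs are placeholders chosen for illustration; they are not the values from Table 5 or Table 7, and the helper names are hypothetical.

```python
from typing import Dict, Tuple


def delta_scores(pairs: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map test id -> (score with RAG) - (score without RAG)."""
    return {
        test_id: round(with_rag - without_rag, 2)
        for test_id, (without_rag, with_rag) in pairs.items()
    }


def count_shifted(deltas: Dict[str, float], target: float) -> int:
    """Count tests whose delta equals the target shift."""
    return sum(1 for d in deltas.values() if abs(d - target) < 1e-9)


# Placeholder (without_rag, with_rag) pairs; NOT the paper's reported data.
example = {
    "TEST-A": (0.25, 0.50),
    "TEST-B": (0.50, 0.75),
    "TEST-C": (0.50, 0.50),
    "TEST-D": (0.50, 0.25),
}

deltas = delta_scores(example)
raised = count_shifted(deltas, 0.25)
lowered = count_shifted(deltas, -0.25)
print(deltas)
print(f"raised by 0.25: {raised} of {len(deltas)}")
print(f"lowered by 0.25: {lowered} of {len(deltas)}")
```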

What To Try In 7 Days

Run a lightweight RAG pipeline for one compliance rule and compare judge outputs with/without RAG.

Attach a small Orchestrator wrapper to an internal chatbot to emit per-request scores and short explanations.

Collect a short set of local regulation and policy documents into a vector DB for immediate RAG tests.
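
A with/without-RAG comparison like the one suggested above can be prototyped before any vector DB is set up. The sketch below makes several assumptions: retrieval is plain word overlap rather than embedding search, the prompt wording is invented, and `judge` is any callable that takes a prompt string (in practice an LLM call). The policy sentences are placeholders.

```python
import re
from typing import Callable, Dict, List


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by shared words with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(request: str, context: List[str]) -> str:
    lines = ["Rate the request for compliance from 0.0 to 1.0 and explain briefly."]
    if context:
        lines.append("Relevant policy excerpts:")
        lines.extend(f"- {passage}" for passage in context)
    lines.append(f"Request: {request}")
    return "\n".join(lines)


def compare(request: str, documents: List[str],
            judge: Callable[[str], str]) -> Dict[str, str]:
    """Run the same judge on an ungrounded and a grounded prompt."""
    grounded = retrieve(request, documents)
    return {
        "without_rag": judge(build_prompt(request, [])),
        "with_rag": judge(build_prompt(request, grounded)),
    }


# Placeholder documents; a real test would load local regulations.
docs = [
    "Placeholder rule: personal data must stay in the home region.",
    "Placeholder rule: office plants are watered weekly.",
]
request = "Export personal data to another region"

# Identity judge so the two prompts can be inspected side by side.
result = compare(request, docs, judge=lambda prompt: prompt)
print(result["without_rag"])
print("---")
print(result["with_rag"])
```

Swapping the identity judge for a real model call and logging both outputs per request gives the raw material for a ∆ Score comparison.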

Agent Features

Memory: Retrieval memory via dynamic document vector stores
Planning: Decision synthesis (constraint aggregation and weighted scoring)
Tool Use: Retrieval-Augmented Generation (RAG); Vector DB queries
Frameworks: LLM-as-a-judge; Local LLM instantiation (Mistral-7B config provided)
Is Agentic: Yes
Architectures: Multi-agent Orchestration; Composition-based OOP (Orchestrator + sub-agents)
Collaboration: Synchronous sub-agent evaluation and conflict resolution
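
Weighted scoring and conflict detection across the four dimensions might look like the sketch below. The summary does not give the paper's weights, formula, or resolution policy, so the weighted mean, the equal weights, and the 0.5 disagreement gap are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores (assumed aggregation rule)."""
    total_weight = sum(weights[d] for d in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def find_conflicts(scores: Dict[str, float], gap: float = 0.5) -> List[Tuple[str, str]]:
    """Pairs of dimensions whose scores differ by at least `gap`."""
    dims = sorted(scores)
    return [
        (a, b)
        for i, a in enumerate(dims)
        for b in dims[i + 1:]
        if abs(scores[a] - scores[b]) >= gap
    ]


# Placeholder scores and equal weights, for illustration only.
scores = {"sovereignty": 0.75, "carbon": 0.25, "compliance": 0.5, "ethics": 0.5}
weights = {d: 1.0 for d in scores}

overall = weighted_score(scores, weights)
conflicts = find_conflicts(scores)
print(overall)
print(conflicts)
```

Reporting the conflicting pairs alongside the aggregate keeps a disagreement such as sovereignty versus carbon visible instead of averaging it away.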

Optimization Features

Token Efficiency: Conservative generation (max tokens 256) in judge prompts
Infra Optimization: Reference to CodeCarbon and energy-intensity queries for carbon awareness
System Optimization: Orchestrator prevents execution until thresholds are satisfied (runtime gating)
Inference Optimization: Carbon-aware inference monitoring (design concept)
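
Runtime gating can be pictured as a wrapper that refuses to run an action while any dimension is below its threshold. Because the summary notes that enforcement is not implemented in the paper, the sketch below is purely hypothetical: the function names, the `ActionBlocked` exception, and the 0.5 thresholds are assumptions.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class ActionBlocked(Exception):
    """Raised when one or more dimensions fall below threshold."""


def failing_dimensions(scores: Dict[str, float],
                       thresholds: Dict[str, float]) -> List[str]:
    # A dimension with no score is treated as failing.
    return [
        dim for dim, minimum in thresholds.items()
        if dim not in scores or scores[dim] < minimum
    ]


def gated_execute(action: Callable[[], T], scores: Dict[str, float],
                  thresholds: Dict[str, float]) -> T:
    """Run `action` only if every dimension meets its threshold."""
    failed = failing_dimensions(scores, thresholds)
    if failed:
        raise ActionBlocked("blocked by: " + ", ".join(failed))
    return action()


# Placeholder thresholds and scores, for illustration only.
thresholds = {"sovereignty": 0.5, "carbon": 0.5, "compliance": 0.5, "ethics": 0.5}
passing = {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.5, "ethics": 0.75}
failing = {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.25, "ethics": 0.75}

print(gated_execute(lambda: "executed", passing, thresholds))
try:
    gated_execute(lambda: "executed", failing, thresholds)
except ActionBlocked as err:
    print(err)
```

The exception message doubles as a record of why the action was blocked.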

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Evaluation is automated only; no human-in-the-loop validation performed.

Action-selection and enforcement are conceptual and not implemented.

When Not To Use

Where real-time enforcement or automated blocking is required now (framework lacks action execution).

In high-stakes settings until human validation confirms judge reliability.

Failure Modes

Judge hallucinations when RAG is disabled or retrieval fails.

Conflicting sub-agent scores (e.g., sovereignty vs carbon) without a robust resolution policy.

Core Entities

Models

Mistral-7B-Instruct-v0.2

Metrics

BERTScore

Numeric score per dimension (0.0–1.0)

∆ Score (with vs without RAG)

Benchmarks

Internal test set (SOV/CAR/COM/ETH test ids)