COMPASS: a multi-agent orchestration that uses RAG and an LLM-as-judge to enforce sovereignty, carbon-awareness, compliance, and ethics in real time

Overview

Decision Snapshot: Needs Validation

The idea is modular and useful, but the paper provides only automated, model-to-model evaluation; no human validation or deployed action-selection was implemented.

Citations: 0

Evidence Strength: 0.50

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Jean-Sébastien Dessureault, Alain-Thierry Iliho Manzi, Soukaina Alaoui Ismaili, Khadim Lo, Mireille Lalancette, Éric Bélanger

Links

Abstract / PDF

Why It Matters For Business

COMPASS lets firms check legal, ethical, and carbon constraints before an agent acts, lowering regulatory and reputational risk while keeping explainable records of why decisions were blocked or allowed.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

COMPASS is a modular orchestration layer that intercepts agent actions and routes them to four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics). Each sub-agent uses Retrieval-Augmented Generation (RAG) and an LLM-as-a-judge to produce scores, constraints, and short explanations. Automated tests show RAG changes judgments (notably +0.25 in many sovereignty cases and -0.25 in many compliance cases) and raises the semantic grounding of explanations (BERTScore ~75–85%). The system currently implements only evaluation and explainability; action-selection and human validation are left for future work.

Problem Statement

Agentic LLM systems make autonomous choices that can conflict with local law, energy targets, and ethical norms. Existing governance tools treat these dimensions separately or post-hoc. Practitioners need an explainable, real-time layer that checks actions across sovereignty, sustainability, compliance, and ethics before execution.

Main Contribution

Design of COMPASS: an Orchestrator plus four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics) that evaluate requests before action.

Integration of RAG per sub-agent so judgments are grounded in context-specific documents and local rules.
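
The composition described above (one Orchestrator holding four specialist sub-agents, each retrieving context and then calling a judge) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the class names, the `retrieve` and `judge` callables, the stub agent, and the "higher score is more acceptable" convention are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

DIMENSIONS = ("sovereignty", "carbon", "compliance", "ethics")


@dataclass
class Verdict:
    dimension: str
    score: float        # 0.0-1.0; "higher is more acceptable" is an assumed convention
    explanation: str


@dataclass
class SubAgent:
    dimension: str
    retrieve: Callable[[str], List[str]]         # hypothetical RAG retriever
    judge: Callable[[str, List[str]], Verdict]   # hypothetical LLM-as-a-judge call

    def evaluate(self, request: str) -> Verdict:
        # Ground the judgment in retrieved documents before scoring.
        context = self.retrieve(request)
        return self.judge(request, context)


@dataclass
class Orchestrator:
    agents: List[SubAgent]

    def evaluate(self, request: str) -> Dict[str, Verdict]:
        # Collect one verdict per dimension; no action is executed here.
        return {agent.dimension: agent.evaluate(request) for agent in self.agents}


def make_stub_agent(dimension: str) -> SubAgent:
    """Build a placeholder agent so the sketch runs without a model or vector DB."""
    def retrieve(request: str) -> List[str]:
        return [f"{dimension} policy excerpt (placeholder)"]

    def judge(request: str, context: List[str]) -> Verdict:
        note = f"Judged against {len(context)} retrieved document(s)."
        return Verdict(dimension, 0.5, note)

    return SubAgent(dimension, retrieve, judge)


orchestrator = Orchestrator([make_stub_agent(d) for d in DIMENSIONS])
report = orchestrator.evaluate("Store customer data in a foreign cloud region")
```

In a real setup the stub `retrieve` and `judge` functions would be swapped for a vector-store query and a model call; the Orchestrator itself would not need to change.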

Key Findings

RAG changed Sovereignty judgments upward in multiple tests.

Numbers: ∆ Score = +0.25 in 5 of 10 SOV tests (e.g., SOV-01, SOV-06, SOV-07, SOV-08, SOV-10)

Practical Use: Adding RAG over local documents can raise sovereignty scores; in these tests it added 0.25 in half of the cases.

Evidence Ref: Table 5

RAG often lowered Compliance scores for tested cases.

Numbers: ∆ Score = -0.25 in 5 of 10 COM tests (COM-01, COM-02, COM-05, COM-07, COM-10)

Practical Use: Grounding evaluations in regulations can reveal compliance gaps and reduce false-positive approvals from ungrounded judges.

Evidence Ref: Table 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sovereignty ∆ Score (with vs without RAG)	+0.25 (observed per test where noted)	Score without RAG	+0.25 in 5 of 10 SOV tests	Table 5 (SOV-01..SOV-10)	RAG raised several sovereignty scores from 0.25 to 0.5 or 0.5 to 0.75	Table 5
Compliance ∆ Score (with vs without RAG)	-0.25 (observed per test where noted)	Score without RAG	-0.25 in 5 of 10 COM tests	Table 7 (COM-01..COM-10)	RAG lowered some compliance scores (e.g., 0.50→0.25)	Table 7
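
The ∆ Score in the table is simply the score with RAG minus the score without RAG for the same test. A small sketch of that arithmetic is shown below. The test ids and score pairs are illustrative placeholders that mirror the kinds of transitions mentioned above (0.25 to 0.5, 0.5 to 0.75, 0.50 to 0.25); they are not rows from the paper's tables.

```python
from collections import Counter

# Illustrative (score_without_rag, score_with_rag) pairs; NOT the paper's data.
pairs = {
    "example-1": (0.25, 0.50),
    "example-2": (0.50, 0.75),
    "example-3": (0.50, 0.25),
    "example-4": (0.50, 0.50),
}


def delta_scores(pairs):
    """Return score_with_rag - score_without_rag for every test id."""
    return {
        test_id: round(with_rag - without_rag, 2)
        for test_id, (without_rag, with_rag) in pairs.items()
    }


deltas = delta_scores(pairs)

# Count how many tests moved by each amount, e.g. how many shifted by +0.25.
summary = Counter(deltas.values())
```

Tallying the deltas this way gives the "N of 10 tests" style of figure reported in the Delta column.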

What To Try In 7 Days

Run a lightweight RAG pipeline for one compliance rule and compare judge outputs with/without RAG.

Attach a small Orchestrator wrapper to an internal chatbot to emit per-request scores and short explanations.

Collect a short set of local regulation and policy documents into a vector DB for immediate RAG tests.
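
As a starting point for the first experiment, the with/without-RAG comparison can be set up before any model or vector DB is involved. The sketch below uses naive word overlap as a stand-in for vector search and only builds the two judge prompts; the documents, the prompt wording, and the function names are all hypothetical, and the actual judge call is left out.

```python
import re
from typing import List

# Placeholder policy snippets, not real regulations.
DOCS = [
    "Personal data of residents must be stored on servers located in the country.",
    "Model training jobs should be scheduled when grid carbon intensity is low.",
]


def tokens(text: str) -> set:
    """Lowercase word set used for the naive overlap ranking."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents sharing the most words with the query."""
    query_tokens = tokens(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokens(d)), reverse=True)
    return ranked[:k]


def build_judge_prompt(request: str, context: List[str]) -> str:
    """Assemble the judge prompt; an empty context gives the no-RAG baseline."""
    header = "Rate the request from 0.0 to 1.0 and explain briefly.\n"
    context_block = "".join(f"Context: {c}\n" for c in context)
    return f"{header}{context_block}Request: {request}"


request = "Store personal data of residents on servers abroad"
prompt_without_rag = build_judge_prompt(request, [])
prompt_with_rag = build_judge_prompt(request, retrieve(request, DOCS))
```

Sending both prompts to the same judge model and comparing the returned scores would reproduce the with/without-RAG setup on a single rule.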

Agent Features

Memory

Retrieval memory via dynamic document vector stores

Planning

Decision synthesis (constraint aggregation and weighted scoring)
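
One way to read "constraint aggregation and weighted scoring" is a weighted mean over the per-dimension scores plus a de-duplicated union of the constraints each sub-agent returns. The sketch below follows that reading. The formula, the equal weights, and the sample constraints are assumptions, since this summary does not give the paper's actual aggregation rule.

```python
from typing import Dict, List, Tuple


def synthesize(
    scores: Dict[str, float],
    weights: Dict[str, float],
    constraints: Dict[str, List[str]],
) -> Tuple[float, List[str]]:
    """Return (weighted mean score, merged constraint list without duplicates)."""
    total_weight = sum(weights[d] for d in scores)
    overall = sum(scores[d] * weights[d] for d in scores) / total_weight

    merged: List[str] = []
    for dimension in scores:
        for constraint in constraints.get(dimension, []):
            if constraint not in merged:   # keep first occurrence, preserve order
                merged.append(constraint)
    return overall, merged


# Hypothetical inputs for illustration only.
scores = {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.25, "ethics": 0.5}
weights = {dimension: 1.0 for dimension in scores}
constraints = {
    "sovereignty": ["keep data in-region"],
    "compliance": ["keep data in-region", "log consent"],
}

overall, merged = synthesize(scores, weights, constraints)
```

With equal weights the overall score is the plain average; raising one dimension's weight lets a deployment prioritise, say, compliance over carbon.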

Tool Use

Retrieval-Augmented Generation (RAG)

Vector DB queries

Frameworks

LLM-as-a-judge

Local LLM instantiation (Mistral-7B config provided)

Is Agentic

Yes

Architectures

Multi-agent Orchestration

Composition-based OOP (Orchestrator + sub-agents)

Collaboration

Synchronous sub-agent evaluation and conflict resolution

Optimization Features

Token Efficiency

Conservative generation (max tokens 256) in judge prompts

Infra Optimization

Reference to CodeCarbon and energy-intensity queries for carbon awareness

System Optimization

Orchestrator prevents execution until thresholds satisfied (runtime gating)
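
Runtime gating of this kind can be pictured as a per-dimension threshold check that must pass before an action is released. The sketch below is illustrative only: the threshold values and the fail-closed handling of missing scores are assumptions, and the summary notes elsewhere that enforcement is still conceptual in the paper.

```python
from typing import Dict, List, Tuple


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (allowed, failing dimensions).

    A dimension with no score is treated as 0.0, so the gate fails closed.
    """
    failing = sorted(
        dimension
        for dimension, minimum in thresholds.items()
        if scores.get(dimension, 0.0) < minimum
    )
    return (not failing, failing)


# Hypothetical thresholds and scores for illustration.
thresholds = {"sovereignty": 0.5, "carbon": 0.25, "compliance": 0.5, "ethics": 0.5}
scores = {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.25, "ethics": 0.5}

allowed, failing = gate(scores, thresholds)
```

Returning the failing dimensions alongside the decision keeps the block explainable, which matches the framework's emphasis on recording why a request was stopped.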

Inference Optimization

Carbon-aware inference monitoring (design concept)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Evaluation is automated only; no human-in-the-loop validation performed.

Action-selection and enforcement are conceptual and not implemented.

When Not To Use

Where real-time enforcement or automated blocking is required now (framework lacks action execution).

In high-stakes settings until human validation confirms judge reliability.

Failure Modes

Judge hallucinations when RAG is disabled or retrieval fails.

Conflicting sub-agent scores (e.g., sovereignty vs carbon) without a robust resolution policy.
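
To make the second failure mode concrete, a simple guard could flag pairs of dimensions whose scores diverge sharply and fall back to the most restrictive score until a person reviews the case. This is one possible policy sketched under assumptions (the `max_gap` value and the "lowest score wins" rule are hypothetical), not a resolution method taken from the paper.

```python
from typing import Dict, List, Tuple


def find_conflicts(scores: Dict[str, float], max_gap: float = 0.75) -> List[Tuple[str, str]]:
    """Return dimension pairs whose scores differ by at least max_gap."""
    dims = sorted(scores)
    return [
        (a, b)
        for i, a in enumerate(dims)
        for b in dims[i + 1:]
        if abs(scores[a] - scores[b]) >= max_gap
    ]


def resolve(scores: Dict[str, float], max_gap: float = 0.75) -> Dict[str, object]:
    """Conservative policy: use the lowest score and escalate when conflicts exist."""
    conflicts = find_conflicts(scores, max_gap)
    return {
        "effective_score": min(scores.values()),
        "conflicts": conflicts,
        "needs_human_review": bool(conflicts),
    }


# Hypothetical scores where sovereignty and carbon pull in opposite directions.
decision = resolve({"sovereignty": 1.0, "carbon": 0.25, "compliance": 0.75, "ethics": 0.75})
```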

Core Entities

Models

Mistral-7B-Instruct-v0.2

Metrics

BERTScore

Numeric score per dimension (0.0–1.0)

∆ Score (with vs without RAG)

Benchmarks

Internal test set (SOV/CAR/COM/ETH test ids)
