COMPASS: a multi-agent orchestration that uses RAG and an LLM-as-judge to enforce sovereignty, carbon-awareness, compliance, and ethics in real time

March 11, 2026 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Jean-Sébastien Dessureault, Alain-Thierry Iliho Manzi, Soukaina Alaoui Ismaili, Khadim Lo, Mireille Lalancette, Éric Bélanger

Why It Matters For Business

COMPASS lets firms check legal, ethical, and carbon constraints before an agent acts, lowering regulatory and reputational risk while keeping explainable records of why decisions were blocked or allowed.

Summary TLDR

COMPASS is a modular orchestration layer that intercepts agent actions and routes them to four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics). Each sub-agent uses Retrieval-Augmented Generation (RAG) and an LLM-as-a-judge to produce scores, constraints, and short explanations. Automated tests show that RAG shifts judgments (+0.25 in 5 of 10 sovereignty tests and -0.25 in 5 of 10 compliance tests) and yields explanations that stay semantically close to the non-RAG ones while being better grounded (BERTScore mostly 75–85%). The system currently implements only evaluation and explainability; action selection and human validation are left for future work.

Problem Statement

Agentic LLM systems make autonomous choices that can conflict with local law, energy targets, and ethical norms. Existing governance tools treat these dimensions separately or post-hoc. Practitioners need an explainable, real-time layer that checks actions across sovereignty, sustainability, compliance, and ethics before execution.

Main Contribution

Design of COMPASS: an Orchestrator plus four specialist sub-agents (Sovereignty, Carbon, Compliance, Ethics) that evaluate requests before action.

Integration of RAG per sub-agent so judgments are grounded in context-specific documents and local rules.

Use of an "LLM-as-a-judge" pipeline to emit numeric scores and short explainable justifications for each dimension.

Automated evaluation (no humans yet) showing RAG changes scores and improves semantic coherence of explanations (BERTScore).

A composition-based software pattern that forces governance checks via inheritance and modular interfaces.
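
A minimal sketch of how such a composition-based layout might look, assuming a shared sub-agent interface. The class names, the `Verdict` fields, and the fixed scores are hypothetical placeholders, not the authors' code; a real sub-agent would run retrieval and an LLM judge inside `evaluate`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Verdict:
    dimension: str
    score: float          # 0.0-1.0, matching the per-dimension score range
    explanation: str


class SubAgent(ABC):
    """Interface every specialist must implement (hypothetical)."""

    dimension: str

    @abstractmethod
    def evaluate(self, request: str) -> Verdict: ...


class StubAgent(SubAgent):
    """Placeholder specialist that returns a fixed score."""

    def __init__(self, dimension: str, score: float):
        self.dimension = dimension
        self._score = score

    def evaluate(self, request: str) -> Verdict:
        return Verdict(self.dimension, self._score, f"stub verdict for: {request}")


class Orchestrator:
    """Holds the specialists by composition and consults all of them."""

    def __init__(self, agents: list[SubAgent]):
        self.agents = agents

    def review(self, request: str) -> dict[str, Verdict]:
        return {a.dimension: a.evaluate(request) for a in self.agents}


# Illustrative scores only; they are not results from the paper.
orchestrator = Orchestrator([
    StubAgent("sovereignty", 0.75),
    StubAgent("carbon", 0.5),
    StubAgent("compliance", 0.5),
    StubAgent("ethics", 1.0),
])
verdicts = orchestrator.review("store customer data in region X")
```

Because the Orchestrator only depends on the `SubAgent` interface, a specialist can be swapped or added without touching the review loop.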

Key Findings

RAG changed Sovereignty judgments upward in multiple tests.

Numbers: ∆ Score = +0.25 in 5 of 10 SOV tests (e.g., SOV-01, SOV-06, SOV-07, SOV-08, SOV-10)

RAG often lowered Compliance scores for tested cases.

Numbers: ∆ Score = -0.25 in 5 of 10 COM tests (COM-01, COM-02, COM-05, COM-07, COM-10)

Explanations from the non-augmented and RAG-augmented judges are semantically similar, but the RAG versions are more clearly grounded.

Numbers: BERTScore similarity values mostly 75%–85% across tables; minima ≈66%, maxima ≈90%

Results

Sovereignty ∆ Score (with vs without RAG)

Value: +0.25 (observed per test where noted)

Baseline: Score without RAG

Compliance ∆ Score (with vs without RAG)

Value: -0.25 (observed per test where noted)

Baseline: Score without RAG

Explanation semantic similarity (BERTScore)

Value: ≈75%–85% typical

Baseline: n/a (compares non-RAG vs RAG explanations)
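
For reference, the ∆ Score reported above can be read as the RAG-augmented score minus the score without RAG. The sketch below assumes that sign convention (it matches the "upward" sovereignty and "lowered" compliance findings); the test ids and raw scores are illustrative placeholders, not values from the paper's tables.

```python
def delta_score(with_rag: float, without_rag: float) -> float:
    """Delta score: RAG-augmented score minus the baseline (no-RAG) score."""
    return round(with_rag - without_rag, 2)


# Illustrative values only, chosen to reproduce the +0.25 / -0.25 pattern.
illustrative = {
    "SOV-example": {"without_rag": 0.50, "with_rag": 0.75},
    "COM-example": {"without_rag": 0.75, "with_rag": 0.50},
}

deltas = {
    test_id: delta_score(s["with_rag"], s["without_rag"])
    for test_id, s in illustrative.items()
}
```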

Who Should Care

What To Try In 7 Days

Run a lightweight RAG pipeline for one compliance rule and compare judge outputs with/without RAG.

Attach a small Orchestrator wrapper to an internal chatbot to emit per-request scores and short explanations.

Collect a short set of local regulation and policy documents into a vector DB for immediate RAG tests.
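
The first experiment above could start from a harness like the following sketch. It assumes a toy bag-of-words retriever in place of a real vector DB and a stub in place of the LLM judge; `retrieve`, `compare`, `stub_judge`, and the sample documents are all hypothetical and exist only to show the with/without-RAG comparison.

```python
import math
from collections import Counter
from typing import Callable


def _vec(text: str) -> Counter:
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Toy stand-in for a vector DB: rank documents by bag-of-words cosine."""
    q = _vec(query)
    return sorted(docs, key=lambda d: _cosine(q, _vec(d)), reverse=True)[:k]


Judge = Callable[[str], float]  # prompt in, score in [0, 1] out


def compare(request: str, docs: list[str], judge: Judge) -> dict[str, float]:
    """Score the same request with and without retrieved context."""
    base_prompt = f"Rate compliance of this request from 0 to 1:\n{request}"
    context = "\n".join(retrieve(request, docs))
    rag_prompt = f"Relevant policy:\n{context}\n\n{base_prompt}"
    without, with_rag = judge(base_prompt), judge(rag_prompt)
    return {"without_rag": without, "with_rag": with_rag, "delta": with_rag - without}


def stub_judge(prompt: str) -> float:
    # Placeholder for an LLM call: scores lower when a prohibition is in the prompt.
    return 0.5 if "must not" in prompt else 0.75


docs = [
    "Personal data must not leave the country without consent.",
    "Office plants should be watered weekly.",
]
result = compare("Export personal data to a foreign server", docs, stub_judge)
```

Replacing `stub_judge` with a call to a real model and `retrieve` with a vector DB query keeps the comparison logic unchanged.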

Agent Features

Memory

  • Retrieval memory via dynamic document vector stores

Planning

  • Decision synthesis (constraint aggregation and weighted scoring)
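
One way to picture decision synthesis is a weighted mean of the per-dimension scores combined with a de-duplicated union of the constraints each sub-agent emits. This is a sketch under that assumption; the summary does not specify the actual aggregation formula, and the weights, scores, and constraint strings below are hypothetical.

```python
def synthesize(
    scores: dict[str, float],
    weights: dict[str, float],
    constraints: dict[str, list[str]],
) -> tuple[float, list[str]]:
    """Return the weighted mean score and the merged constraint list."""
    total_weight = sum(weights[d] for d in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    overall = sum(scores[d] * weights[d] for d in scores) / total_weight

    merged: list[str] = []
    for dim in scores:
        for c in constraints.get(dim, []):
            if c not in merged:  # keep first-seen order, drop duplicates
                merged.append(c)
    return overall, merged


# Illustrative inputs only.
scores = {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.5, "ethics": 1.0}
weights = {"sovereignty": 1.0, "carbon": 1.0, "compliance": 2.0, "ethics": 1.0}
constraints = {
    "sovereignty": ["keep data in-region"],
    "compliance": ["keep data in-region", "log consent"],
}
overall, merged = synthesize(scores, weights, constraints)
```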

Tool Use

  • Retrieval-Augmented Generation (RAG)
  • Vector DB queries

Frameworks

  • LLM-as-a-judge
  • Local LLM instantiation (Mistral-7B config provided)

Is Agentic

true

Architectures

  • Multi-agent Orchestration
  • Composition-based OOP (Orchestrator + sub-agents)

Collaboration

  • Synchronous sub-agent evaluation and conflict resolution

Optimization Features

Token Efficiency

  • Conservative generation (max tokens 256) in judge prompts

Infra Optimization

  • Reference to CodeCarbon and energy-intensity queries for carbon awareness

System Optimization

  • Orchestrator prevents execution until thresholds satisfied (runtime gating)
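
Runtime gating of this kind could be sketched as a wrapper that refuses to run an action unless every dimension meets its threshold. The Limitations section notes that enforcement is conceptual in the paper, so this only illustrates the idea; `gated`, `GovernanceBlocked`, and the threshold values are hypothetical.

```python
from typing import Callable


class GovernanceBlocked(Exception):
    """Raised when at least one score is below its threshold (hypothetical)."""


def gated(
    scores: dict[str, float],
    thresholds: dict[str, float],
    action: Callable[[], str],
) -> str:
    """Run `action` only if every dimension meets its threshold."""
    failing = {d: s for d, s in scores.items() if s < thresholds.get(d, 0.0)}
    if failing:
        raise GovernanceBlocked(f"blocked by: {sorted(failing)}")
    return action()


# Illustrative thresholds and scores.
thresholds = {"sovereignty": 0.5, "carbon": 0.5, "compliance": 0.5, "ethics": 0.5}

allowed = gated(
    {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.5, "ethics": 1.0},
    thresholds,
    lambda: "executed",
)

try:
    gated(
        {"sovereignty": 0.75, "carbon": 0.5, "compliance": 0.25, "ethics": 1.0},
        thresholds,
        lambda: "executed",
    )
    blocked_reason = None
except GovernanceBlocked as exc:
    blocked_reason = str(exc)
```

Raising an exception rather than returning a flag means a caller cannot proceed by accident, and the message doubles as a short record of which dimension blocked the action.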

Inference Optimization

  • Carbon-aware inference monitoring (design concept)

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Evaluation is automated only; no human-in-the-loop validation performed.
  • Action-selection and enforcement are conceptual and not implemented.
  • RAG document curation and trustworthiness not systematically studied.
  • LLM-as-judge bias and calibration are acknowledged but not mitigated here.
  • No released code or datasets at time of writing.

When Not To Use

  • Where real-time enforcement or automated blocking is required now (framework lacks action execution).
  • In high-stakes settings until human validation confirms judge reliability.
  • If you cannot supply trusted, up-to-date documents for RAG.

Failure Modes

  • Judge hallucinations when RAG is disabled or retrieval fails.
  • Conflicting sub-agent scores (e.g., sovereignty vs carbon) without a robust resolution policy.
  • Poor document curation leading to incorrect grounding or outdated laws.
  • Over-reliance on a single LLM judge may inherit its biases.

Core Entities

Models

  • Mistral-7B-Instruct-v0.2

Metrics

  • BERTScore
  • Numeric score per dimension (0.0–1.0)
  • ∆ Score (with vs without RAG)

Benchmarks

  • Internal test set (SOV/CAR/COM/ETH test ids)