Overview
The approach is a practical prototype: the mean Total Hallucination Score falls at each review stage across 310 prompts, but results depend on prompt type and on the behavior of proprietary LLMs.
Citations: 3
Evidence Strength: 0.60
Confidence: 0.78
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Structured multi-agent review with short metadata handoffs reduces misleading AI outputs and makes speculative content overt. This is useful for customer-facing assistants, regulated domains, and brand safety.
Summary TLDR
The authors build a three-stage agent pipeline, plus a fourth KPI-evaluator agent, in which agents exchange structured OVON JSON messages to detect and reframe hallucinated text. They run 310 prompts engineered to induce hallucinations. The KPI agent measures four new metrics (FCD, FGR, FDF, ECS) and computes a Total Hallucination Score (THS). Mean THS falls from -0.0049 (front-end) to -0.0456 (second reviewer) to -0.1396 (third reviewer); a lower score corresponds to fewer hallucination indicators, so each review stage improves on the previous one. The practical pattern: have reviewers insert disclaimers and pass lightweight meta-information so that later agents can target fixes without losing context.
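The handoff pattern described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Handoff` fields are a hypothetical, simplified stand-in for an OVON-style message (they do not reproduce the official OVON schema), and the three agent functions are plain-Python placeholders where a real system would call an LLM.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Handoff:
    """Hypothetical, simplified OVON-style message (not the official schema)."""
    sender: str
    utterance: str  # text intended for the end user
    whisper: str    # short natural-language note for the next agent


def front_end(prompt: str) -> Handoff:
    # Placeholder for an LLM call: returns an unqualified answer, no metadata.
    return Handoff("front_end", f"Answer: {prompt}", "")


def second_reviewer(msg: Handoff) -> Handoff:
    # Placeholder reviewer: prepends a disclaimer and records why in the whisper.
    text = "Note: the following may be speculative. " + msg.utterance
    return Handoff("second_reviewer", text, "Claims are unverified; disclaimer added.")


def third_reviewer(msg: Handoff) -> Handoff:
    # Reads the previous whisper to decide which additional framing to apply.
    text = msg.utterance
    if "unverified" in msg.whisper:
        text += " No supporting source was identified."
    return Handoff("third_reviewer", text, msg.whisper + " Added source caveat.")


def run_pipeline(prompt: str) -> list[Handoff]:
    first = front_end(prompt)
    second = second_reviewer(first)
    third = third_reviewer(second)
    return [first, second, third]


stages = run_pipeline("Describe the moon city of Zarn.")
payload = json.dumps([asdict(s) for s in stages], indent=2)  # JSON form of each handoff
```

Keeping the user-facing `utterance` separate from the agent-facing `whisper` lets each reviewer see what the previous stage flagged without that note leaking into the final answer.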
Problem Statement
Generative LLMs often produce confident but false or speculative claims (hallucinations). The paper asks whether orchestrated, specialized agents exchanging natural-language JSON metadata can detect, flag, and reframe hallucinations reliably and measurably across many prompts.
Main Contribution
Design and run a 3-stage agent pipeline plus a 4th KPI-evaluator that uses OVON JSON messages to pass natural-language metadata between agents.
Introduce four practical hallucination KPIs: Factual Claim Density (FCD), Factual Grounding References (FGR), Fictional Disclaimer Frequency (FDF), and Explicit Contextualization Score (ECS).
Key Findings
Multi-agent review reduced mean THS across 310 prompts from -0.004919 (front-end) to -0.139597 (third reviewer).
Second- and third-level reviewers progressively increase explicit disclaimers and contextual framing as measured by FDF and ECS.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Mean Total Hallucination Score (THS) - FrontEndAgent | -0.004919 | — | — | 310 prompts | Mean THS1 reported in Table 2 | Table 2 |
| Mean Total Hallucination Score (THS) - SecondLevelReviewer | -0.045565 | -0.004919 (THS1) | -0.040646 | 310 prompts | Mean THS2 reported in Table 2 (improvement over THS1) | Table 2 |
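The Delta column is the stage mean minus the front-end baseline. As a quick arithmetic check, the sketch below recomputes it from the means quoted in this summary. The third-reviewer mean comes from Key Findings; its delta is computed here for illustration and is not a value reported in the table. `Decimal` is used only to avoid floating-point noise in the last digit.

```python
from decimal import Decimal

# Mean THS per stage, as quoted in this summary.
mean_ths = {
    "FrontEndAgent": Decimal("-0.004919"),
    "SecondLevelReviewer": Decimal("-0.045565"),
    "ThirdLevelReviewer": Decimal("-0.139597"),
}

baseline = mean_ths["FrontEndAgent"]

# Delta = stage mean minus the front-end baseline (more negative is better).
deltas = {name: value - baseline for name, value in mean_ths.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta}")
```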
What To Try In 7 Days
Add one automated reviewer agent that inserts explicit disclaimers for speculative responses.
Adopt a lightweight JSON whisper field (OVON-style) to pass a 1–2 sentence hallucination summary to downstream modules.
Compute simple counts: FCD and FDF per 100 words to monitor hallucination trends in your outputs.
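One rough way to get started on the per-100-words counts is shown below. It is a keyword heuristic under stated assumptions, not the paper's KPI definitions: the `DISCLAIMER_PHRASES` list is hypothetical, and "factual claim" is approximated as any sentence that contains none of those phrases.

```python
import re

# Hypothetical marker list; tune it to the disclaimers your reviewers actually emit.
DISCLAIMER_PHRASES = ("fictional", "speculative", "no evidence", "not verified")


def per_100_words(count: int, n_words: int) -> float:
    # Normalize a raw count to a rate per 100 words; empty text yields 0.0.
    return 100.0 * count / n_words if n_words else 0.0


def simple_kpis(text: str) -> dict:
    lowered = text.lower()
    n_words = len(re.findall(r"\b\w+\b", text))
    sentences = [s for s in re.split(r"[.!?]+", lowered) if s.strip()]
    # FDF proxy: total occurrences of disclaimer phrases.
    disclaimers = sum(lowered.count(p) for p in DISCLAIMER_PHRASES)
    # FCD proxy (assumption): sentences with no disclaimer phrase count as claims.
    unhedged = sum(
        1 for s in sentences if not any(p in s for p in DISCLAIMER_PHRASES)
    )
    return {
        "fcd": per_100_words(unhedged, n_words),
        "fdf": per_100_words(disclaimers, n_words),
    }


sample = "The city has two million residents. This account is fictional."
scores = simple_kpis(sample)
```

Tracked over time, a falling `fcd` together with a rising `fdf` would suggest that reviewers are adding framing, although, like the KPIs in the paper, these counts are lexical proxies and not fact checks.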
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Relies on proprietary LLMs; internal reasoning paths are opaque.
KPIs are proxy indicators (stylistic and lexical), not ground-truth fact checks.
When Not To Use
When you need authoritative fact verification backed by external sources rather than framing or disclaimers.
When regulatory audits require traceable, source-backed claims instead of stylistic mitigation.
Failure Modes
Reviewer agents may repeat or amplify false claims if metadata is ambiguous.
Purely fictional prompts resist grounding; mitigation will mainly add disclaimers, not facts.

