Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
3
Why It Matters For Business
Structured multi-agent review with short metadata handoffs reduces misleading AI outputs and makes speculative content explicit, which is useful for customer-facing assistants, regulated domains, and brand safety.
Summary TLDR
The authors build a 3-stage agent pipeline (plus a 4th KPI evaluator) that exchanges structured OVON JSON messages to detect and reframe hallucinated text. They run 310 prompts engineered to induce hallucinations. A KPI agent measures four new metrics (FCD, FGR, FDF, ECS) and computes a Total Hallucination Score (THS). Mean THS falls from -0.0049 (front-end) to -0.0456 (second reviewer) to -0.1396 (third reviewer); a more negative score here corresponds to fewer hallucination indicators, so each stage shows a further reduction. The pipeline is a practical pattern: have reviewers insert disclaimers and send lightweight meta-information so later agents can target fixes without losing context.
Problem Statement
Generative LLMs often produce confident but false or speculative claims (hallucinations). The paper asks whether orchestrated, specialized agents exchanging natural-language JSON metadata can detect, flag, and reframe hallucinations reliably and measurably across many prompts.
Main Contribution
Design and run a 3-stage agent pipeline plus a 4th KPI-evaluator that uses OVON JSON messages to pass natural-language metadata between agents.
Introduce four practical hallucination KPIs: Factual Claim Density (FCD), Factual Grounding References (FGR), Fictional Disclaimer Frequency (FDF), and Explicit Contextualization Score (ECS).
Test the pipeline empirically on 310 hallucination-inducing prompts, showing progressive reductions in a computed Total Hallucination Score (THS) after each reviewer stage.
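The staged design above can be sketched as a chain of functions, each receiving the previous agent's text plus a short note. This is a minimal illustration and not the authors' implementation: the reviewer functions are stubs standing in for LLM calls, and the names `Handoff`, `front_end`, `second_reviewer`, `third_reviewer`, and `run_pipeline` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Handoff:
    """Text plus a short natural-language note for the next agent (assumed shape)."""
    utterance: str  # the answer text as it currently stands
    whisper: str    # 1-2 sentence note about suspected hallucination


def front_end(prompt: str) -> Handoff:
    # Stub: a real front-end agent would call an LLM here.
    return Handoff(utterance=f"Answer to: {prompt}", whisper="")


def second_reviewer(msg: Handoff) -> Handoff:
    # Stub: prepend a disclaimer and tell the next agent why it was added.
    text = "Note: the following may be speculative. " + msg.utterance
    return Handoff(utterance=text, whisper="Claims are unverified; disclaimer added.")


def third_reviewer(msg: Handoff) -> Handoff:
    # Stub: reuse the upstream note instead of re-deriving it from the full context.
    text = msg.utterance
    if msg.whisper:
        text += " (Reviewer context: " + msg.whisper + ")"
    return Handoff(utterance=text, whisper="Contextual framing added.")


def run_pipeline(prompt: str) -> List[Handoff]:
    """Run the stages in order and keep every intermediate output for scoring."""
    stages: List[Callable[[Handoff], Handoff]] = [second_reviewer, third_reviewer]
    outputs = [front_end(prompt)]
    for stage in stages:
        outputs.append(stage(outputs[-1]))
    return outputs
```

Keeping every intermediate output, rather than only the final text, is what would let a separate KPI evaluator score each stage independently.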
Key Findings
Multi-agent review reduced mean THS across 310 prompts from -0.004919 (front-end) to -0.139597 (third reviewer).
Second- and third-level reviewers progressively increase explicit disclaimers and contextual framing as measured by FDF and ECS.
The mitigation effect varies by prompt: some prompts with plausible grounding saw large relative THS declines, while purely fantastical prompts saw modest ones (e.g., 700% vs. 33%).
Results
Mean Total Hallucination Score (THS) - FrontEndAgent: -0.0049
Mean Total Hallucination Score (THS) - SecondLevelReviewer: -0.0456
Mean Total Hallucination Score (THS) - ThirdLevelReviewer: -0.1396
Who Should Care
What To Try In 7 Days
Add one automated reviewer agent that inserts explicit disclaimers for speculative responses.
Adopt a lightweight JSON whisper field (OVON-style) to pass a 1–2 sentence hallucination summary to downstream modules.
Compute simple counts: FCD and FDF per 100 words to monitor hallucination trends in your outputs.
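As a starting point for the counts suggested above, the sketch below computes rough per-100-word rates. It is a crude lexical heuristic and not the paper's KPI definitions: the phrase list, the regex used to spot "claims", and the function names `fcd` and `fdf` are all assumptions chosen for illustration.

```python
import re

# Hypothetical phrase list; tune it to your own domain and style guide.
DISCLAIMER_PHRASES = ("fictional", "speculative", "hypothetical", "not verified", "no evidence")

# Assumed proxy for a factual claim: a sentence with a digit or a plain assertive verb.
CLAIM_PATTERN = re.compile(r"\d|\b(is|are|was|were|has|have)\b", re.IGNORECASE)


def per_100_words(count: int, text: str) -> float:
    """Normalize a raw count to a rate per 100 whitespace-separated words."""
    words = len(text.split())
    return 100.0 * count / words if words else 0.0


def fdf(text: str) -> float:
    """Disclaimer phrases per 100 words."""
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in DISCLAIMER_PHRASES)
    return per_100_words(hits, text)


def fcd(text: str) -> float:
    """Claim-like sentences (without any disclaimer phrase) per 100 words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    claims = 0
    for sentence in sentences:
        lowered = sentence.lower()
        if any(phrase in lowered for phrase in DISCLAIMER_PHRASES):
            continue  # a hedged sentence is not counted as an unqualified claim
        if CLAIM_PATTERN.search(sentence):
            claims += 1
    return per_100_words(claims, text)
```

Tracked over time, a falling `fcd` together with a rising `fdf` would suggest that outputs are being hedged more often, though neither number says whether any individual claim is true.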
Agent Features
Memory
- Short-term conversational context passed via JSON whisper fields
Planning
- Sequential review with role-specific instructions
Tool Use
- OVON conversation envelopes for handoffs
- LLM-based reviewer agents
Frameworks
- OVON
- Autogen
Is Agentic
true
Architectures
- OVON JSON-based multi-agent pipeline
- Autogen orchestration
Collaboration
- Natural-language metadata (utterance + whisper context/value) shared across agents
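A handoff carrying an utterance plus a whisper context and value might be serialized as below. This is a simplified, OVON-inspired layout and not the official OVON envelope schema: the key names and the helper functions `build_message` and `read_whisper` are assumptions for illustration.

```python
import json
from typing import Tuple


def build_message(sender: str, utterance: str,
                  whisper_context: str, whisper_value: str) -> str:
    """Serialize one agent-to-agent handoff as a JSON string (assumed layout)."""
    message = {
        "sender": sender,
        "utterance": utterance,          # the text shown to the user
        "whisper": {
            "context": whisper_context,  # short note on why the text is suspect
            "value": whisper_value,      # short note on what the next agent should do
        },
    }
    return json.dumps(message, ensure_ascii=False)


def read_whisper(raw: str) -> Tuple[str, str]:
    """Extract the whisper context and value, tolerating a missing whisper."""
    data = json.loads(raw)
    whisper = data.get("whisper", {})
    return whisper.get("context", ""), whisper.get("value", "")
```

A downstream reviewer could call `read_whisper` first and adjust its instructions from the two short strings, which keeps the shared metadata small compared with forwarding the whole conversation.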
Optimization Features
System Optimization
- Use of concise JSON metadata to avoid reprocessing full context
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on proprietary LLMs; internal reasoning paths are opaque.
- KPIs are proxy indicators (stylistic and lexical), not ground-truth fact checks.
- Spot manual checks were limited; full human verification was not performed.
When Not To Use
- If you need authoritative fact verification backed by external sources rather than framing or disclaimers.
- When regulatory audits require traceable, source-backed claims instead of stylistic mitigation.
Failure Modes
- Reviewer agents may repeat or amplify false claims if metadata is ambiguous.
- Purely fictional prompts resist grounding; mitigation will mainly add disclaimers, not facts.
- Very small THS baselines can produce large percentage changes that overstate practical impact.
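The small-baseline issue in the last item can be made concrete by reporting absolute and relative change side by side. The sketch below uses the reported mean THS values for the front-end and third reviewer; the helper names `absolute_change` and `percent_change` are hypothetical.

```python
def absolute_change(before: float, after: float) -> float:
    """Signed difference between two scores."""
    return after - before


def percent_change(before: float, after: float) -> float:
    """Signed change relative to the magnitude of the baseline, in percent."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return 100.0 * (after - before) / abs(before)


front_end_mean = -0.004919  # reported mean THS, front-end
third_mean = -0.139597      # reported mean THS, third reviewer

print("absolute change:", absolute_change(front_end_mean, third_mean))
print("percent change: ", percent_change(front_end_mean, third_mean))
```

Because the baseline is close to zero, the relative figure is very large even though the absolute shift is only about 0.135 score units. A made-up baseline of -0.001 moving to -0.008 would similarly count as a 700% decline for an absolute shift of 0.007, so it is safer to read both numbers together.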
Core Entities
Models
- gpt-3.5-turbo
- gpt-4o
- gpt-o1
Metrics
- FCD
- FGR
- FDF
- ECS
- THS
Datasets
- 310 hallucination-inducing prompts (authors' prompt set)
- pipeline_results_with_ths.csv
Context Entities
Models
- Llama3-70b
- GPT-4 variants

