Use multi-agent pipelines and OVON JSON handoffs to lower LLM hallucinations

January 19, 2025 | 7 min

Overview

Decision Snapshot: Needs Validation

The approach is a practical prototype: clear gains in KPI scores across 310 prompts, but results depend on prompt type and proprietary LLM behavior.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Diego Gosmar, Deborah A. Dahl

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Structured multi-agent review with short metadata handoffs reduces misleading AI outputs and makes speculative content overt—useful for customer-facing assistants, regulated domains, and brand safety.

Summary TLDR

The authors build a three-stage agent pipeline (plus a fourth KPI evaluator) that exchanges structured OVON JSON messages to detect and reframe hallucinated text. They run 310 prompts engineered to induce hallucinations. A KPI agent measures four new metrics (FCD, FGR, FDF, ECS) and computes a Total Hallucination Score (THS), where a lower (more negative) value means fewer hallucination indicators. Mean THS moves from -0.0049 (front-end) to -0.0456 (second reviewer) to -0.1396 (third reviewer), a consistent reduction at each stage. The pipeline is a practical pattern: have reviewers insert disclaimers and send lightweight meta-information so later agents can target fixes without losing context.

Problem Statement

Generative LLMs often produce confident but false or speculative claims (hallucinations). The paper asks whether orchestrated, specialized agents exchanging natural-language JSON metadata can detect, flag, and reframe hallucinations reliably and measurably across many prompts.

Main Contribution

Design and run a three-stage agent pipeline, plus a fourth KPI-evaluator agent, in which agents pass natural-language metadata to one another in OVON JSON messages.

Introduce four practical hallucination KPIs: Factual Claim Density (FCD), Factual Grounding References (FGR), Fictional Disclaimer Frequency (FDF), and Explicit Contextualization Score (ECS).
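To make the relationship between the four KPIs and THS concrete, here is a minimal sketch. This summary does not reproduce the paper's THS formula, so the combination below (FCD raises the score; FGR, FDF, and ECS lower it), the equal weights, and the names `KpiScores` and `total_hallucination_score` are all assumptions for illustration. Only the FDF and ECS values come from the example use case quoted under Key Findings; the FCD and FGR values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiScores:
    fcd: float  # Factual Claim Density
    fgr: float  # Factual Grounding References
    fdf: float  # Fictional Disclaimer Frequency
    ecs: float  # Explicit Contextualization Score


def total_hallucination_score(k: KpiScores, weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Assumed composite: claim density minus the three mitigating signals.

    Lower (more negative) means fewer hallucination indicators.
    The weights and the absence of normalization are illustrative choices.
    """
    w_fcd, w_fgr, w_fdf, w_ecs = weights
    mitigating = w_fgr * k.fgr + w_fdf * k.fdf + w_ecs * k.ecs
    return w_fcd * k.fcd - mitigating


# FDF and ECS follow the example use case (0.1 -> 0.3 and 0.1 -> 0.4);
# FCD and FGR are placeholder values held constant.
before = KpiScores(fcd=0.2, fgr=0.0, fdf=0.1, ecs=0.1)
after = KpiScores(fcd=0.2, fgr=0.0, fdf=0.3, ecs=0.4)

print(total_hallucination_score(before))
print(total_hallucination_score(after))
```

Under these assumptions, raising FDF and ECS alone is enough to push the score down, which matches the direction of the reported THS means.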

Key Findings

Multi-agent review reduced mean THS across 310 prompts from -0.004919 (front-end) to -0.139597 (third reviewer).

Numbers: THS mean: -0.004919 -> -0.139597

Practical Use: Adding two reviewer agents that exchange OVON JSON metadata can measurably reduce hallucination indicators; implement reviewer stages to lower risky claims.

Evidence Ref: Table 2; Figure 2

Second- and third-level reviewers progressively increase explicit disclaimers and contextual framing as measured by FDF and ECS.

Numbers: Example use case FDF: 0.1 -> 0.3; ECS: 0.1 -> 0.4

Practical Use: Have reviewers insert clear, repeated disclaimers and contextual markers so downstream components and users can see what is speculative.

Evidence Ref: Use case (Section 6) and Table 1 preview

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Mean Total Hallucination Score (THS) - FrontEndAgent | -0.004919 | n/a | n/a | 310 prompts | Mean THS1 reported in Table 2 | Table 2
Mean Total Hallucination Score (THS) - SecondLevelReviewer | -0.045565 | -0.004919 (THS1) | -0.040646 | 310 prompts | Mean THS2 reported in Table 2 (improvement over THS1) | Table 2
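The stage-to-stage deltas can be checked from the three mean THS values quoted in this summary. The sketch below uses `Decimal` so the six-digit figures subtract exactly; the label `ThirdLevelReviewer` is an assumed name for the third reviewer stage, whose mean comes from Key Findings.

```python
from decimal import Decimal

# Mean THS per stage as quoted in this summary.
MEAN_THS = [
    ("FrontEndAgent", Decimal("-0.004919")),
    ("SecondLevelReviewer", Decimal("-0.045565")),
    ("ThirdLevelReviewer", Decimal("-0.139597")),  # stage label is assumed
]


def stage_deltas(stages):
    """Return (previous stage, current stage, current - previous) for each hop."""
    return [
        (prev[0], cur[0], cur[1] - prev[1])
        for prev, cur in zip(stages, stages[1:])
    ]


for prev_name, cur_name, delta in stage_deltas(MEAN_THS):
    print(f"{prev_name} -> {cur_name}: {delta}")

total = MEAN_THS[-1][1] - MEAN_THS[0][1]
print(f"FrontEndAgent -> ThirdLevelReviewer: {total}")
```

The first hop reproduces the -0.040646 delta reported for the second-level reviewer; the second hop and the end-to-end difference follow from the same subtraction.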

What To Try In 7 Days

Add one automated reviewer agent that inserts explicit disclaimers for speculative responses.

Adopt a lightweight JSON whisper field (OVON-style) to pass a 1–2 sentence hallucination summary to downstream modules.

Compute simple counts: FCD and FDF per 100 words to monitor hallucination trends in your outputs.
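As a starting point for the per-100-words monitoring suggested above, the following sketch uses a crude lexical proxy: sentences containing a hedge marker count toward FDF, and all other sentences count as unhedged claims toward FCD. This is not the paper's KPI agent, which relies on an LLM evaluator; the marker list, the sentence splitter, and the function names are all assumptions that you would tune for your own outputs.

```python
import re

# Hypothetical marker list; extend it for your domain.
HEDGE_MARKERS = ("fictional", "speculative", "hypothetical", "no evidence", "unverified")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def per_100_words(count: int, text: str) -> float:
    words = len(text.split())
    return 100.0 * count / words if words else 0.0


def crude_kpis(text: str) -> dict[str, float]:
    """Approximate FCD and FDF as sentence counts per 100 words."""
    sentences = split_sentences(text)
    hedged = [s for s in sentences if any(m in s.lower() for m in HEDGE_MARKERS)]
    unhedged = len(sentences) - len(hedged)
    return {
        "fcd_per_100_words": per_100_words(unhedged, text),
        "fdf_per_100_words": per_100_words(len(hedged), text),
    }


sample = "The moon is made of cheese. This claim is fictional and speculative."
print(crude_kpis(sample))
```

Because the proxy is purely lexical, it is best used to track trends across many outputs rather than to judge any single response.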

Agent Features

Memory
Short-term conversational context passed via JSON whisper fields
Planning
Sequential review with role-specific instructions
Tool Use
OVON conversation envelopes for handoffs; LLM-based reviewer agents
Frameworks
OVON, Autogen
Is Agentic

Yes

Architectures
OVON JSON-based multi-agent pipeline; Autogen orchestration
Collaboration
Natural-language metadata (utterance + whisper context/value) shared across agents
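The handoff pattern described here, an utterance plus a whisper carrying context and value, can be sketched as a small JSON envelope that one agent builds and the next one reads. The field layout below is a simplified, OVON-style illustration and not the exact OVON envelope schema; the helper names, the agent names, and the stubbed reviewer (which stands in for an LLM call) are assumptions.

```python
import json


def build_envelope(sender: str, utterance: str, whisper_context: str, whisper_value: str) -> dict:
    """Build a simplified OVON-style envelope (illustrative field layout)."""
    return {
        "ovon": {
            "sender": {"from": sender},
            "events": [
                {"eventType": "utterance", "parameters": {"text": utterance}},
                {
                    "eventType": "whisper",
                    "parameters": {"context": whisper_context, "value": whisper_value},
                },
            ],
        }
    }


def read_event(envelope: dict, event_type: str) -> dict:
    """Return the parameters of the first event with the given type."""
    for event in envelope["ovon"]["events"]:
        if event["eventType"] == event_type:
            return event["parameters"]
    raise KeyError(event_type)


def second_level_review(envelope: dict) -> dict:
    """Stub reviewer: prepend a disclaimer based on the incoming whisper.

    A real reviewer would call an LLM here; this stub only shows the data flow.
    """
    utterance = read_event(envelope, "utterance")["text"]
    whisper = read_event(envelope, "whisper")
    revised = f"Disclaimer: {whisper['value']} {utterance}"
    return build_envelope(
        sender="second-level-reviewer",
        utterance=revised,
        whisper_context=whisper["context"],
        whisper_value="Disclaimer added; check the remaining claims for grounding.",
    )


front = build_envelope(
    sender="front-end-agent",
    utterance="Dragons ruled medieval Paris.",
    whisper_context="hallucination-review",
    whisper_value="The claim about dragons is fictional.",
)
reviewed = second_level_review(front)
print(json.dumps(reviewed, indent=2))
```

Keeping the whisper to a sentence or two means each downstream agent receives a targeted hint without having to reprocess the full conversation.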

Optimization Features

System Optimization
Use of concise JSON metadata to avoid reprocessing full context

Risks & Boundaries

Limitations

Relies on proprietary LLMs; internal reasoning paths are opaque.

KPIs are proxy indicators (stylistic and lexical), not ground-truth fact checks.

When Not To Use

When you need authoritative fact verification backed by external sources rather than framing or disclaimers.

When regulatory audits require traceable, source-backed claims instead of stylistic mitigation.

Failure Modes

Reviewer agents may repeat or amplify false claims if metadata is ambiguous.

Purely fictional prompts resist grounding; mitigation will mainly add disclaimers, not facts.

Core Entities

Models

gpt-3.5-turbo, gpt-4o, gpt-o1

Metrics

FCD, FGR, FDF, ECS, THS

Datasets

310 hallucination-inducing prompts (authors' prompt set); pipeline_results_with_ths.csv

Context Entities

Models

Llama3-70b, GPT-4 variants