Use multi-agent pipelines and OVON JSON handoffs to lower LLM hallucinations

January 19, 2025 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

3

Authors

Diego Gosmar, Deborah A. Dahl

Why It Matters For Business

Structured multi-agent review with short metadata handoffs reduces misleading AI outputs and makes speculative content explicit, which is useful for customer-facing assistants, regulated domains, and brand safety.

Summary TLDR

The authors build a three-stage agent pipeline (plus a fourth KPI-evaluator agent) that exchanges structured OVON JSON messages to detect and reframe hallucinated text. They run 310 prompts engineered to induce hallucinations. The KPI agent measures four new metrics (FCD, FGR, FDF, ECS) and computes a Total Hallucination Score (THS). Mean THS falls from -0.0049 (front-end) to -0.0456 (second reviewer) to -0.1396 (third reviewer); a lower THS means fewer hallucination indicators, so each stage shows a further reduction. The pipeline offers a practical pattern: have reviewers insert disclaimers and pass lightweight meta-information so later agents can target fixes without losing context.

Problem Statement

Generative LLMs often produce confident but false or speculative claims (hallucinations). The paper asks whether orchestrated, specialized agents exchanging natural-language JSON metadata can detect, flag, and reframe hallucinations reliably and measurably across many prompts.

Main Contribution

Design and run a 3-stage agent pipeline plus a 4th KPI-evaluator that uses OVON JSON messages to pass natural-language metadata between agents.

Introduce four practical hallucination KPIs: Factual Claim Density (FCD), Factual Grounding References (FGR), Fictional Disclaimer Frequency (FDF), and Explicit Contextualization Score (ECS).

Empirical test on 310 hallucination-inducing prompts showing progressive reductions in a computed Total Hallucination Score (THS) after each reviewer stage.
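The staged review described above can be sketched as a chain of functions that each receive the previous agent's text plus its notes. This is a minimal illustration, not the authors' implementation: the `Handoff` class, the stub agents, and the wording of the disclaimers are all hypothetical stand-ins for real LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Handoff:
    """Hypothetical container: the visible text plus reviewer notes."""
    utterance: str
    whispers: List[str] = field(default_factory=list)


Reviewer = Callable[[Handoff], Handoff]


def front_end(prompt: str) -> Handoff:
    # Stand-in for the front-end LLM call (assumption: no real model here).
    return Handoff(utterance=f"Answer to: {prompt}")


def second_level(msg: Handoff) -> Handoff:
    # Stub reviewer: prepend a disclaimer and leave a note for the next agent.
    note = "Claim is unverified; disclaimer added."
    return Handoff("Note: the following is speculative. " + msg.utterance,
                   msg.whispers + [note])


def third_level(msg: Handoff) -> Handoff:
    # Stub reviewer: add contextual framing on top of the earlier disclaimer.
    note = "Disclaimer present; added explicit context."
    return Handoff(msg.utterance + " (No supporting source was found.)",
                   msg.whispers + [note])


def run_pipeline(prompt: str, reviewers: List[Reviewer]) -> Handoff:
    """Run the front-end agent, then each reviewer in order."""
    msg = front_end(prompt)
    for review in reviewers:
        msg = review(msg)
    return msg


result = run_pipeline("Describe the city of X", [second_level, third_level])
print(result.utterance)
print(result.whispers)
```

Keeping the notes in a separate list means each reviewer can see what earlier stages flagged without re-reading the whole conversation.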

Key Findings

Multi-agent review reduced mean THS across 310 prompts from -0.004919 (front-end) to -0.139597 (third reviewer).

Numbers: THS mean: -0.004919 -> -0.139597

Second- and third-level reviewers progressively increase explicit disclaimers and contextual framing as measured by FDF and ECS.

Numbers: Example use case FDF: 0.1 -> 0.3; ECS: 0.1 -> 0.4

The mitigation effect varies by prompt: some prompts with plausible grounding saw large THS declines while pure fantasy prompts saw modest declines (e.g., 700% vs 33%).

Numbers: Prompt 4 reduction ~700%; Prompt 56 reduction ~33%

Results

Mean Total Hallucination Score (THS) - FrontEndAgent

Value: -0.004919

Mean Total Hallucination Score (THS) - SecondLevelReviewer

Value: -0.045565

Baseline: -0.004919 (THS1)

Mean Total Hallucination Score (THS) - ThirdLevelReviewer

Value: -0.139597

Baseline: -0.045565 (THS2)
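To put the three means side by side, the relative change between stages can be worked out from the values above. The sketch below assumes the change is measured against the absolute value of the baseline; that convention and the `relative_change` helper are illustrative choices, not something this summary specifies.

```python
def relative_change(baseline: float, value: float) -> float:
    """Change from baseline to value, as a fraction of |baseline|.

    Assumption: dividing by the absolute baseline keeps the sign
    meaningful when both scores are negative.
    """
    return (value - baseline) / abs(baseline)


# Mean THS per stage, taken from the Results section.
ths = {
    "front_end": -0.004919,
    "second_reviewer": -0.045565,
    "third_reviewer": -0.139597,
}

stage2 = relative_change(ths["front_end"], ths["second_reviewer"])
stage3 = relative_change(ths["second_reviewer"], ths["third_reviewer"])
overall = relative_change(ths["front_end"], ths["third_reviewer"])

for name, change in [("stage 1 -> 2", stage2),
                     ("stage 2 -> 3", stage3),
                     ("stage 1 -> 3", overall)]:
    print(f"{name}: {change:+.1%}")
```

Under this convention the changes come out at roughly -826%, -206%, and -2,738%. The front-end mean is very close to zero, so the percentages are large even though the absolute shift in THS is about 0.13.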

What To Try In 7 Days

Add one automated reviewer agent that inserts explicit disclaimers for speculative responses.

Adopt a lightweight JSON whisper field (OVON-style) to pass a 1–2 sentence hallucination summary to downstream modules.

Compute simple counts: FCD and FDF per 100 words to monitor hallucination trends in your outputs.
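The counting step above can be prototyped with a few regular expressions. This is a rough sketch under stated assumptions: the phrase list in `DISCLAIMER_PATTERNS` is invented for illustration, and treating every sentence without a disclaimer marker as a "claim" is a crude proxy for FCD, not the paper's definition.

```python
import re

# Hypothetical disclaimer markers; tune this list to your own outputs.
DISCLAIMER_PATTERNS = [
    r"\bfictional\b",
    r"\bspeculative\b",
    r"\bhypothetical\b",
    r"\bcannot be verified\b",
]


def _has_disclaimer(text: str) -> bool:
    return any(re.search(p, text, flags=re.IGNORECASE)
               for p in DISCLAIMER_PATTERNS)


def per_100_words(count: int, text: str) -> float:
    """Normalize a raw count to a rate per 100 whitespace-separated words."""
    words = len(text.split())
    return 100.0 * count / words if words else 0.0


def fdf_per_100(text: str) -> float:
    """Disclaimer phrase hits per 100 words."""
    hits = sum(len(re.findall(p, text, flags=re.IGNORECASE))
               for p in DISCLAIMER_PATTERNS)
    return per_100_words(hits, text)


def fcd_per_100(text: str) -> float:
    """Crude proxy: sentences with no disclaimer marker, per 100 words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    claims = sum(1 for s in sentences if not _has_disclaimer(s))
    return per_100_words(claims, text)


sample = "The tower is 900 meters tall. This account is fictional."
print(fcd_per_100(sample), fdf_per_100(sample))
```

Tracking both rates over time shows whether a reviewer agent is actually adding disclaimers or only rewording claims.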

Agent Features

Memory

  • Short-term conversational context passed via JSON whisper fields

Planning

  • Sequential review with role-specific instructions

Tool Use

  • OVON conversation envelopes for handoffs
  • LLM-based reviewer agents

Frameworks

  • OVON
  • Autogen

Is Agentic

true

Architectures

  • OVON JSON-based multi-agent pipeline
  • Autogen orchestration

Collaboration

  • Natural-language metadata (utterance + whisper context/value) shared across agents
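A handoff that carries an utterance plus a whisper context/value pair might look like the sketch below. The field layout is a simplified assumption made for illustration; the actual OVON conversation envelope specification defines a richer structure, so check it for the exact field names. `build_envelope`, `read_whisper`, and the example URL are hypothetical.

```python
import json
import uuid
from typing import Optional


def build_envelope(sender: str, utterance: str,
                   whisper_context: str, whisper_value: str,
                   conversation_id: Optional[str] = None) -> dict:
    """Build a simplified OVON-style message (assumed layout, not the spec)."""
    return {
        "ovon": {
            "conversation": {"id": conversation_id or str(uuid.uuid4())},
            "sender": {"from": sender},
            "events": [
                # The text intended for the user or the next agent.
                {"eventType": "utterance",
                 "parameters": {"text": utterance}},
                # Side-channel note for the downstream reviewer only.
                {"eventType": "whisper",
                 "parameters": {"context": whisper_context,
                                "value": whisper_value}},
            ],
        }
    }


def read_whisper(envelope: dict) -> Optional[dict]:
    """Return the first whisper's parameters, or None if there is none."""
    for event in envelope["ovon"]["events"]:
        if event["eventType"] == "whisper":
            return event["parameters"]
    return None


envelope = build_envelope(
    sender="https://example.com/second-level-reviewer",  # placeholder URL
    utterance="Note: the following account is speculative. ...",
    whisper_context="Prompt asks about an unverifiable event.",
    whisper_value="Low confidence; keep disclaimer and add context.",
)
wire = json.dumps(envelope)      # what is sent to the next agent
received = json.loads(wire)      # what the next agent parses
print(read_whisper(received))
```

Because the whisper is a one- or two-sentence summary, the next reviewer can act on it without reprocessing the full conversation.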

Optimization Features

System Optimization

  • Use of concise JSON metadata to avoid reprocessing full context

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on proprietary LLMs; internal reasoning paths are opaque.
  • KPIs are proxy indicators (stylistic and lexical), not ground-truth fact checks.
  • Spot manual checks were limited; full human verification was not performed.

When Not To Use

  • If you need authoritative fact verification backed by external sources rather than framing or disclaimers.
  • When regulatory audits require traceable, source-backed claims instead of stylistic mitigation.

Failure Modes

  • Reviewer agents may repeat or amplify false claims if metadata is ambiguous.
  • Purely fictional prompts resist grounding; mitigation will mainly add disclaimers, not facts.
  • Very small THS baselines can produce large percentage changes that overstate practical impact.

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4o
  • gpt-o1

Metrics

  • FCD
  • FGR
  • FDF
  • ECS
  • THS

Datasets

  • 310 hallucination-inducing prompts (authors' prompt set)
  • pipeline_results_with_ths.csv

Context Entities

Models

  • Llama3-70b
  • GPT-4 variants