Use multi-agent pipelines and OVON JSON handoffs to lower LLM hallucinations

Overview

Decision Snapshot: Needs Validation

The approach is a practical prototype: clear gains in KPI scores across 310 prompts, but results depend on prompt type and proprietary LLM behavior.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 50%

Novelty: 60%

Authors

Diego Gosmar, Deborah A. Dahl

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Structured multi-agent review with short metadata handoffs reduces misleading AI outputs and makes speculative content overt, which is useful for customer-facing assistants, regulated domains, and brand safety.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors build a 3-stage agent pipeline (plus a 4th KPI evaluator) that exchanges structured OVON JSON messages to detect and reframe hallucinated text. They run 310 prompts engineered to induce hallucinations. A KPI agent measures four new metrics (FCD, FGR, FDF, ECS) and computes a Total Hallucination Score (THS). Mean THS moves from -0.0049 (front-end) to -0.0456 (2nd reviewer) to -0.1396 (3rd reviewer); a more negative THS means fewer hallucination indicators, so each review stage shows a further reduction. The pipeline is a practical pattern: have reviewers insert disclaimers and send lightweight meta-information so later agents can target fixes without losing context.

Problem Statement

Generative LLMs often produce confident but false or speculative claims (hallucinations). The paper asks whether orchestrated, specialized agents exchanging natural-language JSON metadata can detect, flag, and reframe hallucinations reliably and measurably across many prompts.

Main Contribution

Design and run a 3-stage agent pipeline plus a 4th KPI-evaluator that uses OVON JSON messages to pass natural-language metadata between agents.

Introduce four practical hallucination KPIs: Factual Claim Density (FCD), Factual Grounding References (FGR), Fictional Disclaimer Frequency (FDF), and Explicit Contextualization Score (ECS).
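The four KPIs feed a single Total Hallucination Score. This summary does not state the formula, so the sketch below is an assumption: it subtracts the weighted mitigation signals (FGR, FDF, ECS) from the weighted claim density (FCD), which at least matches the reported direction in which a more negative THS is better. The equal weights and the `KpiScores` container are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KpiScores:
    fcd: float  # Factual Claim Density
    fgr: float  # Factual Grounding References
    fdf: float  # Fictional Disclaimer Frequency
    ecs: float  # Explicit Contextualization Score


def total_hallucination_score(k, weights=(0.25, 0.25, 0.25, 0.25)):
    """Assumed form of THS: weighted FCD minus weighted FGR, FDF and ECS.

    The weights are placeholders, not values taken from the paper.
    """
    w_fcd, w_fgr, w_fdf, w_ecs = weights
    mitigation = w_fgr * k.fgr + w_fdf * k.fdf + w_ecs * k.ecs
    return w_fcd * k.fcd - mitigation


# Same claim density, but the reviewed answer carries more disclaimers
# and more contextual framing, so its score is lower.
draft = KpiScores(fcd=0.4, fgr=0.1, fdf=0.1, ecs=0.1)
reviewed = KpiScores(fcd=0.4, fgr=0.1, fdf=0.3, ecs=0.4)
draft_ths = total_hallucination_score(draft)
reviewed_ths = total_hallucination_score(reviewed)
```

Under these assumptions, raising FDF and ECS while FCD stays fixed pushes the score downward, which is the pattern the reviewer stages are reported to produce.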

Key Findings

Multi-agent review reduced mean THS across 310 prompts from -0.004919 (front-end) to -0.139597 (third reviewer).

Numbers: THS mean: -0.004919 -> -0.139597

Practical Use: Adding two reviewer agents that exchange OVON JSON metadata can measurably reduce hallucination indicators; implement reviewer stages to lower risky claims.

Evidence Ref: Table 2; Figure 2

Second- and third-level reviewers progressively increase explicit disclaimers and contextual framing as measured by FDF and ECS.

Numbers: Example use case FDF: 0.1 -> 0.3; ECS: 0.1 -> 0.4

Practical Use: Have reviewers insert clear, repeated disclaimers and contextual markers so downstream components and users can see what is speculative.

Evidence Ref: Use case (Section 6) and Table 1 preview

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Mean Total Hallucination Score (THS) - FrontEndAgent	-0.004919	—	—	310 prompts	Mean THS1 reported in Table 2	Table 2
Mean Total Hallucination Score (THS) - SecondLevelReviewer	-0.045565	-0.004919 (THS1)	-0.040646	310 prompts	Mean THS2 reported in Table 2 (improvement over THS1)	Table 2
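The stage-to-stage deltas can be recomputed from the three mean THS values reported in this summary. The snippet below is a small arithmetic check, not the authors' evaluation code; the key `ThirdLevelReviewer` is a label chosen here for the third reviewer stage.

```python
# Mean THS per stage, as reported for the 310 prompts.
mean_ths = {
    "FrontEndAgent": -0.004919,
    "SecondLevelReviewer": -0.045565,
    "ThirdLevelReviewer": -0.139597,  # label assumed for the third reviewer
}


def stage_deltas(means):
    """Return the change in mean THS between each pair of consecutive stages."""
    names = list(means)
    return {
        f"{prev} -> {curr}": round(means[curr] - means[prev], 6)
        for prev, curr in zip(names, names[1:])
    }


deltas = stage_deltas(mean_ths)
total_change = round(mean_ths["ThirdLevelReviewer"] - mean_ths["FrontEndAgent"], 6)
```

The first delta reproduces the -0.040646 shown in the table. The second step, from the second to the third reviewer, works out to -0.094032, for a total change of -0.134678.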

What To Try In 7 Days

Add one automated reviewer agent that inserts explicit disclaimers for speculative responses.

Adopt a lightweight JSON whisper field (OVON-style) to pass a 1–2 sentence hallucination summary to downstream modules.

Compute simple counts: FCD and FDF per 100 words to monitor hallucination trends in your outputs.
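A rough way to start with the per-100-word counts is plain string matching. The sketch below is a crude proxy only: the marker list, the sentence splitter, and the rule that any sentence without a marker counts as an unhedged claim are all assumptions made for illustration, not the paper's FCD or FDF definitions.

```python
import re

# Hypothetical disclaimer markers; tune the list to your own domain.
DISCLAIMER_MARKERS = (
    "fictional",
    "speculative",
    "hypothetical",
    "no evidence",
    "cannot be verified",
)


def per_100_words(count, text):
    """Normalize a raw count by the length of the text in words."""
    n_words = len(text.split())
    return 100.0 * count / n_words if n_words else 0.0


def disclaimer_rate(text):
    """FDF-style proxy: disclaimer markers per 100 words."""
    lowered = text.lower()
    hits = sum(lowered.count(marker) for marker in DISCLAIMER_MARKERS)
    return per_100_words(hits, text)


def unhedged_claim_rate(text):
    """FCD-style proxy: sentences with no disclaimer marker, per 100 words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    unhedged = sum(
        1 for s in sentences
        if not any(marker in s.lower() for marker in DISCLAIMER_MARKERS)
    )
    return per_100_words(unhedged, text)


sample = "The lost city was found in 1820. This account is fictional."
fdf_proxy = disclaimer_rate(sample)
fcd_proxy = unhedged_claim_rate(sample)
```

Tracking both numbers over time shows whether reviewed outputs carry more disclaimers relative to unhedged statements, even though neither number checks facts.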

Agent Features

Memory

Short-term conversational context passed via JSON whisper fields

Planning

Sequential review with role-specific instructions
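Sequential review can be pictured as a chain of functions, each receiving the previous stage's text along with the notes left by earlier reviewers. The sketch below uses plain Python stubs in place of LLM calls and does not use Autogen; the function names and the wording of the notes are hypothetical.

```python
def front_end_agent(text, notes):
    # Stand-in for the first LLM call, which answers the prompt directly.
    return f"Draft answer: {text}", notes


def second_level_reviewer(text, notes):
    # Stand-in reviewer: flags the draft and records why for the next stage.
    flagged = f"[Speculative] {text}"
    return flagged, notes + ["Draft contains unverified claims; disclaimer added."]


def third_level_reviewer(text, notes):
    # Stand-in reviewer: uses the upstream notes to add explicit context.
    context = " ".join(notes)
    return f"{text} (Reviewer context: {context})", notes + ["Context appended."]


def run_pipeline(prompt, stages):
    """Pass the text and accumulated reviewer notes through each stage in order."""
    text, notes = prompt, []
    for stage in stages:
        text, notes = stage(text, notes)
    return text, notes


final_text, review_notes = run_pipeline(
    "Describe the moon city.",
    [front_end_agent, second_level_reviewer, third_level_reviewer],
)
```

Keeping each role in its own function mirrors the role-specific instructions: a stage can be swapped for a real model call without changing the loop.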

Tool Use

OVON conversation envelopes for handoffs

LLM-based reviewer agents

Frameworks

OVON, Autogen

Is Agentic

Yes

Architectures

OVON JSON-based multi-agent pipeline

Autogen orchestration

Collaboration

Natural-language metadata (utterance + whisper context/value) shared across agents
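A handoff of this kind can be sketched as a small JSON message that pairs the visible utterance with a whisper carrying context and value. The layout below is a simplified illustration in the spirit of an OVON envelope, not the full specification; the exact field nesting and the helper names are assumptions.

```python
import json


def build_handoff(sender, utterance, whisper_context, whisper_value):
    """Build a simplified OVON-style message (illustrative field layout)."""
    return {
        "ovon": {
            "sender": {"from": sender},
            "events": [
                {"eventType": "utterance", "parameters": {"text": utterance}},
                {
                    "eventType": "whisper",
                    "parameters": {
                        "context": whisper_context,
                        "value": whisper_value,
                    },
                },
            ],
        }
    }


def read_whisper(message):
    """Return the first whisper's parameters, or None if there is no whisper."""
    for event in message["ovon"]["events"]:
        if event["eventType"] == "whisper":
            return event["parameters"]
    return None


handoff = build_handoff(
    sender="second_level_reviewer",
    utterance="[Speculative] The moon city is a fictional setting.",
    whisper_context="Prompt asks about a place with no factual basis.",
    whisper_value="Disclaimer added; no verifiable sources found.",
)

# Serialize for transport, then parse it back as the next agent would.
wire = json.dumps(handoff)
received = read_whisper(json.loads(wire))
```

The next reviewer reads only the short whisper to decide what to fix, which is how concise metadata avoids reprocessing the full context.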

Optimization Features

System Optimization

Use of concise JSON metadata to avoid reprocessing full context

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/diegogosmar/hall_evaluator

https://github.com/diegogosmar/hall_evaluator/blob/main/pipeline_results_with_ths.csv

Data URLs

https://github.com/diegogosmar/hall_evaluator/blob/main/pipeline_results_with_ths.csv

Risks & Boundaries

Limitations

Relies on proprietary LLMs; internal reasoning paths are opaque.

KPIs are proxy indicators (stylistic and lexical), not ground-truth fact checks.

When Not To Use

If you need authoritative fact verification backed by external sources rather than framing or disclaimers.

When regulatory audits require traceable, source-backed claims instead of stylistic mitigation.

Failure Modes

Reviewer agents may repeat or amplify false claims if metadata is ambiguous.

Purely fictional prompts resist grounding; mitigation will mainly add disclaimers, not facts.

Core Entities

Models

gpt-3.5-turbo, gpt-4o, gpt-o1

Metrics

FCD, FGR, FDF, ECS, THS

Datasets

310 hallucination-inducing prompts (authors' prompt set)

pipeline_results_with_ths.csv

Context Entities

Models

Llama3-70b, GPT-4 variants
