Let two agents use different retrieval tools and iteratively query the web to cut hallucinations in fact-checking

Overview

Decision Snapshot: Needs Validation

The method is conceptually straightforward and effective on benchmarks, but it increases latency and API costs due to multiple LLM calls and external retrievals.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang

Links

Abstract / PDF / Data

Why It Matters For Business

If your product relies on factual outputs from LLMs, combining complementary retrieval tools and round‑wise grounding checks reduces hallucinations and raises verification accuracy with moderate engineering cost.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, Data Scientist, CTO

Summary TLDR

Tool-MAD is a multi-agent debate system where two agents use different external tools (a RAG module over a static corpus and a live web search API) and iteratively rewrite queries during a multi-round debate. Each answer is scored for faithfulness (grounding in retrieved documents) and answer relevance; a Judge agent uses these stability scores plus debate history to pick the final label. On four fact‑verification benchmarks and two medical QA sets, Tool‑MAD improves exact-match accuracy versus prior debate systems (reported up to +35% vs MAD and +5.5% vs MADKE on evaluated datasets). The framework is most helpful when evidence sources are complementary, query rewriting is enabled, and three

Problem Statement

Large language models still hallucinate in fact verification. Prior multi-agent debates either use only internal knowledge or a fixed external evidence pool, which makes them brittle when new claims or counterarguments appear during discussion. The paper asks: can heterogeneous tools plus iterative query rewriting and round-level grounding scores reduce hallucination and raise verification accuracy?

Main Contribution

Tool-MAD: a multi-agent debate framework that assigns different external tools to agents (RAG over a static corpus vs live Search API) and allows adaptive retrieval across debate rounds.

Adaptive query formulation: agents iteratively rewrite queries based on opponent answers to fetch new, targeted evidence during the debate.
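The adaptive query formulation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Agent` class, `rewrite_query`, and the two stub retrievers are hypothetical names, and the string-based rewrite stands in for what would be an LLM-driven reformulation in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Agent:
    name: str
    retrieve: Callable[[str], List[str]]  # hypothetical tool: query -> documents
    queries: List[str] = field(default_factory=list)


def rewrite_query(claim: str, opponent_answer: str) -> str:
    # Assumption: a real system would prompt an LLM here; this stub only
    # appends the opponent's answer so the next retrieval targets it.
    return f"{claim} (respond to: {opponent_answer})"


def debate_round(claim: str, agent: Agent, opponent_answer: Optional[str]) -> List[str]:
    # Round 1 uses the raw claim; later rounds use a rewritten query.
    query = claim if opponent_answer is None else rewrite_query(claim, opponent_answer)
    agent.queries.append(query)
    return agent.retrieve(query)


# Stub tools standing in for RAG over a static corpus and a live search API.
def rag_lookup(query: str) -> List[str]:
    return [f"corpus passage for: {query}"]


def web_search(query: str) -> List[str]:
    return [f"web result for: {query}"]


rag_agent = Agent("rag", rag_lookup)
search_agent = Agent("search", web_search)
```

Each agent keeps its own query log, so the effect of rewriting on retrieved evidence can be inspected round by round.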

Key Findings

Tool-MAD improves fact-verification accuracy over prior debate systems.

Numbers: Up to +35.0% vs MAD and +5.5% vs MADKE on evaluated benchmarks

Practical Use: Use heterogeneous retrieval tools and iterative debate to boost verification accuracy over single-tool or fixed-evidence debate systems.

Evidence Ref: Abstract; Conclusion; Main results (Table III)

Adaptive query rewriting helps retrieval and accuracy.

Numbers: FEVER +2.0, FEVEROUS +2.5, AVeriTeC +1.0 when using query formulation

Practical Use: Allow agents to reformulate queries mid-debate to surface missing or more specific evidence.

Evidence Ref: Ablation on query formulation (Section IV-E; Fig. 9)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match (average)	71.0 (Tool-MAD, GPT-4o group average)	MADKE average 68.0; MAD average 52.9	+3.0 vs MADKE; +18.1 vs MAD (average)	FEVER, FEVEROUS, FAVIQ, AVERITEC (avg)	Table III	Table III
Exact Match (average)	74.0 (Tool-MAD, Llama-3.3-70B group average)	MADKE average 56.5; MAD average 45.9 (same group)	+17.5 vs MADKE; +28.1 vs MAD (average)	FEVER, FEVEROUS, FAVIQ, AVERITEC (avg)	Table III	Table III
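The deltas in the table read as absolute differences in average Exact Match points. A quick check that uses only the averages reported above (the dictionary layout and the `deltas` helper are illustrative):

```python
# Average Exact Match per model group, copied from the Results table.
results = {
    "GPT-4o": {"Tool-MAD": 71.0, "MADKE": 68.0, "MAD": 52.9},
    "Llama-3.3-70B": {"Tool-MAD": 74.0, "MADKE": 56.5, "MAD": 45.9},
}


def deltas(group: dict) -> dict:
    """Absolute EM gain of Tool-MAD over each baseline, rounded to 0.1."""
    ours = group["Tool-MAD"]
    return {name: round(ours - score, 1)
            for name, score in group.items() if name != "Tool-MAD"}


for model, group in results.items():
    print(model, deltas(group))
```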

What To Try In 7 Days

Run a two‑agent pipeline: one RAG over your corpus and one web search API to compare outputs on 200 representative claims.

Add faithfulness + answer‑relevance scoring per output and reject low‑scoring answers to reduce false positives.

Enable one round of query rewriting based on counterarguments and measure EM/accuracy lift versus single-pass retrieval.
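The second and third steps above can be prototyped with a small evaluation harness. The sketch below assumes faithfulness and answer-relevance scores in [0, 1] are already computed (for example with RAGAS); the 0.5 thresholds and the helper names are placeholders to tune on your own data, not values from the paper.

```python
from typing import List, Optional


def accept(faithfulness: float, relevance: float,
           min_faith: float = 0.5, min_rel: float = 0.5) -> bool:
    # Assumption: simple per-metric thresholds; both must pass.
    return faithfulness >= min_faith and relevance >= min_rel


def exact_match(preds: List[Optional[str]], golds: List[str]) -> float:
    """Percent of predictions equal to the gold label (case-insensitive).
    A rejected answer (None) counts as a miss."""
    hits = sum(p is not None and p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)


def em_lift(single_pass: List[Optional[str]],
            with_rewrite: List[Optional[str]],
            golds: List[str]) -> float:
    """EM gain (in points) of the query-rewriting run over single-pass retrieval."""
    return exact_match(with_rewrite, golds) - exact_match(single_pass, golds)
```

Counting rejected answers as misses keeps the comparison conservative; an alternative is to report accuracy on accepted answers together with the rejection rate.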

Agent Features

Memory

short-term debate history (per-claim rounds)

Planning

iterative query rewriting across rounds

Tool Use

RAG (vector retrieval), live Search API, document summarization (in PubMedQA pipeline)

Frameworks

RAGAS metrics (faithfulness, answer relevance)

Is Agentic

Yes

Architectures

multi-agent debate, separate RAG and Search agents, Judge aggregator

Collaboration

adversarial/collaborative debate with Judge resolution
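To make the Judge's role concrete, here is a deliberately simplified stand-in. In Tool-MAD the Judge is an agent that reads the stability scores plus the debate history; the sketch below replaces it with a deterministic rule purely for illustration. The unweighted mean used as the stability score and all names here are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredAnswer:
    agent: str
    label: str
    faithfulness: float  # grounding in retrieved documents, in [0, 1]
    relevance: float     # answer relevance, in [0, 1]

    @property
    def stability(self) -> float:
        # Assumption: unweighted mean of the two scores.
        return (self.faithfulness + self.relevance) / 2


def judge(final_round: List[ScoredAnswer]) -> str:
    """Return the agreed label, or the label of the most stable answer."""
    labels = {a.label for a in final_round}
    if len(labels) == 1:
        return labels.pop()
    return max(final_round, key=lambda a: a.stability).label
```

A rule like this is mainly useful as a baseline: if an LLM Judge does not beat it on your data, the extra call may not be worth the cost.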

Optimization Features

Token Efficiency

three-round cap to limit extra LLM calls

Infra Optimization

use of Milvus vector DB for fast semantic search

Inference Optimization

early termination when agents reach consensus
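The round cap and early termination amount to simple loop control. A minimal sketch, assuming each agent is a callable that returns a label for a given round (hypothetical interface) and that consensus means identical labels:

```python
from typing import Callable, List, Tuple

MAX_ROUNDS = 3  # three-round cap, as listed under Token Efficiency


def run_debate(answer_a: Callable[[int], str],
               answer_b: Callable[[int], str],
               max_rounds: int = MAX_ROUNDS) -> Tuple[List[Tuple[str, str]], bool]:
    """Run up to max_rounds; stop early once both agents give the same label.
    Returns the per-round history and whether consensus was reached."""
    history: List[Tuple[str, str]] = []
    for round_idx in range(1, max_rounds + 1):
        a, b = answer_a(round_idx), answer_b(round_idx)
        history.append((a, b))
        if a == b:  # consensus: skip the remaining rounds
            return history, True
    return history, False
```

With two agents, each skipped round saves at least two agent calls plus the associated retrievals, which is where most of the latency and cost of the method comes from.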

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

FEVER, FEVEROUS, FAVIQ, AVERITEC, MEDQA, PubMedQA

Risks & Boundaries

Limitations

Higher inference cost and latency: multi-round debates trigger multiple LLM calls per claim.

Experiments use 200 sampled instances per dataset; large-scale variability is untested.

When Not To Use

If strict latency or low-cost inference is required (real-time apps).

When only a single trusted, high-precision data source is available (RAG alone may suffice).

Failure Modes

Tool disagreement on recency: web search may contradict label timestamps and cause unstable outcomes.

Over-debate speculation: extra rounds can amplify unsupported inferences and slightly reduce accuracy beyond round 3.

Core Entities

Models

GPT-4o-mini, GPT-4o, Llama-3.3-70B-InstructTurbo, DeepseekR1

Metrics

Exact Match, Accuracy, Faithfulness, Answer Relevance, Stability Score

Datasets

FEVER, FEVEROUS, FAVIQ, AVERITEC, MEDQA, PubMedQA

