Let two agents use different retrieval tools and iteratively query the web to cut hallucinations in fact-checking

January 8, 2026 · 9 min

Overview

Decision Snapshot: Needs Validation

The method is conceptually straightforward and effective on benchmarks, but it increases latency and API costs due to multiple LLM calls and external retrievals.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang

Why It Matters For Business

If your product relies on factual outputs from LLMs, combining complementary retrieval tools and round‑wise grounding checks reduces hallucinations and raises verification accuracy with moderate engineering cost.

Summary TLDR

Tool-MAD is a multi-agent debate system in which two agents use different external tools (a RAG module over a static corpus and a live web search API) and iteratively rewrite their queries over a multi-round debate. Each answer is scored for faithfulness (grounding in the retrieved documents) and answer relevance; a Judge agent uses these scores, referred to as stability scores, together with the debate history to pick the final label. On four fact-verification benchmarks and two medical QA sets, Tool-MAD improves exact-match accuracy over prior debate systems (reported gains of up to +35.0% vs MAD and +5.5% vs MADKE on the evaluated datasets). The framework is most helpful when evidence sources are complementary, query rewriting is enabled, and three…

Problem Statement

Large language models still hallucinate in fact verification. Prior multi-agent debates either use only internal knowledge or a fixed external evidence pool, which makes them brittle when new claims or counterarguments appear during discussion. The paper asks: can heterogeneous tools plus iterative query rewriting and round-level grounding scores reduce hallucination and raise verification accuracy?

Main Contribution

Tool-MAD: a multi-agent debate framework that assigns different external tools to agents (RAG over a static corpus vs live Search API) and allows adaptive retrieval across debate rounds.

Adaptive query formulation: agents iteratively rewrite queries based on opponent answers to fetch new, targeted evidence during the debate.
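The two contributions above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `Agent`, `rewrite_query`, `answer_from`, `toy_rag`, and `toy_search` are hypothetical names, and the two placeholder functions stand in for steps that the paper performs with an LLM and real retrieval backends.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: a retriever maps a query string to evidence snippets.
Retriever = Callable[[str], List[str]]


@dataclass
class Agent:
    name: str
    retrieve: Retriever  # e.g. RAG over a static corpus, or a live search API
    query: str = ""
    answers: List[str] = field(default_factory=list)


def rewrite_query(claim: str, opponent_answer: str) -> str:
    """Placeholder for the LLM-driven rewrite: fold the opponent's answer into the query."""
    return f"{claim} {opponent_answer}".strip()


def answer_from(evidence: List[str]) -> str:
    """Placeholder for the LLM answer step: label the claim from retrieved snippets."""
    return "SUPPORTS" if evidence else "NOT ENOUGH INFO"


def debate(claim: str, rag_agent: Agent, search_agent: Agent,
           rounds: int = 3) -> List[Tuple[int, str, str, str]]:
    """Run a fixed number of rounds; return (round, agent, query, label) records."""
    history = []
    rag_agent.query = search_agent.query = claim
    for round_no in range(1, rounds + 1):
        for agent in (rag_agent, search_agent):
            evidence = agent.retrieve(agent.query)
            label = answer_from(evidence)
            agent.answers.append(label)
            history.append((round_no, agent.name, agent.query, label))
        # Each agent rewrites its query from the opponent's latest answer.
        rag_agent.query = rewrite_query(claim, search_agent.answers[-1])
        search_agent.query = rewrite_query(claim, rag_agent.answers[-1])
    return history


# Toy retrievers (assumptions for illustration only).
CORPUS = ["Paris is the capital and largest city of France."]


def toy_rag(query: str) -> List[str]:
    return [doc for doc in CORPUS if "Paris" in query and "Paris" in doc]


def toy_search(query: str) -> List[str]:
    return []  # stand-in for a live web search that finds nothing


claim = "Paris is the capital of France"
history = debate(claim, Agent("rag", toy_rag), Agent("search", toy_search), rounds=2)
for record in history:
    print(record)
```

The point of the sketch is the data flow: each agent keeps its own tool, and the query for the next round depends on what the other agent just said. A Judge step would consume `history` afterwards.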

Key Findings

Tool-MAD improves fact-verification accuracy over prior debate systems.

Numbers: Up to +35.0% vs MAD and +5.5% vs MADKE on evaluated benchmarks

Practical Use: Use heterogeneous retrieval tools and iterative debate to boost verification accuracy over single-tool or fixed-evidence debate systems.

Evidence Ref: Abstract; Conclusion; Main results (Table III)

Adaptive query rewriting helps retrieval and accuracy.

Numbers: FEVER +2.0, FEVEROUS +2.5, AVeriTeC +1.0 when using query formulation

Practical Use: Allow agents to reformulate queries mid-debate to surface missing or more specific evidence.

Evidence Ref: Ablation on query formulation (Section IV-E; Fig. 9)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Exact Match (average) | 71.0 (Tool-MAD, GPT-4o group average) | MADKE average 68.0; MAD average 52.9 | +3.0 vs MADKE; +18.1 vs MAD (average) | FEVER, FEVEROUS, FAVIQ, AVERITEC (avg) | Table III | Table III |
| Exact Match (average) | 74.0 (Tool-MAD, Llama-3.3-70B group average) | MADKE average 56.5; MAD average 45.9 (same group) | +17.5 vs MADKE; +28.1 vs MAD (average) | FEVER, FEVEROUS, FAVIQ, AVERITEC (avg) | Table III | Table III |
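The deltas in the rows above are plain differences between group averages, which is easy to re-derive. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout and the `deltas` helper are illustrative choices, not something from the paper.

```python
# Group averages copied from the results rows above (Table III).
results = {
    "GPT-4o": {"Tool-MAD": 71.0, "MADKE": 68.0, "MAD": 52.9},
    "Llama-3.3-70B": {"Tool-MAD": 74.0, "MADKE": 56.5, "MAD": 45.9},
}


def deltas(group: dict) -> dict:
    """Exact-match gain of Tool-MAD over each baseline, in points."""
    ours = group["Tool-MAD"]
    return {name: round(ours - score, 1)
            for name, score in group.items() if name != "Tool-MAD"}


for model, group in results.items():
    print(model, deltas(group))
```

Note that these are differences between four-dataset averages, so they are smaller than the "up to" figures, which refer to the best single case.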

What To Try In 7 Days

Run a two‑agent pipeline: one RAG over your corpus and one web search API to compare outputs on 200 representative claims.

Add faithfulness + answer‑relevance scoring per output and reject low‑scoring answers to reduce false positives.

Enable one round of query rewriting based on counterarguments and measure EM/accuracy lift versus single-pass retrieval.
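The second and third steps can be prototyped before any model is wired in. The sketch below assumes scores in [0, 1] and uses hypothetical thresholds of 0.7; `ScoredAnswer`, `accept`, and `exact_match` are illustrative names, and the sample labels are made up to show the calculation, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredAnswer:
    label: str
    faithfulness: float      # grounding in retrieved documents, assumed in [0, 1]
    answer_relevance: float  # relevance to the claim, assumed in [0, 1]


def accept(answer: ScoredAnswer, min_faith: float = 0.7, min_rel: float = 0.7) -> bool:
    """Keep an answer only if both scores clear their (assumed) thresholds."""
    return answer.faithfulness >= min_faith and answer.answer_relevance >= min_rel


def exact_match(predictions: List[str], gold: List[str]) -> float:
    """Percentage of predictions that equal the gold label exactly."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    hits = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)


# Made-up example data for illustration.
answers = [
    ScoredAnswer("SUPPORTS", faithfulness=0.9, answer_relevance=0.8),
    ScoredAnswer("REFUTES", faithfulness=0.4, answer_relevance=0.9),
]
kept = [a for a in answers if accept(a)]

gold = ["SUPPORTS", "REFUTES", "REFUTES", "SUPPORTS"]
single_pass = ["SUPPORTS", "REFUTES", "SUPPORTS", "REFUTES"]
with_rewrite = ["SUPPORTS", "REFUTES", "REFUTES", "REFUTES"]
lift = exact_match(with_rewrite, gold) - exact_match(single_pass, gold)
print(len(kept), lift)
```

Thresholds should be tuned on your own claims: set too high, the gate rejects correct answers whose evidence is merely paraphrased.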

Agent Features

Memory
short-term debate history (per-claim rounds)
Planning
iterative query rewriting across rounds
Tool Use
RAG (vector retrieval), live Search API, document summarization (in PubMedQA pipeline)
Frameworks
RAGAS metrics (faithfulness, answer relevance)
Is Agentic

Yes

Architectures
multi-agent debate, separate RAG and Search agents, Judge aggregator
Collaboration
adversarial/collaborative debate with Judge resolution

Optimization Features

Token Efficiency
three-round cap to limit extra LLM calls
Infra Optimization
use of Milvus vector DB for fast semantic search
Inference Optimization
early termination when agents reach consensus
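The round cap and the consensus-based early exit combine into a single stopping rule. The function below is a sketch under assumptions: `should_stop` is a hypothetical helper, and "consensus" is taken to mean that both agents gave the same label in the latest round, which may be simpler than the criterion the paper uses.

```python
from typing import List, Tuple


def should_stop(labels_by_round: List[Tuple[str, str]], max_rounds: int = 3) -> bool:
    """Decide whether the debate should end.

    labels_by_round holds one (rag_label, search_label) pair per completed round.
    Stops when the agents agree in the latest round or the round cap is reached.
    """
    if not labels_by_round:
        return False
    rag_label, search_label = labels_by_round[-1]
    agreed = rag_label == search_label
    capped = len(labels_by_round) >= max_rounds
    return agreed or capped


print(should_stop([("SUPPORTS", "REFUTES")]))       # disagreement, round 1 of 3
print(should_stop([("SUPPORTS", "SUPPORTS")]))      # consensus in round 1
print(should_stop([("SUPPORTS", "REFUTES")] * 3))   # cap reached without consensus
```

Checking this rule after every round means that claims on which both tools agree immediately cost one round of calls instead of three.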

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

FEVER, FEVEROUS, FAVIQ, AVERITEC, MEDQA, PubMedQA

Risks & Boundaries

Limitations

Higher inference cost and latency: multi-round debates trigger multiple LLM calls per claim.

Experiments use 200 sampled instances per dataset; large-scale variability is untested.

When Not To Use

When strict latency or low-cost inference is required (e.g., real-time apps).

When only a single trusted, high-precision data source is available (RAG alone may suffice).

Failure Modes

Tool disagreement on recency: web search may contradict label timestamps and cause unstable outcomes.

Over-debate speculation: extra rounds can amplify unsupported inferences and slightly reduce accuracy beyond round 3.

Core Entities

Models

GPT-4o-mini, GPT-4o, Llama-3.3-70B-InstructTurbo, DeepseekR1

Metrics

Exact Match, Accuracy, Faithfulness, Answer Relevance, Stability Score

Datasets

FEVER, FEVEROUS, FAVIQ, AVERITEC, MEDQA, PubMedQA