Overview
The method is conceptually straightforward and effective on benchmarks, but it increases latency and API costs due to multiple LLM calls and external retrievals.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
If your product relies on factual outputs from LLMs, combining complementary retrieval tools and round‑wise grounding checks reduces hallucinations and raises verification accuracy with moderate engineering cost.
Summary TLDR
Tool-MAD is a multi-agent debate system where two agents use different external tools (a RAG module over a static corpus and a live web search API) and iteratively rewrite queries during a multi-round debate. Each answer is scored for faithfulness (grounding in retrieved documents) and answer relevance; a Judge agent uses these stability scores plus debate history to pick the final label. On four fact‑verification benchmarks and two medical QA sets, Tool‑MAD improves exact-match accuracy versus prior debate systems (reported up to +35% vs MAD and +5.5% vs MADKE on evaluated datasets). The framework is most helpful when evidence sources are complementary, query rewriting is enabled, and three
Problem Statement
Large language models still hallucinate in fact verification. Prior multi-agent debates either use only internal knowledge or a fixed external evidence pool, which makes them brittle when new claims or counterarguments appear during discussion. The paper asks: can heterogeneous tools plus iterative query rewriting and round-level grounding scores reduce hallucination and raise verification accuracy?
Main Contribution
Tool-MAD: a multi-agent debate framework that assigns different external tools to agents (RAG over a static corpus vs live Search API) and allows adaptive retrieval across debate rounds.
Adaptive query formulation: agents iteratively rewrite queries based on opponent answers to fetch new, targeted evidence during the debate.
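The adaptive retrieval loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Agent`, `rewrite_query`, `run_debate`, `static_rag`, and `web_search` are hypothetical names, and the string-based answer and rewrite logic stand in for what would be LLM calls and real retrieval backends.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical retriever type: takes a query string, returns evidence snippets.
Retriever = Callable[[str], List[str]]

CORPUS = ["Paris is the capital of France."]  # toy static corpus (assumption)


def _words(text: str) -> set:
    return {w.lower().strip(".,") for w in text.split()}


def static_rag(query: str) -> List[str]:
    # Toy RAG: return corpus sentences sharing a word longer than 3 characters.
    terms = {w for w in _words(query) if len(w) > 3}
    return [doc for doc in CORPUS if terms & _words(doc)]


def web_search(query: str) -> List[str]:
    # Placeholder for a live search API call; returns nothing in this sketch.
    return []


@dataclass
class Agent:
    name: str
    retrieve: Retriever
    history: List[str] = field(default_factory=list)

    def answer(self, claim: str, query: str) -> str:
        # Stand-in for an LLM call that labels the claim from retrieved evidence.
        evidence = self.retrieve(query)
        label = "SUPPORTS" if evidence else "NOT ENOUGH INFO"
        reply = f"{label} | evidence: {evidence}"
        self.history.append(reply)
        return reply


def rewrite_query(claim: str, opponent_reply: str) -> str:
    # Stand-in for LLM-based rewriting: fold the opponent's reply into the query.
    return f"{claim} {opponent_reply}"


def run_debate(claim: str, rag_agent: Agent, search_agent: Agent,
               rounds: int = 3) -> List[Tuple[int, str, str]]:
    transcript = []
    rag_query = search_query = claim
    for round_no in range(1, rounds + 1):
        rag_reply = rag_agent.answer(claim, rag_query)
        search_reply = search_agent.answer(claim, search_query)
        transcript.append((round_no, rag_reply, search_reply))
        # Each agent rewrites its next query based on the opponent's answer.
        rag_query = rewrite_query(claim, search_reply)
        search_query = rewrite_query(claim, rag_reply)
    return transcript


transcript = run_debate("Paris is the capital of France.",
                        Agent("rag", static_rag), Agent("search", web_search),
                        rounds=2)
```

In a real pipeline the transcript, together with per-round grounding scores, would be handed to a Judge agent; that step is omitted here.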
Key Findings
Tool-MAD improves fact-verification accuracy over prior debate systems.
Adaptive query rewriting helps retrieval and accuracy.
Results
| Metric | Model group | Tool-MAD | MADKE | MAD | Delta vs MADKE | Delta vs MAD | Datasets (averaged) | Evidence Ref |
|---|---|---|---|---|---|---|---|---|
| Exact Match (average) | GPT-4o | 71.0 | 68.0 | 52.9 | +3.0 | +18.1 | FEVER, FEVEROUS, FAVIQ, AVERITEC | Table III |
| Exact Match (average) | Llama-3.3-70B | 74.0 | 56.5 | 45.9 | +17.5 | +28.1 | FEVER, FEVEROUS, FAVIQ, AVERITEC | Table III |
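As a quick arithmetic check, the deltas in the table above can be recomputed from the reported group averages. The dictionary layout and the `deltas` helper are illustrative names only; the numbers are copied from the table.

```python
# Group averages as reported in the table (exact match, percent).
reported = {
    "GPT-4o": {"Tool-MAD": 71.0, "MADKE": 68.0, "MAD": 52.9},
    "Llama-3.3-70B": {"Tool-MAD": 74.0, "MADKE": 56.5, "MAD": 45.9},
}


def deltas(scores: dict) -> dict:
    # Difference between Tool-MAD and each baseline, in percentage points.
    base = scores["Tool-MAD"]
    return {name: round(base - value, 1)
            for name, value in scores.items() if name != "Tool-MAD"}


for model, scores in reported.items():
    print(model, deltas(scores))
```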
What To Try In 7 Days
Run a two-agent pipeline, with one agent using RAG over your corpus and the other using a web search API, and compare their outputs on 200 representative claims.
Add faithfulness + answer‑relevance scoring per output and reject low‑scoring answers to reduce false positives.
Enable one round of query rewriting based on counterarguments and measure EM/accuracy lift versus single-pass retrieval.
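The scoring and rejection step in the list above could be prototyped along these lines. The source does not say how faithfulness and answer relevance are computed, so the token-overlap scores here are a crude stand-in chosen for illustration, and the 0.5 threshold is arbitrary. All function names are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    # Fraction of the tokens in `a` that also appear in `b`.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta) if ta else 0.0


def faithfulness(answer: str, documents: List[str]) -> float:
    # Proxy: how much of the answer is covered by the retrieved documents.
    return overlap(answer, " ".join(documents))


def answer_relevance(answer: str, claim: str) -> float:
    # Proxy: how much of the claim is addressed by the answer.
    return overlap(claim, answer)


def accept(answer: str, claim: str, documents: List[str],
           threshold: float = 0.5) -> bool:
    # Reject answers that score low on either proxy.
    return (faithfulness(answer, documents) >= threshold
            and answer_relevance(answer, claim) >= threshold)


def exact_match(predictions: List[str], gold: List[str]) -> float:
    # Percentage of predicted labels equal to gold labels (case-insensitive).
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)
```

To measure the lift from query rewriting, compute `exact_match` once for single-pass retrieval and once with rewriting enabled on the same claims, then compare the two numbers.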
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Higher inference cost and latency: multi-round debates trigger multiple LLM calls per claim.
Experiments use 200 sampled instances per dataset; large-scale variability is untested.
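To get a feel for what 200 instances per dataset implies, the sketch below computes a normal-approximation 95% margin of error for an accuracy estimate. Plugging in 71.0 is purely illustrative: that figure is an average across four datasets, not a single-dataset score, and `accuracy_margin` is a hypothetical helper name.

```python
import math


def accuracy_margin(accuracy_pct: float, n: int, z: float = 1.96) -> float:
    # Half-width of a normal-approximation confidence interval, in points.
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


print(round(accuracy_margin(71.0, 200), 1))  # about 6.3 points either way
```

Under this approximation, differences of a few points on a single 200-instance dataset fall within sampling noise, while the larger gaps are less likely to.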
When Not To Use
When strict latency or low-cost inference is required (for example, real-time apps).
When only a single trusted, high-precision data source is available (RAG alone may suffice).
Failure Modes
Tool disagreement on recency: live web search can return evidence newer than the dataset's label timestamps, contradicting those labels and causing unstable outcomes.
Over-debate speculation: extra rounds can amplify unsupported inferences and slightly reduce accuracy beyond round 3.

