Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
If your product relies on factual outputs from LLMs, combining complementary retrieval tools and round‑wise grounding checks reduces hallucinations and raises verification accuracy with moderate engineering cost.
Summary TLDR
Tool-MAD is a multi-agent debate system where two agents use different external tools (a RAG module over a static corpus and a live web search API) and iteratively rewrite queries during a multi-round debate. Each answer is scored for faithfulness (grounding in retrieved documents) and answer relevance; a Judge agent uses these stability scores plus debate history to pick the final label. On four fact‑verification benchmarks and two medical QA sets, Tool‑MAD improves exact-match accuracy versus prior debate systems (reported up to +35% vs MAD and +5.5% vs MADKE on evaluated datasets). The framework is most helpful when evidence sources are complementary, query rewriting is enabled, and three
Problem Statement
Large language models still hallucinate in fact verification. Prior multi-agent debates either use only internal knowledge or a fixed external evidence pool, which makes them brittle when new claims or counterarguments appear during discussion. The paper asks: can heterogeneous tools plus iterative query rewriting and round-level grounding scores reduce hallucination and raise verification accuracy?
Main Contribution
Tool-MAD: a multi-agent debate framework that assigns different external tools to agents (RAG over a static corpus vs live Search API) and allows adaptive retrieval across debate rounds.
Adaptive query formulation: agents iteratively rewrite queries based on opponent answers to fetch new, targeted evidence during the debate.
Stability score: integrates faithfulness (evidence grounding) and answer relevance (question alignment) into round-level scoring to detect hallucinations and guide the Judge agent's final decision.
Comprehensive evaluation on four fact verification and two medical QA datasets plus ablations showing benefits of tool diversity, query rewriting, and scoring feedback.
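The debate flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_debate`, `Turn`, and the three callables (`retrievers`, `answer`, `rewrite`) are hypothetical names, and the Judge step and per-round scoring are left out. It assumes exactly two debaters, each starting from the raw claim as its first query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Turn:
    agent: str
    round_no: int
    query: str
    evidence: List[str]
    answer: str

def run_debate(
    claim: str,
    retrievers: Dict[str, Callable[[str], List[str]]],  # agent name -> tool (hypothetical)
    answer: Callable[[str, List[str]], str],            # (claim, evidence) -> label
    rewrite: Callable[[str, str], str],                 # (claim, opponent answer) -> new query
    max_rounds: int = 3,
) -> List[Turn]:
    """Run a two-agent debate and return the full history for a judge to read."""
    history: List[Turn] = []
    queries = {name: claim for name in retrievers}  # assumption: first query is the claim
    for round_no in range(1, max_rounds + 1):
        latest: Dict[str, str] = {}
        for name, retrieve in retrievers.items():
            evidence = retrieve(queries[name])
            latest[name] = answer(claim, evidence)
            history.append(Turn(name, round_no, queries[name], evidence, latest[name]))
        if len(set(latest.values())) == 1:
            break  # both agents agree, so stop early
        for name in retrievers:
            # Each agent rewrites its query in response to the other agent's answer.
            opponent = next(n for n in retrievers if n != name)
            queries[name] = rewrite(claim, latest[opponent])
    return history
```

In practice `retrievers` would map something like `"rag"` and `"search"` to wrappers around a vector store and a search API, and `answer` and `rewrite` would each wrap an LLM call; the returned history is what a Judge agent would consume.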
Key Findings
Tool-MAD improves fact-verification accuracy over prior debate systems.
Adaptive query rewriting helps retrieval and accuracy.
Round-level scoring (stability score) improves final accuracy and filters low-quality answers.
A small number of debate rounds (about three) is optimal in practice; extra rounds slightly reduce accuracy.
Tool-MAD is robust across domains and tool swaps.
Results
Exact Match (average)
Exact Match (dataset-level)
Exact Match (medical QA)
Ablation: query formulation
Ablation: scoring feedback
Who Should Care
What To Try In 7 Days
Run a two‑agent pipeline, with one agent using RAG over your corpus and the other using a web search API, and compare their outputs on 200 representative claims.
Add faithfulness + answer‑relevance scoring per output and reject low‑scoring answers to reduce false positives.
Enable one round of query rewriting based on counterarguments and measure EM/accuracy lift versus single-pass retrieval.
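One way to measure the experiments above is sketched here. It assumes you already have per-answer faithfulness and answer-relevance scores (for example from RAGAS) and gold labels; `ScoredAnswer`, `gate`, and `exact_match` are hypothetical helper names. The default cutoffs reuse the 0.7 and 0.8 values mentioned under Limitations, but which value applies to which metric is an assumption here, so treat both as tunable.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ScoredAnswer:
    label: str
    faithfulness: float  # grounding in retrieved evidence, in [0, 1]
    relevance: float     # alignment with the question, in [0, 1]

def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so 'Refuted ' matches 'refuted'."""
    return " ".join(label.strip().lower().split())

def exact_match(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Fraction of gold labels matched exactly; a rejected (None) answer counts as a miss."""
    hits = sum(
        p is not None and normalize(p) == normalize(g)
        for p, g in zip(predictions, gold)
    )
    return hits / len(gold)

def gate(
    answer: ScoredAnswer,
    min_faithfulness: float = 0.7,  # assumed mapping of the 0.7 threshold
    min_relevance: float = 0.8,     # assumed mapping of the 0.8 threshold
) -> Optional[str]:
    """Return the label only if both scores clear their cutoffs, else None."""
    ok = answer.faithfulness >= min_faithfulness and answer.relevance >= min_relevance
    return answer.label if ok else None
```

Comparing `exact_match([a.label for a in answers], gold)` with `exact_match([gate(a) for a in answers], gold)` shows how much accuracy is traded for rejecting low-scoring answers; the same `exact_match` call can compare single-pass retrieval against one round of query rewriting.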
Agent Features
Memory
- short-term debate history (per-claim rounds)
Planning
- iterative query rewriting across rounds
Tool Use
- RAG (vector retrieval)
- live Search API
- document summarization (in PubMedQA pipeline)
Frameworks
- RAGAS metrics (faithfulness, answer relevance)
Is Agentic
true
Architectures
- multi-agent debate
- separate RAG and Search agents
- Judge aggregator
Collaboration
- adversarial/collaborative debate with Judge resolution
Optimization Features
Token Efficiency
- three-round cap to limit extra LLM calls
Infra Optimization
- use of Milvus vector DB for fast semantic search
Inference Optimization
- early termination when agents reach consensus
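The effect of the round cap and early termination on cost can be estimated with simple arithmetic. The sketch below assumes one LLM call per debater per round plus one Judge call per claim; the source does not state these call counts, so `calls_per_turn` and `judge_calls` are assumptions to adjust to your own pipeline, and the function names are hypothetical.

```python
from typing import Dict

def llm_calls_per_claim(
    rounds_run: int,
    debaters: int = 2,
    calls_per_turn: int = 1,  # assumption: one LLM call per agent per round
    judge_calls: int = 1,     # assumption: the Judge runs once per claim
) -> int:
    """LLM calls for one claim whose debate ran for `rounds_run` rounds."""
    if rounds_run < 1:
        raise ValueError("a debate runs at least one round")
    return rounds_run * debaters * calls_per_turn + judge_calls

def mean_calls(stop_round_counts: Dict[int, int]) -> float:
    """Average calls per claim, given how many claims stopped after each round."""
    total_claims = sum(stop_round_counts.values())
    total_calls = sum(
        llm_calls_per_claim(rounds) * n for rounds, n in stop_round_counts.items()
    )
    return total_calls / total_claims
```

Under these assumptions a debate that hits the three-round cap costs 7 calls and one that reaches consensus in round 1 costs 3, so logging the stopping round per claim gives a direct view of the average overhead.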
Reproducibility
Data URLs
- FEVER
- FEVEROUS
- FaVIQ
- AVeriTeC
- MedQA
- PubMedQA
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher inference cost and latency: multi-round debates trigger multiple LLM calls per claim.
- Experiments use 200 sampled instances per dataset; large-scale variability is untested.
- System depends on external search API availability and index quality.
- Stability thresholds (0.7/0.8) are empirically chosen and may need retuning per domain.
- Current setup uses only two debater agents; more agents or richer judges were not explored.
When Not To Use
- If strict latency limits or low-cost inference are required (e.g., real-time apps).
- When only a single trusted, high-precision data source is available (RAG alone may suffice).
- If you cannot afford external API calls or do not have a curated retrieval corpus.
Failure Modes
- Tool disagreement on recency: live web search results may postdate the dataset labels and contradict them, causing unstable outcomes.
- Over-debate speculation: extra rounds can amplify unsupported inferences and slightly reduce accuracy beyond round 3.
- Redundant retrievals when both agents use similar tools, limiting diversity gains.
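The redundant-retrieval failure mode can be monitored with a simple overlap check. The sketch below is one possible diagnostic, not something the source describes: it computes the Jaccard overlap between the document identifiers (or URLs) the two agents retrieved. The function names and the 0.5 cutoff are hypothetical.

```python
from typing import Iterable

def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard overlap of two sets of document identifiers, in [0, 1]."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # nothing retrieved by either agent, so no overlap to report
    return len(set_a & set_b) / len(set_a | set_b)

def is_redundant(a: Iterable[str], b: Iterable[str], cutoff: float = 0.5) -> bool:
    """Flag a round where both agents fetched mostly the same evidence.

    The 0.5 cutoff is an illustrative assumption, not a value from the source.
    """
    return jaccard(a, b) >= cutoff
```

Tracking this value per round shows whether two tools actually contribute different evidence; consistently high overlap suggests the second agent adds cost without much diversity.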
Core Entities
Models
- GPT-4o-mini
- GPT-4o
- Llama-3.3-70B-Instruct-Turbo
- DeepSeek-R1
Metrics
- Exact Match
- Accuracy
- Faithfulness
- Answer Relevance
- Stability Score
Datasets
- FEVER
- FEVEROUS
- FaVIQ
- AVeriTeC
- MedQA
- PubMedQA

