Let two agents use different retrieval tools and iteratively query the web to cut hallucinations in fact-checking

January 8, 2026 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang

Why It Matters For Business

If your product relies on factual outputs from LLMs, combining complementary retrieval tools and round‑wise grounding checks reduces hallucinations and raises verification accuracy with moderate engineering cost.

Summary TLDR

Tool-MAD is a multi-agent debate system where two agents use different external tools (a RAG module over a static corpus and a live web search API) and iteratively rewrite queries during a multi-round debate. Each answer is scored for faithfulness (grounding in retrieved documents) and answer relevance; a Judge agent uses these stability scores plus debate history to pick the final label. On four fact‑verification benchmarks and two medical QA sets, Tool‑MAD improves exact-match accuracy versus prior debate systems (reported up to +35% vs MAD and +5.5% vs MADKE on evaluated datasets). The framework is most helpful when evidence sources are complementary, query rewriting is enabled, and three

Problem Statement

Large language models still hallucinate in fact verification. Prior multi-agent debates either use only internal knowledge or a fixed external evidence pool, which makes them brittle when new claims or counterarguments appear during discussion. The paper asks: can heterogeneous tools plus iterative query rewriting and round-level grounding scores reduce hallucination and raise verification accuracy?

Main Contribution

Tool-MAD: a multi-agent debate framework that assigns different external tools to agents (RAG over a static corpus vs live Search API) and allows adaptive retrieval across debate rounds.

Adaptive query formulation: agents iteratively rewrite queries based on opponent answers to fetch new, targeted evidence during the debate.

Stability score: integrate faithfulness (evidence grounding) and answer relevance (question alignment) into round-level scoring to detect hallucinations and guide the Judge agent's final decision.

Comprehensive evaluation on four fact verification and two medical QA datasets plus ablations showing benefits of tool diversity, query rewriting, and scoring feedback.
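
The debate flow described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Debater` and `Turn` types, the `run_debate` function, the callable signatures, and the choice to return the shared label directly on consensus are all hypothetical. Stability-score feedback is omitted for brevity, and the lambdas stand in for real retrieval and LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Turn:
    round_no: int
    agent: str
    query: str
    evidence: List[str]
    label: str


@dataclass
class Debater:
    name: str
    retrieve: Callable[[str], List[str]]      # the agent's tool (RAG or web search)
    answer: Callable[[str, List[str]], str]   # stand-in for an LLM call returning a label
    rewrite: Callable[[str, str], str]        # (claim, opponent label) -> next query


def run_debate(
    claim: str,
    a: Debater,
    b: Debater,
    judge: Callable[[str, List[Turn]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Turn]]:
    """Run up to max_rounds; stop early on consensus, otherwise ask the judge."""
    history: List[Turn] = []
    queries: Dict[str, str] = {a.name: claim, b.name: claim}
    for round_no in range(1, max_rounds + 1):
        labels: Dict[str, str] = {}
        for agent in (a, b):
            query = queries[agent.name]
            evidence = agent.retrieve(query)
            label = agent.answer(claim, evidence)
            labels[agent.name] = label
            history.append(Turn(round_no, agent.name, query, evidence, label))
        if labels[a.name] == labels[b.name]:
            # Assumption: agreement ends the debate without a judge call.
            return labels[a.name], history
        # Each agent rewrites its query in response to the opponent's answer.
        queries[a.name] = a.rewrite(claim, labels[b.name])
        queries[b.name] = b.rewrite(claim, labels[a.name])
    return judge(claim, history), history


# Toy stand-ins so the sketch runs without any external service.
rag_agent = Debater(
    name="rag",
    retrieve=lambda q: [f"corpus passage for: {q}"],
    answer=lambda claim, ev: "SUPPORTS",
    rewrite=lambda claim, opp: f"{claim} counter {opp}",
)
search_agent = Debater(
    name="search",
    retrieve=lambda q: [f"web hit for: {q}"],
    # Changes its answer only once the rewritten query surfaces new evidence.
    answer=lambda claim, ev: "SUPPORTS" if "counter" in ev[0] else "REFUTES",
    rewrite=lambda claim, opp: f"{claim} counter {opp}",
)

final_label, debate_history = run_debate(
    "The Eiffel Tower is in Paris",
    rag_agent,
    search_agent,
    judge=lambda claim, history: "NOT ENOUGH INFO",
)
```

In this toy run the agents disagree in round 1 and agree in round 2, so the judge is never called. Replacing the lambdas with real retrieval and model calls is where most of the engineering effort would go.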

Key Findings

Tool-MAD improves fact-verification accuracy over prior debate systems.

Numbers: Up to +35.0% vs MAD and +5.5% vs MADKE on evaluated benchmarks

Adaptive query rewriting helps retrieval and accuracy.

Numbers: FEVER +2.0, FEVEROUS +2.5, AVeriTeC +1.0 when using query formulation

Round-level scoring (stability score) improves final accuracy and filters low-quality answers.

Numbers: Scoring feedback yields +4.5% EM on FAVIQ; improves EM across datasets

A small number of debate rounds is optimal in practice.

Numbers: Accuracy improves up to round 3, then declines slightly at round 4 on FEVER

Tool-MAD is robust across domains and tool swaps.

Numbers: Outperforms baselines on MEDQA (77 EM) and PubMedQA (29 EM); maintains gains when switching RAG corpora or search index

Results

Exact Match (average)

Value: 71.0 (Tool-MAD, GPT-4o group average)

Baseline: MADKE average 68.0; MAD average 52.9

Exact Match (average)

Value: 74.0 (Tool-MAD, Llama-3.3-70B group average)

Baseline: MADKE average 56.5; MAD average 45.9 (same group)

Exact Match (dataset-level)

Value: FEVER 0.73 (Tool-MAD with GPT‑4o‑mini)

Exact Match (medical QA)

Value: MEDQA 77; PubMedQA 29 (Tool-MAD)

Baseline: MAD: MEDQA 58 / PubMedQA 22.5; MADKE: MEDQA 74 / PubMedQA 21.5

Ablation: query formulation

Value: FEVER +2.0; FEVEROUS +2.5; AVeriTeC +1.0

Baseline: Tool-MAD without query rewriting

Ablation: scoring feedback

Value: FAVIQ +4.5 EM with stability-score feedback

Baseline: Tool-MAD without scoring-based validation
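
To make the reported gaps easier to compare, the snippet below subtracts each baseline from the Tool-MAD figure using only the numbers listed in this Results section. It assumes all values are Exact Match on a 0 to 100 scale, so the differences are percentage points. The `reported` dictionary and the `gains` helper are illustrative names, not part of the paper.

```python
from typing import Dict

# Exact Match values copied from the Results section above.
reported: Dict[str, Dict[str, float]] = {
    "GPT-4o group average": {"Tool-MAD": 71.0, "MADKE": 68.0, "MAD": 52.9},
    "Llama-3.3-70B group average": {"Tool-MAD": 74.0, "MADKE": 56.5, "MAD": 45.9},
    "MEDQA": {"Tool-MAD": 77.0, "MADKE": 74.0, "MAD": 58.0},
    "PubMedQA": {"Tool-MAD": 29.0, "MADKE": 21.5, "MAD": 22.5},
}


def gains(row: Dict[str, float]) -> Dict[str, float]:
    """Tool-MAD minus each baseline, rounded to one decimal place."""
    return {
        system: round(row["Tool-MAD"] - score, 1)
        for system, score in row.items()
        if system != "Tool-MAD"
    }


gain_table = {setting: gains(row) for setting, row in reported.items()}
for setting, row in gain_table.items():
    print(f"{setting}: +{row['MADKE']} vs MADKE, +{row['MAD']} vs MAD")
```

On these figures, the margin over MADKE is smallest for the GPT-4o group and MEDQA (3.0 points each) and largest for the Llama-3.3-70B group (17.5 points).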

Who Should Care

What To Try In 7 Days

Run a two‑agent pipeline, with one agent using RAG over your corpus and the other a web search API, and compare their outputs on 200 representative claims.

Add faithfulness + answer‑relevance scoring per output and reject low‑scoring answers to reduce false positives.

Enable one round of query rewriting based on counterarguments and measure EM/accuracy lift versus single-pass retrieval.
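
The measurement side of these three experiments can be sketched as follows. This is a rough outline, assuming you already have per-answer faithfulness and answer-relevance scores (for example from RAGAS) stored alongside predictions and gold labels. The `ScoredAnswer` type, the function names, and the 0.7 defaults are hypothetical placeholders, not values prescribed by the paper for each metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredAnswer:
    claim_id: str
    predicted: str
    gold: str
    faithfulness: float      # grounding in retrieved documents, 0..1
    answer_relevance: float  # alignment with the question, 0..1


def exact_match(rows: List[ScoredAnswer]) -> float:
    """Percentage of rows whose predicted label equals the gold label."""
    if not rows:
        return 0.0
    hits = sum(r.predicted.strip().lower() == r.gold.strip().lower() for r in rows)
    return 100.0 * hits / len(rows)


def keep_grounded(
    rows: List[ScoredAnswer],
    min_faithfulness: float = 0.7,  # hypothetical default, tune per domain
    min_relevance: float = 0.7,     # hypothetical default, tune per domain
) -> List[ScoredAnswer]:
    """Drop answers whose scores fall below either threshold."""
    return [
        r for r in rows
        if r.faithfulness >= min_faithfulness and r.answer_relevance >= min_relevance
    ]


sample = [
    ScoredAnswer("c1", "SUPPORTS", "SUPPORTS", 0.90, 0.90),
    ScoredAnswer("c2", "REFUTES", "SUPPORTS", 0.40, 0.80),
    ScoredAnswer("c3", "refutes", "REFUTES", 0.80, 0.75),
    ScoredAnswer("c4", "SUPPORTS", "REFUTES", 0.85, 0.90),
]
kept = keep_grounded(sample)
coverage = len(kept) / len(sample)
print(f"EM all: {exact_match(sample):.1f}")
print(f"EM kept: {exact_match(kept):.1f} at coverage {coverage:.2f}")
```

Reporting coverage next to Exact Match matters here: rejecting low-scoring answers raises accuracy on what remains, but the rejected claims still need a fallback such as human review. The same `exact_match` function can compare single-pass retrieval against one round of query rewriting on the same 200 claims.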

Agent Features

Memory

  • short-term debate history (per-claim rounds)

Planning

  • iterative query rewriting across rounds

Tool Use

  • RAG (vector retrieval)
  • live Search API
  • document summarization (in PubMedQA pipeline)

Frameworks

  • RAGAS metrics (faithfulness, answer relevance)

Is Agentic

true

Architectures

  • multi-agent debate
  • separate RAG and Search agents
  • Judge aggregator

Collaboration

  • adversarial/collaborative debate with Judge resolution

Optimization Features

Token Efficiency

  • three-round cap to limit extra LLM calls

Infra Optimization

  • use of Milvus vector DB for fast semantic search

Inference Optimization

  • early termination when agents reach consensus

Reproducibility

Data Urls

  • FEVER
  • FEVEROUS
  • FAVIQ
  • AVeriTeC
  • MEDQA
  • PubMedQA

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher inference cost and latency: multi-round debates trigger multiple LLM calls per claim.
  • Experiments use 200 sampled instances per dataset; large-scale variability is untested.
  • System depends on external search API availability and index quality.
  • Stability thresholds (0.7/0.8) are empirically chosen and may need retuning per domain.
  • Current setup uses only two debater agents; more agents or richer judges were not explored.
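
Because the thresholds are empirical, one way to retune them for a new domain is a simple sweep on a labelled development set. The sketch below is an assumption-laden illustration: `sweep_threshold` is a hypothetical helper, it treats the stability score as a single number per answer, and the sample scores are made up for the demonstration rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def sweep_threshold(
    scores: Sequence[float],
    correct: Sequence[bool],
    candidates: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, coverage, accuracy on kept answers) per candidate."""
    total = len(scores)
    results = []
    for threshold in candidates:
        kept = [ok for score, ok in zip(scores, correct) if score >= threshold]
        coverage = len(kept) / total if total else 0.0
        accuracy = sum(kept) / len(kept) if kept else 0.0
        results.append((threshold, coverage, accuracy))
    return results


# Made-up development-set scores and correctness flags, for illustration only.
dev_scores = [0.95, 0.85, 0.75, 0.65, 0.55]
dev_correct = [True, True, False, True, False]

sweep = sweep_threshold(dev_scores, dev_correct, candidates=(0.6, 0.7, 0.8))
for threshold, coverage, accuracy in sweep:
    print(f"threshold={threshold:.1f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

A stricter threshold keeps fewer answers but usually a cleaner set, so the right value depends on how costly a wrong verdict is compared with an unanswered claim.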

When Not To Use

  • If strict latency or low-cost inference is required (real-time apps).
  • When only a single trusted, high-precision data source is available (RAG alone may suffice).
  • If you cannot afford external API calls or do not have a curated retrieval corpus.

Failure Modes

  • Tool disagreement on recency: live web results may postdate a dataset's labels and contradict them, causing unstable outcomes.
  • Over-debate speculation: extra rounds can amplify unsupported inferences and slightly reduce accuracy beyond round 3.
  • Redundant retrievals when both agents use similar tools, limiting diversity gains.

Core Entities

Models

  • GPT-4o-mini
  • GPT-4o
  • Llama-3.3-70B-Instruct-Turbo
  • DeepSeek-R1

Metrics

  • Exact Match
  • Accuracy
  • Faithfulness
  • Answer Relevance
  • Stability Score

Datasets

  • FEVER
  • FEVEROUS
  • FAVIQ
  • AVeriTeC
  • MEDQA
  • PubMedQA