Split TableQA into a Data Leader plus Database and Knowledge-Graph teams to cut hallucinations and boost multi-hop answers

March 10, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao

Why It Matters For Business

DataFactory trades higher query cost for much better accuracy and explainability on complex table queries, making it useful for teams that need reliable multi-hop analytics and traceable evidence from enterprise tables.

Summary TLDR

DataFactory is a multi-agent TableQA system: a central Data Leader (using ReAct-style reasoning) coordinates a Database Team (SQL) and a Knowledge Graph Team (Cypher/Neo4j). It builds a knowledge graph from tables, stores historical QA examples in a vector DB for retrieval, and uses context-engineered prompts to reduce hallucination. Evaluated across TabFact, WikiTableQuestions, and FeTaQA with eight LLMs, DataFactory reports large average gains over baselines (≈+20.2% TabFact, +23.9% WikiTQ) while trading higher token cost for clearer, multi-step reasoning and explainable provenance.
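
The division of labour described above can be pictured as a small routing step. The following is a minimal sketch, not DataFactory's code: `DataLeader`, `database_team`, `kg_team`, `TeamReply`, and the keyword-based routing rule are all hypothetical stand-ins for decisions the paper assigns to LLM agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TeamReply:
    team: str      # which team produced the evidence
    query: str     # the SQL or Cypher text the team would run
    evidence: str  # result summary handed back to the leader


def database_team(question: str) -> TeamReply:
    # Stand-in: a real team would generate and execute SQL with an LLM.
    return TeamReply("database", "SELECT ... (hypothetical SQL)", f"rows for: {question}")


def kg_team(question: str) -> TeamReply:
    # Stand-in: a real team would generate and execute Cypher against Neo4j.
    return TeamReply("knowledge_graph", "MATCH ... (hypothetical Cypher)", f"paths for: {question}")


@dataclass
class DataLeader:
    teams: Dict[str, Callable[[str], TeamReply]]
    trace: List[TeamReply] = field(default_factory=list)

    def route(self, question: str) -> str:
        # Hypothetical rule: relationship-style wording goes to the graph team.
        relational = ("related", "connected", "linked", "between")
        if any(word in question.lower() for word in relational):
            return "knowledge_graph"
        return "database"

    def answer(self, question: str) -> str:
        reply = self.teams[self.route(question)](question)
        self.trace.append(reply)  # keep provenance so the answer can be traced
        return f"[{reply.team}] {reply.evidence}"


leader = DataLeader({"database": database_team, "knowledge_graph": kg_team})
print(leader.answer("Which players are linked to team X?"))
```

Keeping every `TeamReply` in `trace` is one simple way to retain the query text behind each answer, which is the kind of provenance the summary highlights.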

Problem Statement

Current LLM-based TableQA struggles with limited context length, hallucinations, and weak multi-hop relational reasoning. Single-agent pipelines mix tasks (query generation, retrieval, analysis) in one prompt chain and lack specialization, so questions that combine table lookups with relationships between entities are answered unreliably and are hard to trace.

Main Contribution

A tripartite multi-agent architecture: Data Leader (planner), Database Team (SQL), and Knowledge Graph Team (Cypher/Neo4j) for complementary skills.

A formal data-to-knowledge-graph mapping (𝒯: D×S×R→G) and practical algorithms for entity extraction, ID generation, merging, and relationship discovery.

Context-engineered retrieval-augmented prompting (historical QA + schema + domain knowledge) to reduce hallucination in Text-to-SQL/Cypher.

Extensive empirical study across three TableQA benchmarks and eight LLMs, plus ablations on KG integration and collaboration frequency.
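
To make the data-to-knowledge-graph mapping more concrete, here is a minimal sketch of the four steps named above (entity extraction, ID generation, merging, relationship discovery). It is an illustration under stated assumptions, not the paper's algorithm: `entity_id`, `build_graph`, the hash-based IDs, and the column-pair relation rules are hypothetical, and the inputs loosely play the roles of D (rows), S (schema), and R (relation rules), with the returned nodes and edges standing in for G.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Edge = Tuple[str, str, str]  # (source node id, relation name, target node id)


def entity_id(entity_type: str, value: str) -> str:
    # Deterministic ID: the same (type, normalized value) always maps to one node.
    key = f"{entity_type}:{value.strip().lower()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def build_graph(
    rows: List[Row],
    schema: Dict[str, str],                  # column name -> entity type
    relations: List[Tuple[str, str, str]],   # (source column, relation, target column)
) -> Tuple[Dict[str, Dict[str, str]], Set[Edge]]:
    nodes: Dict[str, Dict[str, str]] = {}
    edges: Set[Edge] = set()
    for row in rows:
        ids: Dict[str, str] = {}
        for column, entity_type in schema.items():   # entity extraction
            value = row.get(column, "")
            if not value.strip():
                continue
            node_id = entity_id(entity_type, value)  # ID generation
            # Merging: an existing node with the same ID is reused, not duplicated.
            nodes.setdefault(node_id, {"type": entity_type, "name": value.strip()})
            ids[column] = node_id
        for source_col, relation, target_col in relations:  # relationship discovery
            if source_col in ids and target_col in ids:
                edges.add((ids[source_col], relation, ids[target_col]))
    return nodes, edges


rows = [
    {"player": "Ana Silva", "team": "Lions"},
    {"player": "ana silva ", "team": "Lions"},  # same entity after normalization
    {"player": "Ben Okoro", "team": "Hawks"},
]
schema = {"player": "Player", "team": "Team"}
relations = [("player", "PLAYS_FOR", "team")]
nodes, edges = build_graph(rows, schema, relations)
print(len(nodes), "nodes,", len(edges), "edges")
```

Merging on a normalized string is deliberately naive. It also shows how the over-merging risk listed under Limitations can arise: two different entities that share a name would collapse into one node.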

Key Findings

Multi-agent DataFactory significantly improves accuracy on standard TableQA benchmarks versus baselines.

Numbers: TabFact avg 84.0% (↑20.2% over baselines)

Knowledge Graph team provides consistent gains when added to SQL-only pipelines.

Numbers: Avg improvements: TabFact +5.5%, WikiTQ +14.4%, FeTaQA ROUGE-2 +17.1%

Collaboration frequency has an optimal range; too many interactions hurt performance.

Numbers: Best at 1–3 calls (TabFact peak 85.4%); WikiTQ drops to 20.2% at 10+ calls

DataFactory increases LLM token usage compared to prompt-only methods.

Numbers: Avg tokens: TabFact 3,464; WikiTQ 4,982 (higher than prompt-based methods)

Results

Accuracy

Value: 84.0%

Baseline: other methods average

Accuracy

Value: 72.8%

Baseline: other methods average

FeTaQA ROUGE-2 F

Value: Varies by model (examples: Claude 0.3885, GPT-4o mini 0.3320)

Baseline: model-specific no-KG configs

Average token usage (input+output)

Value: TabFact 3,464 tokens; WikiTQ 4,982 tokens

Baseline: lower for prompt-only methods

Who Should Care

What To Try In 7 Days

Run DataFactory on one critical table: compare SQL-only answers to DataFactory output for 20 representative queries.

Build a tiny knowledge graph (1–3 tables) and run a set of multi-hop questions to measure KG gains.

Log token use and set a default 1–3 call limit; measure accuracy vs cost to set a production stopping rule.
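
The comparison steps above can be scripted with a small harness that reports accuracy and average token use per pipeline. This is a minimal sketch: `RunResult`, `evaluate`, and the two stub pipelines are hypothetical, and the answers and token counts in the stubs are made up purely so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RunResult:
    answer: str
    tokens: int  # input + output tokens reported for this query


def evaluate(pipeline: Callable[[str], RunResult], cases: List[Tuple[str, str]]) -> Dict[str, float]:
    correct = 0
    tokens = 0
    for question, gold in cases:
        result = pipeline(question)
        # Case-insensitive exact match; swap in a task-specific metric if needed.
        correct += int(result.answer.strip().lower() == gold.strip().lower())
        tokens += result.tokens
    return {"accuracy": correct / len(cases), "avg_tokens": tokens / len(cases)}


# Made-up stubs so the sketch runs; replace them with calls to the real pipelines.
def sql_only(question: str) -> RunResult:
    return RunResult({"q1": "a", "q2": "b"}.get(question, "unknown"), tokens=100)


def multi_agent(question: str) -> RunResult:
    return RunResult({"q1": "a", "q2": "b", "q3": "c"}.get(question, "unknown"), tokens=400)


cases = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]
report = {name: evaluate(fn, cases) for name, fn in [("sql_only", sql_only), ("multi_agent", multi_agent)]}
for name, stats in report.items():
    print(name, stats)
```

Running the same cases at different call limits and comparing the resulting `accuracy` and `avg_tokens` pairs gives the data needed to choose a production stopping rule.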

Agent Features

Memory

  • Historical QA stored as vector embeddings for retrieval
  • Session history for multi-turn clarification
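
Retrieval of historical QA by embedding similarity can be sketched as follows. This is an illustration only: `QAMemory` is a hypothetical in-memory stand-in for a vector DB, and `embed` uses token counts in place of a real embedding model.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy embedding: token counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class QAMemory:
    def __init__(self) -> None:
        self.items: List[Tuple[str, str, Counter]] = []

    def add(self, question: str, query: str) -> None:
        self.items.append((question, query, embed(question)))

    def top_k(self, question: str, k: int = 2) -> List[Tuple[str, str]]:
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[2]), reverse=True)
        return [(item[0], item[1]) for item in ranked[:k]]


memory = QAMemory()
memory.add("total sales by region", "SELECT region, SUM(sales) FROM orders GROUP BY region")
memory.add("players on each team", "MATCH (p:Player)-[:PLAYS_FOR]->(t:Team) RETURN t.name, p.name")
memory.add("average sales per month", "SELECT month, AVG(sales) FROM orders GROUP BY month")
hits = memory.top_k("total sales by month", k=2)
print(hits)
```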

Planning

  • ReAct paradigm (reason + act loops)
  • Three-stage explore-verify-analyze
  • Adaptive step-count planning

Tool Use

  • SQL execution
  • Cypher queries (Neo4j)
  • Vector DB retrieval of historical QA
  • LLM prompting for planning and synthesis
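
As an illustration of the SQL execution tool, the sketch below wraps a query in a small guard and returns the executed statement alongside the rows. It assumes SQLite purely for convenience (the source does not name the database engine), and `run_select` with its SELECT-only check is a hypothetical helper, not part of DataFactory.

```python
import sqlite3
from typing import Any, Dict


def run_select(conn: sqlite3.Connection, sql: str, max_rows: int = 50) -> Dict[str, Any]:
    # Crude guard (assumption): accept one SELECT statement and nothing else.
    # It also rejects legitimate queries that contain ';' inside a string literal.
    statement = sql.strip().rstrip(";")
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    cursor = conn.execute(statement)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchmany(max_rows)  # cap the rows passed back to the LLM
    return {"sql": statement, "columns": columns, "rows": rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.0)],
)
result = run_select(
    conn,
    "SELECT region, SUM(sales) AS total FROM orders GROUP BY region ORDER BY region;",
)
print(result["columns"], result["rows"])
```

Returning the statement text with the rows lets the calling agent cite exactly which query produced a given piece of evidence.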

Frameworks

  • ReAct
  • Retrieval-augmented prompting
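
Retrieval-augmented prompting for Text-to-SQL/Cypher amounts to assembling schema, domain knowledge, and retrieved QA examples into one prompt. The following is a minimal sketch with a hypothetical `build_prompt` helper; the section labels and their ordering are assumptions, not the paper's template.

```python
from typing import List, Tuple


def build_prompt(
    question: str,
    schema: List[str],
    examples: List[Tuple[str, str]],  # (past question, past query) pairs from retrieval
    domain_notes: List[str],
    target: str = "SQL",
) -> str:
    parts = [
        f"Translate the question into {target}.",
        "Use only the tables and columns listed under Schema.",
        "",
        "Schema:",
    ]
    parts += [f"- {item}" for item in schema]
    if domain_notes:
        parts += ["", "Domain knowledge:"] + [f"- {note}" for note in domain_notes]
    if examples:
        parts += ["", "Similar solved questions:"]
        for past_question, past_query in examples:
            parts += [f"Q: {past_question}", f"{target}: {past_query}"]
    # The new question goes last so the model completes the query directly after it.
    parts += ["", f"Q: {question}", f"{target}:"]
    return "\n".join(parts)


prompt = build_prompt(
    question="total sales by month",
    schema=["orders(region TEXT, month TEXT, sales REAL)"],
    examples=[("total sales by region", "SELECT region, SUM(sales) FROM orders GROUP BY region")],
    domain_notes=["sales is recorded in USD"],
)
print(prompt)
```

Grounding the model in an explicit schema and a few solved examples is the mechanism the summary credits with reducing hallucinated table and column names.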

Is Agentic

true

Architectures

  • Tripartite: Data Leader + Database Team + Knowledge Graph Team

Collaboration

  • Natural-language consultation between agents
  • Dynamic team dispatch based on capability and evidence
  • Conflict arbitration via provenance checks
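
The source does not spell out how provenance-based arbitration works, so the sketch below shows just one plausible reading: keep only answers that cite a query and backing rows, accept them if they agree, and otherwise flag a conflict. `Claim`, `arbitrate`, and the "more evidence wins" tie-break are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Claim:
    team: str
    answer: str
    query: str          # SQL or Cypher that produced the answer
    evidence_rows: int  # number of rows or paths backing the answer


def arbitrate(claims: List[Claim]) -> Dict[str, Any]:
    # Assumed rule: a claim counts only if it cites a query and at least one backing row.
    supported = [c for c in claims if c.query.strip() and c.evidence_rows > 0]
    if not supported:
        return {"status": "unsupported", "answer": None, "sources": []}
    answers = {c.answer for c in supported}
    if len(answers) == 1:
        return {"status": "agreed", "answer": answers.pop(), "sources": [c.team for c in supported]}
    # Supported claims disagree: surface the best-evidenced one and flag it for review.
    best = max(supported, key=lambda c: c.evidence_rows)
    return {"status": "conflict", "answer": best.answer, "sources": [best.team]}


decision = arbitrate([
    Claim("database", "Lions", "SELECT team FROM roster WHERE player = 'Ana Silva'", 1),
    Claim("knowledge_graph", "Hawks", "", 0),  # no provenance, so it is ignored
])
print(decision)
```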

Optimization Features

Token Efficiency

  • Measured token cost: TabFact 3,464; WikiTQ 4,982
  • Trade-off: higher tokens but clearer multi-step reasoning

Infra Optimization

  • Dual deployment modes: local open-source or cloud API
  • Vector DB for efficient semantic retrieval

System Optimization

  • Modular architecture supports local or cloud deployment
  • Batch Neo4j ingestion and in-memory construction for KG building

Inference Optimization

  • Adaptive stopping to limit multi-agent calls (recommended 1–3)
  • Streamlined use of SQL/Cypher only (avoid heavy code execution)
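
Adaptive stopping with a call budget can be sketched as a loop that ends either when the evidence looks sufficient or when the budget is spent. This is an illustration under stated assumptions: `consult`, the self-reported confidence score, and the 0.8 threshold are hypothetical; only the 1–3 call range comes from the source.

```python
from typing import Callable, Dict, List, Tuple

# A team returns (evidence text, self-reported confidence in [0, 1]); both are assumptions.
Team = Callable[[str], Tuple[str, float]]


def consult(
    question: str,
    teams: List[Team],
    max_calls: int = 3,             # upper end of the recommended 1-3 range
    confidence_target: float = 0.8,
) -> Dict[str, object]:
    evidence: List[str] = []
    calls = 0
    for team in teams:
        if calls >= max_calls:
            break  # hard budget reached
        text, confidence = team(question)
        calls += 1
        evidence.append(text)
        if confidence >= confidence_target:
            break  # enough support, stop early
    return {"calls": calls, "evidence": evidence}


def weak_team(question: str) -> Tuple[str, float]:
    return ("partial rows", 0.4)


def strong_team(question: str) -> Tuple[str, float]:
    return ("matching path", 0.9)


early = consult("q", [weak_team, strong_team, weak_team, weak_team])
capped = consult("q", [weak_team] * 10)
print(early["calls"], capped["calls"])
```

A hard cap guards against the error accumulation that the reported results associate with 10 or more calls, while the early exit avoids spending tokens once an answer is already well supported.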

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Higher token and interaction cost compared to prompt-only methods
  • Automated KG construction can fail or over-merge on noisy, inconsistent tables
  • Performance depends on underlying LLM quality and may vary by model
  • No production code has been released yet; the demo site promises a future release

When Not To Use

  • When single-step SQL answers are sufficient and latency/cost must be minimal
  • When strict data privacy forbids cloud LLM APIs and no local LLM is viable
  • For tiny tables where added KG construction overhead outweighs benefit
  • Real-time low-latency systems where multi-agent rounds cause unacceptable delay

Failure Modes

  • Excessive multi-agent calls causing error accumulation and performance collapse
  • Specification violations when role boundaries are not enforced
  • Hallucinated SQL/Cypher generation if context engineering is incomplete
  • KG merging conflicts creating incorrect entity links

Core Entities

Models

  • Claude 4.0 Sonnet
  • Gemini 2.5 Flash
  • GPT-4o mini
  • Deepseek-V3
  • Qwen3-235B-A22B
  • Qwen3-32B
  • Qwen3-30B-A3B
  • Qwen3-14B

Metrics

  • Accuracy
  • Exact Match
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • Token usage

Datasets

  • TabFact
  • WikiTableQuestions
  • FeTaQA

Benchmarks

  • TabFact
  • WikiTableQuestions
  • FeTaQA

Context Entities

Models

  • TAPAS
  • TAPEX

Metrics

  • Accuracy
  • ROUGE

Datasets

  • TabFact
  • WikiTableQuestions
  • FeTaQA

Benchmarks

  • TabFact
  • WikiTableQuestions
  • FeTaQA