Split TableQA into a Data Leader plus Database and Knowledge-Graph teams to cut hallucinations and boost multi-hop answers

March 10, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The system shows consistent, statistically significant accuracy gains on benchmarks, but raises token and engineering costs; adapt call limits and KG scope for production.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao

Why It Matters For Business

DataFactory trades higher query cost for much better accuracy and explainability on complex table queries, making it useful for teams that need reliable multi-hop analytics and traceable evidence from enterprise tables.

Summary TLDR

DataFactory is a multi-agent TableQA system: a central Data Leader (using ReAct-style reasoning) coordinates a Database Team (SQL) and a Knowledge Graph Team (Cypher/Neo4j). It builds a knowledge graph from tables, stores historical QA examples in a vector DB for retrieval, and uses context-engineered prompts to reduce hallucination. Evaluated across TabFact, WikiTableQuestions, and FeTaQA with eight LLMs, DataFactory reports large average gains over baselines (≈+20.2% TabFact, +23.9% WikiTQ) while trading higher token cost for clearer, multi-step reasoning and explainable provenance.
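The retrieval of historical QA examples mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the `HISTORY` list stands in for a vector DB, and all questions, queries, and function names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical store of past (question, query) pairs; a vector DB in practice.
HISTORY = [
    ("total sales by region in 2023",
     "SELECT region, SUM(sales) FROM orders WHERE year = 2023 GROUP BY region"),
    ("which suppliers ship to customers in berlin",
     "MATCH (s:Supplier)-[:SHIPS_TO]->(c:Customer {city: 'Berlin'}) RETURN s.name"),
    ("average order value per customer",
     "SELECT customer_id, AVG(total) FROM orders GROUP BY customer_id"),
]

def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k past examples most similar to the new question."""
    query_vec = embed(question)
    ranked = sorted(HISTORY, key=lambda ex: cosine(query_vec, embed(ex[0])), reverse=True)
    return ranked[:k]

def build_prompt(question: str, k: int = 2) -> str:
    """Prepend retrieved examples to the question as few-shot context."""
    lines = ["Answer using the examples as guidance."]
    for past_question, past_query in retrieve(question, k):
        lines.append(f"Example question: {past_question}")
        lines.append(f"Example query: {past_query}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

print(build_prompt("total sales by region in 2024", k=1))
```

Grounding the prompt in queries that already worked on the same tables is one plausible way such retrieval could reduce hallucinated column or relationship names.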

Problem Statement

Current LLM-based TableQA struggles with limited context length, hallucinations, and weak multi-hop relational reasoning. Single-agent pipelines mix tasks (query generation, retrieval, analysis) and lack specialization, so answers to questions that span both table contents and entity relationships are unreliable and hard to trace.

Main Contribution

A tripartite multi-agent architecture: Data Leader (planner), Database Team (SQL), and Knowledge Graph Team (Cypher/Neo4j) for complementary skills.

A formal data-to-knowledge-graph mapping (𝒯: D×S×R→G) and practical algorithms for entity extraction, ID generation, merging, and relationship discovery.
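To make the table-to-graph step concrete, here is a small sketch of entity extraction, ID generation, merging, and relationship discovery over table rows. It is an illustration under assumptions, not the paper's algorithm: the summary does not define the symbols in the mapping, so the inputs here (rows, a column-to-label map, and a list of relation rules) are hypothetical, as are the function names and the normalization rule used for merging.

```python
import hashlib

def entity_id(label: str, value: str) -> str:
    """Deterministic ID from label plus normalized value, so duplicates merge."""
    key = f"{label}:{value.strip().lower()}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def build_graph(rows, entity_columns, relations):
    """Turn table rows into (nodes, edges).

    rows            list of dicts, one per table row
    entity_columns  hypothetical map: column name -> node label
    relations       hypothetical rules: (source column, edge type, target column)
    """
    nodes = {}
    edges = set()
    for row in rows:
        ids = {}
        # Entity extraction and ID generation.
        for column, label in entity_columns.items():
            value = row.get(column)
            if value is None or not str(value).strip():
                continue  # skip empty cells
            node_id = entity_id(label, str(value))
            # Merging: the first spelling seen wins; later duplicates reuse the node.
            nodes.setdefault(node_id, {"label": label, "name": str(value).strip()})
            ids[column] = node_id
        # Relationship discovery from co-occurrence within a row.
        for source_col, edge_type, target_col in relations:
            if source_col in ids and target_col in ids:
                edges.add((ids[source_col], edge_type, ids[target_col]))
    return nodes, edges

rows = [
    {"employee": "Ada Lovelace", "department": "Research"},
    {"employee": "Grace Hopper", "department": "research "},
    {"employee": "ada lovelace", "department": "Research"},
]
nodes, edges = build_graph(
    rows,
    entity_columns={"employee": "Employee", "department": "Department"},
    relations=[("employee", "WORKS_IN", "department")],
)
print(len(nodes), "nodes,", len(edges), "edges")
```

Case and whitespace normalization collapses the three rows into three nodes and two edges. The same normalization is where over-merging can creep in on noisy tables, since two distinct entities with the same normalized name would share an ID.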

Key Findings

Multi-agent DataFactory significantly improves accuracy on standard TableQA benchmarks versus baselines.

Numbers: TabFact avg 84.0% (↑20.2% over baselines)

Practical Use: If you replace single-agent TableQA with DataFactory, expect large accuracy gains on verification-style table tasks; useful when correctness matters more than prompt cost.

Evidence Ref: Table 3; RQ1

Knowledge Graph team provides consistent gains when added to SQL-only pipelines.

Numbers: Avg improvements: TabFact +5.5%, WikiTQ +14.4%, FeTaQA ROUGE-2 +17.1%

Practical Use: Add a lightweight KG layer for multi-hop or relationship-heavy queries to improve difficult retrieval and free-form answers.

Evidence Ref: Table 6; RQ4 ablation

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 84.0% | Other methods (average) | +20.2% vs baselines | TabFact | Table 3 shows DataFactory average 84.0% with +20.2% improvement | Table 3; RQ1 |
| Accuracy | 72.8% | Other methods (average) | +23.9% vs baselines | WikiTableQuestions | Table 3 shows DataFactory average 72.8% with +23.9% improvement | Table 3; RQ1 |

What To Try In 7 Days

Run DataFactory on one critical table: compare SQL-only answers to DataFactory output for 20 representative queries.

Build a tiny knowledge graph (1–3 tables) and run a set of multi-hop questions to measure KG gains.

Log token use and set a default 1–3 call limit; measure accuracy vs cost to set a production stopping rule.
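The last step above, measuring accuracy against cost to set a stopping rule, could be scripted along these lines. This is a sketch under assumptions: the log format, the `tolerance` threshold, and the function names are hypothetical, and the `runs` values are placeholders for illustration only, not results from the paper.

```python
from collections import defaultdict

def summarize(runs):
    """Aggregate accuracy and mean token use per call limit."""
    buckets = defaultdict(list)
    for run in runs:
        buckets[run["call_limit"]].append(run)
    summary = {}
    for limit, items in sorted(buckets.items()):
        summary[limit] = {
            "accuracy": sum(r["correct"] for r in items) / len(items),
            "mean_tokens": sum(r["tokens"] for r in items) / len(items),
        }
    return summary

def pick_call_limit(summary, tolerance=0.02):
    """Smallest call limit whose accuracy is within `tolerance` of the best."""
    best = max(stats["accuracy"] for stats in summary.values())
    for limit in sorted(summary):
        if summary[limit]["accuracy"] >= best - tolerance:
            return limit

# Placeholder log entries (illustrative values only).
runs = [
    {"call_limit": 1, "correct": True,  "tokens": 1200},
    {"call_limit": 1, "correct": False, "tokens": 1100},
    {"call_limit": 1, "correct": False, "tokens": 1300},
    {"call_limit": 1, "correct": True,  "tokens": 1000},
    {"call_limit": 2, "correct": True,  "tokens": 2400},
    {"call_limit": 2, "correct": True,  "tokens": 2200},
    {"call_limit": 2, "correct": True,  "tokens": 2600},
    {"call_limit": 2, "correct": False, "tokens": 2000},
    {"call_limit": 3, "correct": True,  "tokens": 3600},
    {"call_limit": 3, "correct": True,  "tokens": 3400},
    {"call_limit": 3, "correct": True,  "tokens": 3800},
    {"call_limit": 3, "correct": False, "tokens": 3200},
]

summary = summarize(runs)
for limit, stats in summary.items():
    print(limit, stats)
print("chosen call limit:", pick_call_limit(summary))
```

Picking the smallest limit within a tolerance of the best accuracy is one simple rule; a cost-weighted score would be a reasonable alternative if token budgets are tight.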

Agent Features

Memory: Historical QA stored as vector embeddings for retrieval; session history for multi-turn clarification

Planning: ReAct paradigm (reason + act loops); three-stage explore-verify-analyze; adaptive step-count planning

Tool Use: SQL execution; Cypher queries (Neo4j); vector DB retrieval of historical QA; LLM prompting for planning and synthesis

Frameworks: ReAct; retrieval-augmented prompting

Is Agentic: Yes

Architectures: Tripartite (Data Leader + Database Team + Knowledge Graph Team)

Collaboration: Natural-language consultation between agents; dynamic team dispatch based on capability and evidence; conflict arbitration via provenance checks

Optimization Features

Token Efficiency: Measured token cost of 3,464 (TabFact) and 4,982 (WikiTQ); trade-off of higher token use for clearer multi-step reasoning

Infra Optimization: Dual deployment modes (local open-source or cloud API); vector DB for efficient semantic retrieval

System Optimization: Modular architecture supports local or cloud deployment; batch Neo4j ingestion and in-memory construction for KG building

Inference Optimization: Adaptive stopping to limit multi-agent calls (1–3 recommended); SQL/Cypher-only execution (avoids heavy code execution)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Higher token and interaction cost compared to prompt-only methods

Automated KG construction can fail or over-merge on noisy, inconsistent tables

When Not To Use

When single-step SQL answers are sufficient and latency/cost must be minimal

When strict data privacy forbids cloud LLM APIs and no local LLM is viable

Failure Modes

Excessive multi-agent calls causing error accumulation and performance collapse

Specification violations when role boundaries are not enforced
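One way to guard against the role-boundary failure mode is to check each team's output before it is executed. The sketch below is a coarse, hypothetical check and not part of DataFactory as described: the allowed prefixes, forbidden keyword sets, and function name are all assumptions, and keyword matching of this kind can misfire on string literals, so it is a first filter rather than a parser.

```python
import re

# Hypothetical policy: each team may only emit read-only queries in its own language.
ALLOWED_START = {
    "database": ("SELECT", "WITH"),
    "knowledge_graph": ("MATCH", "OPTIONAL MATCH"),
}
FORBIDDEN = {
    "database": {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER"},
    "knowledge_graph": {"CREATE", "MERGE", "DELETE", "SET", "REMOVE", "DROP"},
}

def check_role(team: str, query: str) -> tuple[bool, str]:
    """Return (ok, reason) for a query proposed by the given team."""
    upper = query.strip().upper()
    if not upper.startswith(ALLOWED_START[team]):
        return False, f"{team} query must start with one of {ALLOWED_START[team]}"
    # Tokenize into keyword-like words; 'UPDATED_AT' stays one token, not 'UPDATE'.
    words = set(re.findall(r"[A-Z_]+", upper))
    blocked = sorted(words & FORBIDDEN[team])
    if blocked:
        return False, f"{team} query uses forbidden keyword(s): {', '.join(blocked)}"
    return True, "ok"

print(check_role("database", "SELECT name FROM users"))
print(check_role("database", "MATCH (n) RETURN n"))
print(check_role("knowledge_graph", "MATCH (n:User) DETACH DELETE n"))
```

Rejecting an out-of-role query and returning the reason to the planner gives it a chance to re-dispatch instead of silently accumulating errors.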

Core Entities

Models

Claude 4.0 Sonnet, Gemini 2.5 Flash, GPT-4o mini, DeepSeek-V3, Qwen3-235B-A22B, Qwen3-32B, Qwen3-30B-A3B, Qwen3-14B

Metrics

Accuracy, Exact Match, ROUGE-1, ROUGE-2, ROUGE-L, Token usage

Datasets

TabFact, WikiTableQuestions, FeTaQA

Benchmarks

TabFact, WikiTableQuestions, FeTaQA

Context Entities

Models

TAPAS, TAPEX

Metrics

Accuracy, ROUGE

Datasets

TabFact, WikiTableQuestions, FeTaQA

Benchmarks

TabFact, WikiTableQuestions, FeTaQA