Split TableQA into a Data Leader plus Database and Knowledge-Graph teams to cut hallucinations and boost multi-hop answers

Overview

Decision Snapshot: Needs Validation

The system shows consistent, statistically significant accuracy gains on benchmarks, but raises token and engineering costs; adapt call limits and KG scope for production.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Tong Wang, Chi Jin, Yongkang Chen, Huan Deng, Xiaohui Kuang, Gang Zhao

Why It Matters For Business

DataFactory trades higher query cost for much better accuracy and explainability on complex table queries, making it useful for teams that need reliable multi-hop analytics and traceable evidence from enterprise tables.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

DataFactory is a multi-agent TableQA system: a central Data Leader (using ReAct-style reasoning) coordinates a Database Team (SQL) and a Knowledge Graph Team (Cypher/Neo4j). It builds a knowledge graph from tables, stores historical QA examples in a vector DB for retrieval, and uses context-engineered prompts to reduce hallucination. Evaluated across TabFact, WikiTableQuestions, and FeTaQA with eight LLMs, DataFactory reports large average gains over baselines (≈+20.2% TabFact, +23.9% WikiTQ) while trading higher token cost for clearer, multi-step reasoning and explainable provenance.

Problem Statement

Current LLM-based TableQA struggles with limited context length, hallucinations, and weak multi-hop relational reasoning. Single-agent pipelines mix tasks (query generation, retrieval, analysis) in one agent without specialization, so questions that combine table lookups with relationships between entities are answered unreliably and are hard to trace.

Main Contribution

A tripartite multi-agent architecture: Data Leader (planner), Database Team (SQL), and Knowledge Graph Team (Cypher/Neo4j) for complementary skills.

A formal data-to-knowledge-graph mapping (𝒯: D×S×R→G) and practical algorithms for entity extraction, ID generation, merging, and relationship discovery.
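The mapping above can be illustrated with a minimal sketch. It assumes D is a dict of tables, S gives each table's entity label and key column, and R lists foreign-key style links. These readings, and every function and field name below, are hypothetical and not taken from the paper. IDs are derived by hashing a normalized key, so near-duplicate rows merge into one node.

```python
import hashlib


def entity_id(label: str, key: str) -> str:
    """Deterministic ID: the same label and normalized key always give the same node."""
    normalized = " ".join(key.lower().split())
    return hashlib.sha1(f"{label}:{normalized}".encode("utf-8")).hexdigest()[:12]


def build_graph(tables, schema, relations):
    """tables: {name: [row dict]}; schema: {name: (label, key column)};
    relations: [(source table, foreign-key column, target table, edge type)]."""
    nodes = {}
    edges = set()
    # Entity extraction and merging: one node per distinct normalized key.
    for table, rows in tables.items():
        label, key_col = schema[table]
        for row in rows:
            node_id = entity_id(label, str(row[key_col]))
            node = nodes.setdefault(node_id, {"label": label, "props": {}})
            node["props"].update(row)
    # Relationship discovery: follow foreign-key values to existing nodes only.
    for src_table, fk_col, dst_table, edge_type in relations:
        src_label, src_key = schema[src_table]
        dst_label, _ = schema[dst_table]
        for row in tables[src_table]:
            src = entity_id(src_label, str(row[src_key]))
            dst = entity_id(dst_label, str(row[fk_col]))
            if dst in nodes:
                edges.add((src, edge_type, dst))
    return nodes, edges


# Made-up example data, for illustration only.
tables = {
    "players": [
        {"name": "Ana Ruiz", "team": "Lions"},
        {"name": "ana  ruiz", "team": "Lions", "goals": 7},
        {"name": "Bo Chen", "team": "Hawks"},
    ],
    "teams": [{"team": "Lions", "city": "Porto"}],
}
schema = {"players": ("Player", "name"), "teams": ("Team", "team")}
relations = [("players", "team", "teams", "PLAYS_FOR")]
nodes, edges = build_graph(tables, schema, relations)
```

The two spellings of the same player collapse into one node, and the row pointing at a team with no entry in the teams table produces no edge. The same normalization that merges harmless variants can also over-merge distinct entities on noisy tables, which is why the key column and normalization rule deserve review per dataset.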

Key Findings

Multi-agent DataFactory significantly improves accuracy on standard TableQA benchmarks versus baselines.

Numbers: TabFact avg 84.0% (↑20.2% over baselines)

Practical Use: If you replace single-agent TableQA with DataFactory, expect large accuracy gains on verification-style table tasks; useful when correctness matters more than prompt cost.

Evidence Ref: Table 3; RQ1

Knowledge Graph team provides consistent gains when added to SQL-only pipelines.

Numbers: Avg improvements: TabFact +5.5%, WikiTQ +14.4%, FeTaQA ROUGE-2 +17.1%

Practical Use: Add a lightweight KG layer for multi-hop or relationship-heavy queries to improve difficult retrieval and free-form answers.

Evidence Ref: Table 6; RQ4 ablation

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	84.0%	other methods average	↑20.2% vs baselines	TabFact	Table 3 shows DataFactory average 84.0% with +20.2% improvement	Table 3; RQ1
Accuracy	72.8%	other methods average	↑23.9% vs baselines	WikiTableQuestions	Table 3 shows DataFactory average 72.8% with +23.9% improvement	Table 3; RQ1

What To Try In 7 Days

Run DataFactory on one critical table: compare SQL-only answers to DataFactory output for 20 representative queries.

Build a tiny knowledge graph (1–3 tables) and run a set of multi-hop questions to measure KG gains.

Log token use and set a default 1–3 call limit; measure accuracy vs cost to set a production stopping rule.
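One way to turn the logged runs from these trials into a stopping rule is sketched below. The record fields (`call_limit`, `correct`, `tokens`), the `min_gain` threshold, and the sample numbers are all placeholders chosen for illustration, not measurements from the paper.

```python
from statistics import mean


def summarize(runs):
    """Group logged runs by call limit and report accuracy and average tokens."""
    by_limit = {}
    for run in runs:
        by_limit.setdefault(run["call_limit"], []).append(run)
    return {
        limit: {
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in group),
            "avg_tokens": mean(r["tokens"] for r in group),
        }
        for limit, group in sorted(by_limit.items())
    }


def pick_limit(summary, min_gain=0.02):
    """Largest call limit reached before the accuracy gain drops below min_gain."""
    limits = sorted(summary)
    chosen = limits[0]
    for prev, nxt in zip(limits, limits[1:]):
        if summary[nxt]["accuracy"] - summary[prev]["accuracy"] < min_gain:
            break
        chosen = nxt
    return chosen


# Placeholder log: four queries per call limit, synthetic values.
outcomes = {
    1: [True, False, False, True],
    2: [True, True, False, True],
    3: [True, True, False, True],
}
runs = [
    {"call_limit": limit, "correct": ok, "tokens": 1000 * limit}
    for limit, results in outcomes.items()
    for ok in results
]
summary = summarize(runs)
recommended = pick_limit(summary)
```

With this synthetic log, a third call adds tokens without adding accuracy, so the rule settles on two calls. On real data the threshold should reflect what an extra point of accuracy is worth relative to the token cost.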

Agent Features

Memory

Historical QA stored as vector embeddings for retrieval

Session history for multi-turn clarification
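The retrieval side of this memory can be sketched as a nearest-neighbour lookup over stored questions. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and `QAMemory` is a hypothetical in-process substitute for a vector DB; neither reflects the system's actual components.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class QAMemory:
    """Stores past question/answer pairs and returns the most similar ones."""

    def __init__(self):
        self.items = []

    def add(self, question, answer):
        self.items.append((embed(question), question, answer))

    def retrieve(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        return [(question, answer) for _, question, answer in ranked[:k]]


memory = QAMemory()
memory.add("which team scored the most goals",
           "SELECT team FROM results ORDER BY goals DESC LIMIT 1")
memory.add("how many players are on each team",
           "SELECT team, COUNT(*) FROM players GROUP BY team")
top = memory.retrieve("which team scored the fewest goals", k=1)
```

The retrieved pair would then be placed in the prompt as a worked example for the new question.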

Planning

ReAct paradigm (reason + act loops)

Three-stage explore-verify-analyze

Adaptive step-count planning
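A reason-act loop with an adaptive step count might look like the sketch below. The routing rule, the stub teams, and the sufficiency check are hypothetical simplifications; the sketch does not model the explore-verify-analyze stages individually and is not the paper's planner.

```python
def choose_team(trace):
    """Hypothetical routing rule: try SQL first, then the knowledge graph."""
    consulted = {team for team, _ in trace}
    return "database" if "database" not in consulted else "knowledge_graph"


def data_leader(question, teams, is_sufficient, max_calls=3):
    """Loop until the evidence is judged sufficient or the call budget runs out."""
    trace = []
    for _ in range(max_calls):
        team = choose_team(trace)            # reason: decide the next action
        observation = teams[team](question)  # act: consult the chosen team
        trace.append((team, observation))    # observe: record the evidence
        if is_sufficient(trace):             # adaptive stop before the cap
            break
    return trace


# Stub teams standing in for real SQL and Cypher agents.
teams = {
    "database": lambda q: f"SQL rows for: {q}",
    "knowledge_graph": lambda q: f"graph paths for: {q}",
}


def two_sources_consulted(trace):
    """Placeholder sufficiency check: stop once both teams have answered."""
    return len({team for team, _ in trace}) >= 2


trace = data_leader("Which city hosts the top scorer's team?", teams, two_sources_consulted)
```

Here the loop stops after two calls even though three are allowed, which is the behaviour the recommended 1–3 call limit is meant to encourage.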

Tool Use

SQL execution

Cypher queries (Neo4j)

Vector DB retrieval of historical QA

LLM prompting for planning and synthesis
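As an illustration of the SQL execution tool, the sketch below loads a small table into an in-memory SQLite database and runs a query against it. SQLite, the table name `t`, and the helper `run_sql` are assumptions made for a self-contained example; the source does not say which SQL engine the Database Team uses.

```python
import sqlite3


def run_sql(columns, rows, query):
    """Load rows into a throwaway in-memory table named t and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        column_defs = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f"CREATE TABLE t ({column_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO t VALUES ({placeholders})", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Made-up table contents.
columns = ["team", "goals"]
rows = [("Lions", 3), ("Hawks", 5), ("Owls", 4)]
result = run_sql(columns, rows, "SELECT team FROM t ORDER BY goals DESC LIMIT 1")
```

In a real deployment the generated query should be validated or run with read-only permissions before execution, since it comes from an LLM.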

Frameworks

ReAct

Retrieval-augmented prompting

Is Agentic

Yes

Architectures

Tripartite: Data Leader + Database Team + Knowledge Graph Team

Collaboration

Natural-language consultation between agents

Dynamic team dispatch based on capability and evidence

Conflict arbitration via provenance checks
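Provenance-based arbitration can be pictured as checking each team's cited cells against the source table and keeping only answers whose evidence holds. The candidate format and the tie-breaking rule below are assumptions for illustration; the source does not describe the arbitration procedure at this level of detail.

```python
def verified_count(candidate, table):
    """Count cited cells that match the source table exactly."""
    return sum(
        1
        for row_index, column, value in candidate["evidence"]
        if 0 <= row_index < len(table) and table[row_index].get(column) == value
    )


def arbitrate(candidates, table):
    """Return the fully verified candidate with the most evidence, or None."""
    verified = [
        c for c in candidates
        if c["evidence"] and verified_count(c, table) == len(c["evidence"])
    ]
    return max(verified, key=lambda c: len(c["evidence"]), default=None)


# Made-up table and conflicting answers.
table = [{"team": "Lions", "goals": 3}, {"team": "Hawks", "goals": 5}]
candidates = [
    {"team": "database", "answer": "Hawks", "evidence": [(1, "goals", 5)]},
    {"team": "knowledge_graph", "answer": "Lions", "evidence": [(0, "goals", 6)]},
]
winner = arbitrate(candidates, table)
```

The second answer cites a value that does not appear in the table, so it is rejected. Returning `None` when nothing verifies gives the leader a signal to ask for clarification rather than guess.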

Optimization Features

Token Efficiency

Measured token cost: TabFact 3,464; WikiTQ 4,982

Trade-off: higher tokens but clearer multi-step reasoning

Infra Optimization

Dual deployment modes: local open-source or cloud API

Vector DB for efficient semantic retrieval

System Optimization

Modular architecture supports local or cloud deployment

Batch Neo4j ingestion and in-memory construction for KG building

Inference Optimization

Adaptive stopping to limit multi-agent calls (recommended 1–3)

Streamlined use of SQL/Cypher only (avoid heavy code execution)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://wisdomindata.netlify.app/ (demo; code to be released later)

Risks & Boundaries

Limitations

Higher token and interaction cost compared to prompt-only methods

Automated KG construction can fail or over-merge on noisy, inconsistent tables

When Not To Use

When single-step SQL answers are sufficient and latency/cost must be minimal

When strict data privacy forbids cloud LLM APIs and no local LLM is viable

Failure Modes

Excessive multi-agent calls causing error accumulation and performance collapse

Specification violations when role boundaries are not enforced

Core Entities

Models

Claude 4.0 Sonnet, Gemini 2.5 Flash, GPT-4o mini, Deepseek-V3, Qwen3-235B-A22B, Qwen3-32B, Qwen3-30B-A3B, Qwen3-14B

Metrics

Accuracy, Exact Match, ROUGE-1, ROUGE-2, ROUGE-L, Token usage

Datasets

TabFact, WikiTableQuestions, FeTaQA

Benchmarks

TabFact, WikiTableQuestions, FeTaQA

Context Entities

Models

TAPAS, TAPEX

Metrics

Accuracy, ROUGE

Datasets

TabFact, WikiTableQuestions, FeTaQA

Benchmarks

TabFact, WikiTableQuestions, FeTaQA
