Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
DataFactory trades higher per-query token cost for higher reported accuracy and better explainability on complex table queries, which suits teams that need reliable multi-hop analytics and traceable evidence from enterprise tables.
Summary TLDR
DataFactory is a multi-agent TableQA system: a central Data Leader (using ReAct-style reasoning) coordinates a Database Team (SQL) and a Knowledge Graph Team (Cypher/Neo4j). It builds a knowledge graph from tables, stores historical QA examples in a vector DB for retrieval, and uses context-engineered prompts to reduce hallucination. Evaluated across TabFact, WikiTableQuestions, and FeTaQA with eight LLMs, DataFactory reports large average gains over baselines (≈+20.2% TabFact, +23.9% WikiTQ) while trading higher token cost for clearer, multi-step reasoning and explainable provenance.
Problem Statement
Current LLM-based TableQA struggles with limited context length, hallucinations, and weak multi-hop relational reasoning. Single-agent pipelines mix tasks (query generation, retrieval, analysis) in one agent and lack specialization, so questions that combine table lookups with relationships between entities become unreliable and hard to trace.
Main Contribution
A tripartite multi-agent architecture: Data Leader (planner), Database Team (SQL), and Knowledge Graph Team (Cypher/Neo4j) for complementary skills.
A formal data-to-knowledge-graph mapping (𝒯: D×S×R→G) and practical algorithms for entity extraction, ID generation, merging, and relationship discovery.
Context-engineered retrieval-augmented prompting (historical QA + schema + domain knowledge) to reduce hallucination in Text-to-SQL/Cypher.
Extensive empirical study across three TableQA benchmarks and eight LLMs, plus ablations on KG integration and collaboration frequency.
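As an illustration of the data-to-knowledge-graph mapping listed above, here is a minimal sketch that turns table rows into nodes and edges. It is not the paper's algorithm: reading D as rows, S as a column-to-label schema, and R as relation rules is an assumption, and the names `entity_id` and `build_graph`, the hash-based ID scheme, and the whitespace/case normalization used for merging are all hypothetical.

```python
import hashlib


def entity_id(label: str, key: str) -> str:
    """Deterministic ID from an entity label and a normalized key (hypothetical scheme)."""
    normalized = " ".join(key.lower().split())
    return hashlib.sha1(f"{label}:{normalized}".encode()).hexdigest()[:12]


def build_graph(rows, entity_columns, relations):
    """Map rows (D), a column-to-label schema (S), and relation rules (R) to a graph (G).

    rows:           list of dicts, one per table row
    entity_columns: {column_name: node_label}
    relations:      list of (source_column, relation_type, target_column)
    """
    nodes = {}
    edges = set()
    for row in rows:
        ids = {}
        for col, label in entity_columns.items():
            value = row.get(col)
            if value is None or str(value).strip() == "":
                continue  # skip empty cells instead of creating blank entities
            nid = entity_id(label, str(value))
            # Merging: rows whose keys normalize identically share one node.
            nodes.setdefault(nid, {"label": label, "name": str(value).strip()})
            ids[col] = nid
        for src, rel, dst in relations:
            if src in ids and dst in ids:
                edges.add((ids[src], rel, ids[dst]))
    return nodes, edges


rows = [
    {"player": "Ann Lee", "team": "Red"},
    {"player": "ann  lee", "team": "Blue"},  # same player, inconsistent spelling
    {"player": "", "team": "Red"},           # missing entity value
]
nodes, edges = build_graph(
    rows,
    {"player": "Player", "team": "Team"},
    [("player", "PLAYS_FOR", "team")],
)
```

Normalization this aggressive is also where over-merging can start: two different people with the same name would collapse into one node, so a real pipeline would likely need additional disambiguating keys.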
Key Findings
Multi-agent DataFactory improves accuracy on standard TableQA benchmarks versus baselines, with reported average gains of about +20.2% on TabFact and +23.9% on WikiTQ.
Knowledge Graph team provides consistent gains when added to SQL-only pipelines.
Collaboration frequency has an optimal range; too many interactions hurt performance.
DataFactory increases LLM token usage compared to prompt-only methods.
Results
- Accuracy (TabFact, WikiTableQuestions)
- FeTaQA ROUGE-2 F
- Average token usage (input + output)
Who Should Care
What To Try In 7 Days
Run DataFactory on one critical table: compare SQL-only answers to DataFactory output for 20 representative queries.
Build a tiny knowledge graph (1–3 tables) and run a set of multi-hop questions to measure KG gains.
Log token use and set a default 1–3 call limit; measure accuracy vs cost to set a production stopping rule.
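The third experiment can be sketched as a small log-and-summarize script. This is a hedged example, not part of DataFactory: the `RunLog` record, the `summarize` and `pick_limit` helpers, and the `min_gain` threshold are hypothetical, and the numbers in the sample logs are made up purely to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    call_limit: int   # maximum multi-agent calls allowed for this run
    correct: bool     # whether the answer matched the reference
    tokens: int       # input + output tokens consumed


def summarize(logs):
    """Aggregate accuracy and average token use per call limit."""
    buckets = {}
    for log in logs:
        b = buckets.setdefault(log.call_limit, {"n": 0, "correct": 0, "tokens": 0})
        b["n"] += 1
        b["correct"] += int(log.correct)
        b["tokens"] += log.tokens
    return {
        limit: {"accuracy": b["correct"] / b["n"], "avg_tokens": b["tokens"] / b["n"]}
        for limit, b in sorted(buckets.items())
    }


def pick_limit(summary, min_gain=0.02):
    """Largest call limit reached before the accuracy gain drops below min_gain."""
    limits = sorted(summary)
    chosen = limits[0]
    for prev, cur in zip(limits, limits[1:]):
        if summary[cur]["accuracy"] - summary[prev]["accuracy"] >= min_gain:
            chosen = cur
        else:
            break
    return chosen


# Illustrative, made-up logs for call limits 1 to 3.
logs = [
    RunLog(1, True, 1000), RunLog(1, False, 1200),
    RunLog(2, True, 2000), RunLog(2, True, 2200),
    RunLog(3, True, 3100), RunLog(3, True, 3300),
]
summary = summarize(logs)
stopping_rule = pick_limit(summary)
```

With real logs from the 20 representative queries, the chosen limit becomes the default stopping rule, and `avg_tokens` shows what each extra call costs.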
Agent Features
Memory
- Historical QA stored as vector embeddings for retrieval
- Session history for multi-turn clarification
Planning
- ReAct paradigm (reason + act loops)
- Three-stage explore-verify-analyze
- Adaptive step-count planning
Tool Use
- SQL execution
- Cypher queries (Neo4j)
- Vector DB retrieval of historical QA
- LLM prompting for planning and synthesis
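To make the retrieval step concrete, the sketch below looks up similar historical QA pairs and assembles them into a Text-to-SQL prompt together with the schema. It is a toy under stated assumptions: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for the vector DB, and `retrieve`, `build_prompt`, and the prompt layout are hypothetical rather than DataFactory's actual format.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, history, k=2):
    """Return the k historical examples most similar to the question."""
    q = embed(question)
    ranked = sorted(history, key=lambda ex: cosine(q, embed(ex["question"])), reverse=True)
    return ranked[:k]


def build_prompt(question, schema, history, k=2):
    """Assemble schema, retrieved examples, and the new question into one prompt."""
    lines = ["Schema:", schema, "", "Similar solved questions:"]
    for ex in retrieve(question, history, k):
        lines.append(f"Q: {ex['question']}")
        lines.append(f"SQL: {ex['sql']}")
    lines += ["", f"Q: {question}", "SQL:"]
    return "\n".join(lines)


history = [
    {"question": "total sales by region",
     "sql": "SELECT region, SUM(amount) FROM sales GROUP BY region"},
    {"question": "list all employees",
     "sql": "SELECT * FROM employees"},
]
prompt = build_prompt(
    "sales by region last year",
    "sales(region, amount, year)",
    history,
    k=1,
)
```

Grounding the prompt in the real schema and in queries that previously succeeded is the mechanism the summary credits with reducing hallucinated SQL and Cypher.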
Frameworks
- ReAct
- Retrieval-augmented prompting
Is Agentic
true
Architectures
- Tripartite: Data Leader + Database Team + Knowledge Graph Team
Collaboration
- Natural-language consultation between agents
- Dynamic team dispatch based on capability and evidence
- Conflict arbitration via provenance checks
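One way to picture dispatch and provenance-based arbitration is the sketch below. The summary does not spell out the actual rules, so everything here is an assumption: the `TeamAnswer` record, the `dispatch` and `arbitrate` helpers, and the tie-breaking rule (discard unsupported answers, then prefer the answer with the most evidence) are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class TeamAnswer:
    team: str
    answer: str
    provenance: list = field(default_factory=list)  # e.g. query text, row or node IDs


def dispatch(question, teams):
    """Send the question to each team; teams maps name -> callable(question) -> TeamAnswer."""
    return [run(question) for run in teams.values()]


def arbitrate(answers):
    """Pick an answer using provenance (hypothetical rule).

    Answers without any evidence are discarded. If the remaining answers agree,
    return the first; otherwise return the one with the most evidence items.
    """
    supported = [a for a in answers if a.provenance]
    if not supported:
        return None
    if len({a.answer for a in supported}) == 1:
        return supported[0]
    return max(supported, key=lambda a: len(a.provenance))


teams = {
    "database": lambda q: TeamAnswer("database", "42", ["SELECT COUNT(*) FROM games", "rows 1-42"]),
    "knowledge_graph": lambda q: TeamAnswer("knowledge_graph", "41", ["MATCH (g:Game) RETURN count(g)"]),
    "unsupported": lambda q: TeamAnswer("unsupported", "40"),
}
winner = arbitrate(dispatch("How many games were played?", teams))
```

Counting evidence items is a crude proxy; a real arbiter would more plausibly re-check the cited rows or nodes against the question.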
Optimization Features
Token Efficiency
- Measured average token usage (input + output): TabFact 3,464; WikiTQ 4,982
- Trade-off: higher tokens but clearer multi-step reasoning
Infra Optimization
- Dual deployment modes: local open-source or cloud API
- Vector DB for efficient semantic retrieval
System Optimization
- Modular architecture supports local or cloud deployment
- Batch Neo4j ingestion and in-memory construction for KG building
Inference Optimization
- Adaptive stopping to limit multi-agent calls (recommended 1–3)
- Tool use restricted to SQL and Cypher queries, avoiding heavier code execution
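Adaptive stopping with a call cap can be sketched as a bounded reason-and-act loop. This is an illustrative assumption about how such a cap might be enforced, not DataFactory's implementation: `run_with_budget`, the `step` callback contract, and the result dictionary are hypothetical, and `scripted_step` is a stand-in for an LLM-driven planner.

```python
def run_with_budget(step, question, max_calls=3):
    """Run a reason-and-act loop that stops on a final answer or when the budget is spent.

    step(question, observations) must return ("final", answer) or ("observe", observation).
    """
    observations = []
    for calls in range(1, max_calls + 1):
        kind, payload = step(question, observations)
        if kind == "final":
            return {"answer": payload, "calls": calls, "exhausted": False}
        observations.append(payload)
    # Budget spent without a final answer: report that instead of looping further.
    return {"answer": None, "calls": max_calls, "exhausted": True}


def scripted_step(question, observations):
    """Stand-in planner: gather one observation, then answer."""
    if not observations:
        return ("observe", "row count = 3")
    return ("final", "3 rows match")


result = run_with_budget(scripted_step, "How many rows match?", max_calls=3)
```

Capping the loop bounds both token cost and the error accumulation that the summary associates with excessive multi-agent calls.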
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Higher token and interaction cost compared to prompt-only methods
- Automated KG construction can fail or over-merge on noisy, inconsistent tables
- Performance depends on underlying LLM quality and may vary by model
- No production code has been released yet; the demo site promises a future release
When Not To Use
- When single-step SQL answers are sufficient and latency/cost must be minimal
- When strict data privacy forbids cloud LLM APIs and no local LLM is viable
- For tiny tables where added KG construction overhead outweighs benefit
- Real-time low-latency systems where multi-agent rounds cause unacceptable delay
Failure Modes
- Excessive multi-agent calls causing error accumulation and performance collapse
- Specification violations when role boundaries are not enforced
- Hallucinated SQL/Cypher generation if context engineering is incomplete
- KG merging conflicts creating incorrect entity links
Core Entities
Models
- Claude 4.0 Sonnet
- Gemini 2.5 Flash
- GPT-4o mini
- DeepSeek-V3
- Qwen3-235B-A22B
- Qwen3-32B
- Qwen3-30B-A3B
- Qwen3-14B
Metrics
- Accuracy
- Exact Match
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Token usage
Datasets
- TabFact
- WikiTableQuestions
- FeTaQA
Benchmarks
- TabFact
- WikiTableQuestions
- FeTaQA
Context Entities
Models
- TAPAS
- TAPEX
Metrics
- Accuracy
- ROUGE
Datasets
- TabFact
- WikiTableQuestions
- FeTaQA
Benchmarks
- TabFact
- WikiTableQuestions
- FeTaQA

