Overview
The benchmark and evaluator are research-ready and useful for comparing systems. Generation faithfulness remains low, however, so add a verification step before production deployment.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
If your product needs explainable insights from multiple tables, this benchmark and evaluator help measure real-world retrieval+analysis performance and reveal where current LLMs fail to ground facts.
Summary TLDR
The authors release MT-RAIG BENCH, a large-scale dataset (18.5k test examples) for retrieval-augmented insight generation across multiple tables, and MT-RAIG EVAL, a decomposition-based automatic evaluator that better matches human judgments for faithfulness and completeness. Experiments show current retrievers and LLMs struggle: retrieval trade-offs and noisy tables hurt factual grounding, and even top LLMs reach only ~40% faithfulness and ~60% completeness on gold tables. Use DPR-style retrieval, limit retrieved tables to the model's context window, and measure with MT-RAIG EVAL for closer alignment to humans.
Problem Statement
Existing table-reasoning benchmarks assume that a single gold table is given. Real users usually need insights that draw on evidence from multiple tables that are not known in advance, and they need reliable automatic checks for faithfulness and completeness.
Main Contribution
MT-RAIG BENCH: a large benchmark (18,532 test examples) for retrieval-augmented insight generation across multiple tables.
MT-RAIG EVAL: a decomposition-based automatic evaluator that breaks insights into table-linked claims and question-aware topics to score faithfulness and completeness.
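As a rough illustration of the decomposition idea, the sketch below aggregates claim-level and topic-level verdicts into two scores. It assumes the decomposition and verification have already been done upstream (for example by an LLM judge). The `Claim` and `Topic` types, their boolean verdict fields, and the simple fraction-based aggregation are hypothetical and are not taken from the MT-RAIG EVAL implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Claim:
    """One table-linked claim extracted from a generated insight (hypothetical type)."""
    text: str
    supported: bool  # assumed verdict: is the claim grounded in the source tables?


@dataclass
class Topic:
    """One question-aware topic the insight is expected to address (hypothetical type)."""
    text: str
    covered: bool  # assumed verdict: does the insight address this topic?


def faithfulness(claims: Sequence[Claim]) -> float:
    """Fraction of claims judged as supported; 0.0 when there are no claims."""
    if not claims:
        return 0.0
    return sum(c.supported for c in claims) / len(claims)


def completeness(topics: Sequence[Topic]) -> float:
    """Fraction of expected topics judged as covered; 0.0 when there are no topics."""
    if not topics:
        return 0.0
    return sum(t.covered for t in topics) / len(topics)
```

Keeping the two scores separate makes it possible to see when an insight covers the right topics but states facts that the tables do not support.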
Key Findings
Dataset scale and structure
MT-RAIG EVAL aligns better with humans than prior automatic metrics
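Agreement between an automatic metric and human judgments is commonly summarized with a Pearson correlation. The sketch below is a plain-Python version of that statistic, assuming the metric scores and human ratings are paired lists of numbers; the function name and input layout are illustrative and do not come from the paper's code.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between paired metric scores and human ratings."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

The function returns a value in [-1, 1]; multiply by 100 if you want to report it on a 0 to 100 scale.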
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MT-RAIG EVAL Pearson correlation with human preference | Faithfulness 64.94; Completeness 67.67 | G-Eval (Faithfulness 47.82) | Faithfulness +17.12 vs G-Eval | Meta-evaluation (250 response pairs) | MT-RAIG EVAL shows higher Pearson correlation with human judgments | Table 5 |
| Multi-table retrieval recall (top-10) | DPR R@10 = 80.83% | DTR R@10 = 74.50% | DPR +6.33 pp vs DTR | MT-RAIG BENCH retrieval test | General dense retriever (DPR) achieved highest top-10 recall | Table 19 |
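
To make the retrieval metric concrete, here is a minimal sketch of recall at k for multi-table retrieval. It assumes recall is the fraction of a question's gold tables found among the top-k retrieved tables, averaged over questions; the benchmark's exact definition may differ, and the function names and table IDs are hypothetical.

```python
from typing import Sequence, Tuple


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Sequence[str], k: int = 10) -> float:
    """Fraction of gold tables that appear in the top-k retrieved tables."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("each question needs at least one gold table")
    top_k = set(ranked_ids[:k])
    return len(top_k & gold) / len(gold)


def mean_recall_at_k(
    runs: Sequence[Tuple[Sequence[str], Sequence[str]]], k: int = 10
) -> float:
    """Average recall@k over (ranked_ids, gold_ids) pairs, one pair per question."""
    if not runs:
        raise ValueError("no questions to evaluate")
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)
```

Because each question can have several gold tables, a retriever can score well on average while still missing one of the tables a particular question depends on, so it is worth inspecting per-question recall as well.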
What To Try In 7 Days
Run DPR-style dense retrieval with top-10 tables and evaluate generation with MT-RAIG EVAL to compare systems quickly
Tune the number of retrieved tables (k) and measure faithfulness vs completeness to find the sweet spot for your model
Add an automatic claim-level verifier (table-aware decomposition) to catch ungrounded facts before user exposure
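
The k-tuning step above can be sketched as a simple sweep. This assumes you already have an `evaluate(k)` callable (hypothetical) that runs retrieval and generation with k tables and returns a `(faithfulness, completeness)` pair, each in [0, 1]. Ranking by harmonic mean is one possible balancing criterion chosen for illustration, not something the paper prescribes.

```python
from typing import Callable, Dict, Iterable, Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0.0 if both are zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def sweep_k(
    evaluate: Callable[[int], Tuple[float, float]],
    ks: Iterable[int],
) -> Tuple[int, Dict[int, Tuple[float, float, float]]]:
    """Evaluate each k and return the k with the best balanced score.

    `evaluate` is a user-supplied (hypothetical) function that returns
    (faithfulness, completeness) for a given number of retrieved tables.
    """
    results: Dict[int, Tuple[float, float, float]] = {}
    for k in ks:
        faith, comp = evaluate(k)
        results[k] = (faith, comp, harmonic_mean(faith, comp))
    if not results:
        raise ValueError("ks must contain at least one value")
    # Pick the k whose balanced score (third field) is highest.
    best_k = max(results, key=lambda k: results[k][2])
    return best_k, results
```

Returning the full `results` dictionary alongside the best k lets you plot faithfulness against completeness and choose a different trade-off if your product weights one more heavily.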
Risks & Boundaries
Limitations
MT-RAIG BENCH relies partly on machine-generated questions and insights; this can reduce linguistic diversity or bias the data toward artifacts of the generator.
Dataset covers relational DB and Wikipedia tables only; performance in specialized domains (finance, medical) is untested.
When Not To Use
When you only need single-table fact extraction (single-table TQA)
For high-stakes production without an added human or programmatic verifier for claims
Failure Modes
Models hallucinate facts when irrelevant tables are present or retrieval misses key tables
Retrieval noise reduces faithfulness even if completeness stays acceptable

