Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
If your product needs explainable insights from multiple tables, this benchmark and evaluator help you measure end-to-end retrieval and analysis performance and show where current LLMs fail to ground their claims in the tables.
Summary TLDR
The authors release MT-RAIG BENCH, a large-scale dataset (18.5k test examples) for retrieval-augmented insight generation across multiple tables, and MT-RAIG EVAL, a decomposition-based automatic evaluator that matches human judgments of faithfulness and completeness more closely than prior automatic metrics. Experiments show that current retrievers and LLMs struggle: retrieval involves a trade-off between coverage and noise, noisy tables hurt factual grounding, and even top LLMs reach only ~40% faithfulness and ~60% completeness when given the gold tables. Practical takeaways: use DPR-style dense retrieval, keep the retrieved tables within the model's context window, and evaluate with MT-RAIG EVAL for closer alignment with human judgment.
Problem Statement
Existing table-reasoning benchmarks assume that a single gold table is given. In practice, users need insights that draw on evidence from several tables that must first be retrieved, and they need reliable automatic checks of faithfulness and completeness.
Main Contribution
MT-RAIG BENCH: a large benchmark (18,532 test examples) for retrieval-augmented insight generation across multiple tables.
MT-RAIG EVAL: a decomposition-based automatic evaluator that breaks insights into table-linked claims and question-aware topics to score faithfulness and completeness.
Comprehensive baseline study showing retrieval and generation gaps and characterizing noise, scaling, and retriever effects.
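The decomposition idea behind MT-RAIG EVAL can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that faithfulness is the fraction of decomposed claims a verifier judges supported by the tables, and that completeness is the fraction of question-aware topics the insight covers. The function names and the toy substring checks are hypothetical stand-ins for the LLM-based decomposer and verifier.

```python
from typing import Callable, Sequence

# Hypothetical signatures; in practice these would wrap LLM calls.
ClaimVerifier = Callable[[str, Sequence[str]], bool]  # (claim, tables) -> supported?
TopicChecker = Callable[[str, str], bool]             # (topic, insight) -> covered?


def faithfulness(claims: Sequence[str], tables: Sequence[str],
                 is_supported: ClaimVerifier) -> float:
    """Fraction of claims judged supported by the tables (assumed definition)."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_supported(c, tables)) / len(claims)


def completeness(topics: Sequence[str], insight: str,
                 is_covered: TopicChecker) -> float:
    """Fraction of topics that the insight covers (assumed definition)."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if is_covered(t, insight)) / len(topics)


# Toy stand-ins for the verifier: plain case-insensitive substring matching.
def toy_supported(claim: str, tables: Sequence[str]) -> bool:
    return any(claim.lower() in table.lower() for table in tables)


def toy_covered(topic: str, insight: str) -> bool:
    return topic.lower() in insight.lower()


# Illustrative data only; not taken from the benchmark.
tables = ["region: north, revenue: 120", "region: south, revenue: 80"]
claims = ["revenue: 120", "revenue: 95"]  # the second claim is ungrounded
insight = "North revenue is 120 while south revenue is 95."
topics = ["north", "south", "growth"]

faith = faithfulness(claims, tables, toy_supported)
comp = completeness(topics, insight, toy_covered)
print(f"faithfulness={faith:.2f} completeness={comp:.2f}")
```

Passing the verifier in as a callable keeps the scoring logic independent of the backbone LLM, which makes it easier to swap backbones and check how much the scores depend on them.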
Key Findings
Dataset scale and structure
MT-RAIG EVAL aligns better with humans than prior automatic metrics
General-purpose dense retrieval (DPR) outperforms table-specific retrievers on this task
Generation remains weak on faithfulness even with gold tables
Adding more retrieved tables helps up to a point, then hurts
Results
MT-RAIG EVAL Pearson correlation with human preference
Multi-table retrieval recall (top-10)
Model generation performance (closed-domain, gold tables)
Benchmark size and complexity
Who Should Care
What To Try In 7 Days
Run DPR-style dense retrieval with top-10 tables and evaluate generation with MT-RAIG EVAL to compare systems quickly
Tune the number of retrieved tables (k) and measure faithfulness vs completeness to find the sweet spot for your model
Add an automatic claim-level verifier (table-aware decomposition) to catch ungrounded facts before user exposure
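The retrieval side of these experiments can be measured with a short Recall@k sweep. The sketch below assumes that multi-table recall is the fraction of a question's gold tables found in the top-k results, averaged over questions. The benchmark's exact definition may differ, and the table IDs and rankings are made-up illustrative data.

```python
from typing import Dict, List, Sequence


def recall_at_k(ranked: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Fraction of gold tables that appear among the top-k retrieved tables."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    top_k = set(ranked[:k])
    return len(gold_set & top_k) / len(gold_set)


def mean_recall_at_k(runs: List[Dict[str, List[str]]], k: int) -> float:
    """Average Recall@k over all questions."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r["ranked"], r["gold"], k) for r in runs) / len(runs)


# Illustrative retrieval output: one ranked list and one gold set per question.
runs = [
    {"ranked": ["t1", "t7", "t3", "t9"], "gold": ["t1", "t3"]},
    {"ranked": ["t4", "t2", "t8", "t5"], "gold": ["t5", "t2"]},
]

# Sweep k to see how recall grows as more tables are passed to the generator.
recall_by_k = {k: mean_recall_at_k(runs, k) for k in (1, 2, 4)}
for k, value in recall_by_k.items():
    print(f"Recall@{k} = {value:.2f}")
```

To find the sweet spot for k, record generation faithfulness and completeness at each k alongside recall, since higher recall also brings in more irrelevant tables.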
Agent Features
Tool Use
- LLM-based annotator and verifier used for annotation and evaluation
Optimization Features
Training Optimization
- LoRA
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- MT-RAIG BENCH relies partly on machine-generated questions and insights; this can reduce linguistic diversity or bias the data toward artifacts of the generator model.
- Dataset covers relational DB and Wikipedia tables only; performance in specialized domains (finance, medical) is untested.
- MT-RAIG EVAL uses an LLM-based decomposer and verifier, so it can inherit the biases of its backbone model despite reproducibility checks.
When Not To Use
- When you only need single-table fact extraction (single-table TQA)
- For high-stakes production without an added human or programmatic verifier for claims
- When you require domain-specific tables not covered by the benchmark (e.g., medical tables) without validation
Failure Modes
- Models hallucinate facts when irrelevant tables are present or retrieval misses key tables
- Retrieval noise reduces faithfulness even if completeness stays acceptable
- Automatic evaluation can be biased by the backbone LLM used for MT-RAIG EVAL
Core Entities
Models
- DPR
- BM25
- Contriever
- DTR
- TableLlama
- o3-mini
- GPT-4o
- Claude 3.5 Sonnet
- DeepSeek-R1
- Qwen2-7B
- Gemma-7B
- Llama 3.1-8B
- Mistral-7B
- Chain-of-Table
- TaPERA
- Dater
Metrics
- MT-RAIG EVAL
- SacreBLEU
- ROUGE-L
- METEOR
- BERTScore
- A3CU
- TAPAS-Acc
- G-Eval
- Recall@k
Datasets
- MT-RAIG BENCH (this work)
- SPIDER
- Open-WikiTable
Benchmarks
- MT-RAIG BENCH

