A large benchmark and finer evaluation method for generating grounded insights that pull evidence from multiple tables

February 17, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark and evaluator are research-ready and useful for system comparison. Generation faithfulness remains low, however, so add a verification step before production deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 50%

Novelty: 60%

Authors

Kwangwook Seo, Donguk Kwon, Dongha Lee

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs explainable insights from multiple tables, this benchmark and evaluator help measure real-world retrieval+analysis performance and reveal where current LLMs fail to ground facts.

Who Should Care

Summary TLDR

The authors release MT-RAIG BENCH, a large-scale dataset (18.5k test examples) for retrieval-augmented insight generation across multiple tables, and MT-RAIG EVAL, a decomposition-based automatic evaluator that better matches human judgments for faithfulness and completeness. Experiments show current retrievers and LLMs struggle: retrieval trade-offs and noisy tables hurt factual grounding, and even top LLMs reach only ~40% faithfulness and ~60% completeness on gold tables. Use DPR-style retrieval, limit retrieved tables to the model's context window, and measure with MT-RAIG EVAL for closer alignment to humans.

Problem Statement

Existing table-reasoning benchmarks assume that a single gold table is given. Real users usually need insights that draw evidence from several tables that are not known in advance, and they need reliable automatic checks for faithfulness and completeness.

Main Contribution

MT-RAIG BENCH: a large benchmark (18,532 test examples) for retrieval-augmented insight generation across multiple tables.

MT-RAIG EVAL: a decomposition-based automatic evaluator that breaks insights into table-linked claims and question-aware topics to score faithfulness and completeness.
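
The decomposition idea can be illustrated with a minimal sketch. It assumes faithfulness is the fraction of table-linked claims that a verifier accepts and completeness is the fraction of question-aware topics the insight covers; this is an illustrative simplification, not the authors' exact scoring rule. The names `Claim`, `faithfulness_score`, and `completeness_score` are hypothetical, and the set lookups stand in for the LLM-based verifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Claim:
    """One atomic statement from an insight, linked to the tables it relies on."""
    text: str
    table_ids: Tuple[str, ...]


def faithfulness_score(claims: Iterable[Claim],
                       is_supported: Callable[[Claim], bool]) -> float:
    """Fraction of claims the verifier accepts (assumed scoring rule)."""
    claims = list(claims)
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_supported(c)) / len(claims)


def completeness_score(topics: Iterable[str],
                       is_covered: Callable[[str], bool]) -> float:
    """Fraction of expected topics that the insight addresses (assumed rule)."""
    topics = list(topics)
    if not topics:
        return 0.0
    return sum(1 for t in topics if is_covered(t)) / len(topics)


# Toy data: a real system would call an LLM verifier instead of set lookups.
claims = [
    Claim("Store A had the highest revenue in 2020.", ("sales",)),
    Claim("Store A employs 40 people.", ("staff",)),
    Claim("Store A opened in 1990.", ("stores",)),
]
verified_texts = {claims[0].text, claims[1].text}
topics = ["revenue ranking", "headcount", "regional comparison"]
covered_topics = {"revenue ranking", "headcount"}

faith = faithfulness_score(claims, lambda c: c.text in verified_texts)
comp = completeness_score(topics, lambda t: t in covered_topics)
print(f"faithfulness={faith:.2f} completeness={comp:.2f}")
```

Keeping the verifier as a plain callable makes it easy to swap a human check, a rule-based check, or an LLM call behind the same score.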

Key Findings

Dataset scale and structure

Numbers: 18,532 test examples; 19,563 unique tables; avg gold tables / example = 2.88

Practical Use: You can train and test multi-table retrieval+generation systems at scale; expect multi-table aggregation rather than single-table answers.

Evidence Ref: Table 2

MT-RAIG EVAL aligns better with humans than prior automatic metrics

Numbers: Pearson correlation of 64.94 (faithfulness) and 67.67 (completeness) for MT-RAIG EVAL vs. 47.82 (faithfulness) for G-Eval

Practical Use: Use MT-RAIG EVAL to rank systems when human evaluation is costly; it is closer to human preference on faithfulness and completeness.

Evidence Ref: Table 5
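
To make the meta-evaluation number concrete, here is a minimal sketch of how a Pearson correlation between an automatic metric and human ratings can be computed. The scores below are made-up toy values, not data from the paper, and `pearson` is a hypothetical helper name. Since a Pearson coefficient lies in [-1, 1], the figures above are presumably reported as r × 100.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy ratings (illustrative only): human scores vs. an automatic metric.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson(human_scores, metric_scores)
print(f"Pearson r = {r:.4f} (x100 = {100 * r:.2f})")
```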

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| MT-RAIG EVAL Pearson correlation with human preference | Faithfulness 64.94; Completeness 67.67 | G-Eval (Faithfulness 47.82) | Faithfulness +17.12 vs G-Eval | Meta-evaluation (250 response pairs) | MT-RAIG EVAL shows higher Pearson correlation with human judgments | Table 5 |
| Multi-table retrieval recall (top-10) | DPR R@10 = 80.83% | DTR R@10 = 74.50% | DPR +6.33 pp vs DTR | MT-RAIG BENCH retrieval test | General dense retriever (DPR) achieved highest top-10 recall | Table 19 |
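
For reference, a minimal sketch of a multi-table Recall@k computation. It assumes recall is the fraction of an example's gold tables found in the top-k retrieved tables, averaged over examples; the paper's exact definition may differ. The function names and table IDs are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """Fraction of gold tables that appear among the top-k retrieved tables."""
    gold: Set[str] = set(gold_ids)
    if not gold:
        raise ValueError("each example needs at least one gold table")
    return len(gold & set(ranked_ids[:k])) / len(gold)


def mean_recall_at_k(examples: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average per-example recall over (ranked_ids, gold_ids) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


# Toy retrieval output: ranked table IDs plus the gold set for each question.
examples = [
    (["t3", "t1", "t9", "t2"], {"t1", "t2"}),
    (["t5", "t6", "t7", "t8"], {"t5"}),
]

print(mean_recall_at_k(examples, k=2))
print(mean_recall_at_k(examples, k=4))
```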

What To Try In 7 Days

Run DPR-style dense retrieval with top-10 tables and evaluate generation with MT-RAIG EVAL to compare systems quickly

Tune the number of retrieved tables (k) and measure faithfulness vs completeness to find the sweet spot for your model

Add an automatic claim-level verifier (table-aware decomposition) to catch ungrounded facts before user exposure
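
The k-tuning step can be started on the retrieval side alone. The sketch below is a hedged proxy, not the paper's procedure: for each k it reports mean gold-table recall (a rough stand-in for what the generator could cover) and the mean number of non-gold tables in the prompt (a rough stand-in for the noise that hurts faithfulness). Measuring faithfulness and completeness directly would still require running your generator and MT-RAIG EVAL. `sweep_k` and the toy rankings are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def sweep_k(examples: List[Tuple[Sequence[str], Set[str]]],
            ks: Sequence[int]) -> List[Tuple[int, float, float]]:
    """Return (k, mean recall, mean distractor count) for each candidate k."""
    rows = []
    for k in ks:
        recalls, distractors = [], []
        for ranked, gold in examples:
            top = ranked[:k]
            # Share of gold tables that made it into the top-k.
            recalls.append(len(gold & set(top)) / len(gold))
            # Retrieved tables that are not gold, i.e. prompt noise.
            distractors.append(sum(1 for t in top if t not in gold))
        rows.append((k, sum(recalls) / len(recalls), sum(distractors) / len(distractors)))
    return rows


# Toy rankings (illustrative only): ranked table IDs and gold sets.
examples = [
    (["a", "x", "b", "y"], {"a", "b"}),
    (["c", "d", "z", "w"], {"c", "d"}),
]

results = sweep_k(examples, ks=[1, 2, 4])
for k, recall, noise in results:
    print(f"k={k}: recall={recall:.2f}, distractors per prompt={noise:.2f}")
```

On the toy data, recall rises with k while the distractor count rises with it, which is the trade-off the sweep is meant to expose.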

Agent Features

Tool Use
LLM-based annotator and verifier used for annotation and evaluation

Optimization Features

Training Optimization
LoRA

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

MT-RAIG BENCH relies partly on machine-generated questions and insights; this can reduce linguistic diversity or create alignment with generator artifacts.

Dataset covers relational DB and Wikipedia tables only; performance in specialized domains (finance, medical) is untested.

When Not To Use

When you only need single-table fact extraction (single-table TQA)

For high-stakes production without an added human or programmatic verifier for claims

Failure Modes

Models hallucinate facts when irrelevant tables are present or retrieval misses key tables

Retrieval noise reduces faithfulness even if completeness stays acceptable

Core Entities

Models

DPR, BM25, Contriever, DTR, TableLlama, o3-mini, GPT-4o, Claude 3.5 Sonnet, DeepSeek-R1, Qwen2-7B, Gemma-7B, Llama 3.1-8B, Mistral-7B, Chain-of-Table, TaPERA, Dater

Metrics

MT-RAIG EVAL, SacreBLEU, ROUGE-L, METEOR, BERTScore, A3CU, TAPAS-Acc, G-Eval, Recall@k

Datasets

MT-RAIG BENCH (this work), SPIDER, Open-WikiTable

Benchmarks

MT-RAIG BENCH