A large benchmark and finer evaluation method for generating grounded insights that pull evidence from multiple tables

Overview

Decision Snapshot: Needs Validation

The benchmark and evaluator are research-ready and useful for comparing systems. Generation faithfulness remains low, however, so add a verification step before production deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 50%

Novelty: 60%

Authors

Kwangwook Seo, Donguk Kwon, Dongha Lee

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product needs explainable insights from multiple tables, this benchmark and evaluator help measure real-world retrieval+analysis performance and reveal where current LLMs fail to ground facts.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

The authors release MT-RAIG BENCH, a large-scale dataset (18.5k test examples) for retrieval-augmented insight generation across multiple tables, and MT-RAIG EVAL, a decomposition-based automatic evaluator that better matches human judgments for faithfulness and completeness. Experiments show current retrievers and LLMs struggle: retrieval trade-offs and noisy tables hurt factual grounding, and even top LLMs reach only ~40% faithfulness and ~60% completeness on gold tables. Use DPR-style retrieval, limit retrieved tables to the model's context window, and measure with MT-RAIG EVAL for closer alignment to humans.

Problem Statement

Existing table-reasoning benchmarks assume that a single gold table is given. Real users usually need insights that draw evidence from several tables that are not known in advance, and evaluating such systems requires reliable automatic checks for faithfulness and completeness.

Main Contribution

MT-RAIG BENCH: a large benchmark (18,532 test examples) for retrieval-augmented insight generation across multiple tables.

MT-RAIG EVAL: a decomposition-based automatic evaluator that breaks insights into table-linked claims and question-aware topics to score faithfulness and completeness.
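To make the decomposition idea concrete, here is a minimal sketch of how claim-level and topic-level verdicts could be aggregated into two scores. It is not the authors' implementation: the function names, the boolean verifier callbacks, and the choice to score each dimension as a simple fraction are all assumptions made for illustration. In practice the callbacks would wrap an LLM-based verifier that checks a claim against its linked tables.

```python
from typing import Callable, Sequence


def faithfulness_score(
    claims: Sequence[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of decomposed claims that the verifier marks as table-supported.

    `is_supported` is a hypothetical callback (for example, an LLM verifier
    that sees the claim together with its linked tables).
    """
    if not claims:
        return 0.0  # assumption: an insight with no claims earns no credit
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)


def completeness_score(
    topics: Sequence[str],
    is_covered: Callable[[str], bool],
) -> float:
    """Fraction of question-aware topics that the generated insight covers.

    `is_covered` is likewise a hypothetical callback.
    """
    if not topics:
        return 0.0
    return sum(1 for topic in topics if is_covered(topic)) / len(topics)
```

With four decomposed claims of which the verifier accepts two, `faithfulness_score` returns 0.5. Keeping the verifier behind a callback lets the same aggregation be reused with a human annotator, a rule-based checker, or an LLM.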

Key Findings

Dataset scale and structure

Numbers: 18,532 test examples; 19,563 unique tables; avg gold tables / example = 2.88

Practical Use: You can train and test multi-table retrieval+generation systems at scale; expect multi-table aggregation rather than single-table answers.

Evidence Ref: Table 2

MT-RAIG EVAL aligns better with humans than prior automatic metrics

Numbers: Pearson correlation with human judgments: Faithfulness 64.94, Completeness 67.67 (MT-RAIG EVAL) vs. Faithfulness 47.82 (G-Eval)

Practical Use: Use MT-RAIG EVAL to rank systems when human evaluation is costly; it is closer to human preference on faithfulness and completeness.

Evidence Ref: Table 5
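For readers who want to run a similar meta-evaluation on their own outputs, the sketch below computes the Pearson correlation between an automatic metric's scores and human ratings. It assumes one numeric score per response from each source. How the paper derives its scores from the response pairs is not described here, and reading the reported figures as r × 100 is also an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(metric_scores, human_scores)` returns a value in [-1, 1]; multiply by 100 to compare against figures on the scale used above.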

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MT-RAIG EVAL Pearson correlation with human preference	Faithfulness 64.94; Completeness 67.67	G-Eval (Faithfulness 47.82)	Faithfulness +17.12 vs G-Eval	Meta-evaluation (250 response pairs)	MT-RAIG EVAL shows higher Pearson correlation with human judgments	Table 5
Multi-table retrieval recall (top-10)	DPR R@10 = 80.83%	DTR R@10 = 74.50%	DPR +6.33 pp vs DTR	MT-RAIG BENCH retrieval test	General dense retriever (DPR) achieved highest top-10 recall	Table 19
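The retrieval row reports recall at top-10. One common way to compute recall@k when each question has several gold tables is sketched below. The exact definition used in the paper may differ (for example, it could require all gold tables to be retrieved), so this per-example fraction, and the function and parameter names, are assumptions.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked_table_ids: Sequence[str], gold_table_ids: Set[str], k: int) -> float:
    """Fraction of gold tables that appear among the top-k retrieved tables."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not gold_table_ids:
        raise ValueError("each example needs at least one gold table")
    top_k = set(ranked_table_ids[:k])
    return len(top_k & gold_table_ids) / len(gold_table_ids)


def mean_recall_at_k(examples: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k over (ranked_ids, gold_ids) pairs."""
    scores = [recall_at_k(ranked, gold, k) for ranked, gold in examples]
    if not scores:
        raise ValueError("no examples given")
    return sum(scores) / len(scores)
```

Because the benchmark averages roughly three gold tables per example, a fractional definition like this one gives partial credit when only some of the needed tables are retrieved.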

What To Try In 7 Days

Run DPR-style dense retrieval with top-10 tables and evaluate generation with MT-RAIG EVAL to compare systems quickly

Tune the number of retrieved tables (k) and measure faithfulness vs completeness to find the sweet spot for your model

Add an automatic claim-level verifier (table-aware decomposition) to catch ungrounded facts before user exposure
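The k-tuning step above can be sketched as a small selection routine. It assumes you have already measured faithfulness and completeness for each candidate k on a validation set. Combining the two with a harmonic mean is one possible selection rule chosen for this illustration, not something the paper prescribes, and `pick_k` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def pick_k(scores_by_k: Dict[int, Tuple[float, float]]) -> int:
    """Return the k whose (faithfulness, completeness) pair balances best.

    Candidates are visited in ascending order, so ties go to the smallest k,
    which keeps the prompt shorter.
    """
    if not scores_by_k:
        raise ValueError("no candidate k values given")
    return max(sorted(scores_by_k), key=lambda k: harmonic_mean(*scores_by_k[k]))
```

The harmonic mean penalizes settings where one score is high and the other collapses, which matches the trade-off between retrieving more tables for completeness and fewer for faithfulness. A weighted rule may suit products where ungrounded facts are costlier than omissions.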

Agent Features

Tool Use

LLM-based annotator and verifier used for annotation and evaluation

Optimization Features

Training Optimization

LoRA

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://kwondu.github.io/mt-raig

Data URLs

https://kwondu.github.io/mt-raig

Risks & Boundaries

Limitations

MT-RAIG BENCH relies partly on machine-generated questions and insights; this can reduce linguistic diversity or bias the data toward artifacts of the generator model.

The dataset covers relational-database and Wikipedia tables only; performance in specialized domains (e.g., finance, medical) is untested.

When Not To Use

When you only need single-table fact extraction (single-table TQA)

For high-stakes production without an added human or programmatic verifier for claims

Failure Modes

Models hallucinate facts when irrelevant tables are present or retrieval misses key tables

Retrieval noise reduces faithfulness even if completeness stays acceptable

Core Entities

Models

DPR, BM25, Contriever, DTR, TableLlama, o3-mini, GPT-4o, Claude 3.5 Sonnet, DeepSeek-R1, Qwen2-7B, Gemma-7B, Llama 3.1-8B, Mistral-7B, Chain-of-Table, TaPERA, Dater

Metrics

MT-RAIG EVAL, SacreBLEU, ROUGE-L, METEOR, BERTScore, A3CU, TAPAS-Acc, G-Eval, Recall@k

Datasets

MT-RAIG BENCH (this work), SPIDER, Open-WikiTable

Benchmarks

MT-RAIG BENCH

