A large benchmark and finer evaluation method for generating grounded insights that pull evidence from multiple tables

February 17, 2025 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.3

Citation Count

0

Authors

Kwangwook Seo, Donguk Kwon, Dongha Lee

Why It Matters For Business

If your product needs explainable insights from multiple tables, this benchmark and evaluator help measure real-world retrieval+analysis performance and reveal where current LLMs fail to ground facts.

Summary TLDR

The authors release MT-RAIG BENCH, a large-scale dataset (18.5k test examples) for retrieval-augmented insight generation across multiple tables, and MT-RAIG EVAL, a decomposition-based automatic evaluator that matches human judgments of faithfulness and completeness more closely than prior metrics. Experiments show that current retrievers and LLMs struggle: retrieval trade-offs and noisy tables hurt factual grounding, and even top LLMs reach only ~40% faithfulness and ~60% completeness on gold tables. Practical takeaways: use DPR-style retrieval, limit the number of retrieved tables to what fits the model's context window, and measure with MT-RAIG EVAL for closer alignment with human judgments.

Problem Statement

Existing table reasoning tests assume a single gold table is given. Real users usually need insights that draw evidence from many unknown tables and need reliable automatic checks for faithfulness and completeness.

Main Contribution

MT-RAIG BENCH: a large benchmark (18,532 test examples) for retrieval-augmented insight generation across multiple tables.

MT-RAIG EVAL: a decomposition-based automatic evaluator that breaks insights into table-linked claims and question-aware topics to score faithfulness and completeness.
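The scoring idea behind a decomposition-based evaluator can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes faithfulness is the fraction of decomposed claims a verifier accepts and completeness is the fraction of reference topics the insight covers, and the names `faithfulness_score`, `completeness_score`, `is_supported`, and `is_covered` are hypothetical. In practice the two callbacks would wrap an LLM-based verifier; here they are plain functions so the sketch runs on its own.

```python
from typing import Callable, Sequence


def faithfulness_score(claims: Sequence[str],
                       is_supported: Callable[[str], bool]) -> float:
    """Fraction of decomposed claims that the verifier marks as table-supported.

    Assumption: a simple supported/unsupported decision per claim.
    """
    if not claims:
        return 0.0
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)


def completeness_score(reference_topics: Sequence[str],
                       is_covered: Callable[[str], bool]) -> float:
    """Fraction of question-aware reference topics that the insight covers."""
    if not reference_topics:
        return 0.0
    return sum(1 for topic in reference_topics if is_covered(topic)) / len(reference_topics)


# Toy usage with set membership standing in for an LLM verifier (hypothetical data).
supported_claims = {"claim A", "claim B"}
covered_topics = {"revenue trend"}

faith = faithfulness_score(["claim A", "claim B", "claim C", "claim D"],
                           lambda c: c in supported_claims)
comp = completeness_score(["revenue trend", "regional split"],
                          lambda t: t in covered_topics)
```

Keeping the verifier as a callback makes it easy to swap the backbone model and check how much the scores depend on it.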

Comprehensive baseline study showing retrieval and generation gaps and characterizing noise, scaling, and retriever effects.

Key Findings

Dataset scale and structure

Numbers: 18,532 test examples; 19,563 unique tables; avg gold tables / example = 2.88

MT-RAIG EVAL aligns better with humans than prior automatic metrics

Numbers: Pearson correlation with human judgments: Faithfulness 64.94, Completeness 67.67 (MT-RAIG EVAL) vs. Faithfulness 47.82 (G-Eval)
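To run the same kind of check on your own system, you can correlate an automatic metric's scores with human ratings of the same outputs. The sketch below computes Pearson's r from scratch; the function name `pearson` and the sample lists are illustrative assumptions, not data from the paper. The figures above appear to be correlations scaled by 100, since r itself lies in [-1, 1], but that reading is an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores, one pair per evaluated insight.
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1.0, 3.0, 2.0, 5.0]
r = pearson(metric_scores, human_ratings)
```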

General-purpose dense retrieval (DPR) outperforms table-specific retrievers on this task

Numbers: DPR top-10 recall = 80.83% vs. DTR 74.50% and TableLlama 72.44%
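For reference, top-k recall in a multi-table setting can be computed as below. This sketch assumes the common definition (the share of an example's gold tables that appear among the top-k retrieved tables, averaged over examples); the benchmark's exact definition may differ, and the function names and table ids are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def recall_at_k(ranked_table_ids: Sequence[str],
                gold_table_ids: Iterable[str],
                k: int) -> float:
    """Share of gold tables found in the top-k of a ranked retrieval list."""
    if k <= 0:
        raise ValueError("k must be positive")
    gold = set(gold_table_ids)
    if not gold:
        raise ValueError("each example needs at least one gold table")
    top_k = set(ranked_table_ids[:k])
    return len(gold & top_k) / len(gold)


def mean_recall_at_k(examples: Sequence[Tuple[Sequence[str], Sequence[str]]],
                     k: int) -> float:
    """Average recall@k over (ranked_ids, gold_ids) pairs."""
    if not examples:
        raise ValueError("no examples given")
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


# Toy example: only one of three gold tables is in the top 3.
score = recall_at_k(["t3", "t1", "t9", "t2"], ["t1", "t2", "t7"], k=3)
```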

Generation remains weak on faithfulness even with gold tables

Numbers: Top LLMs show ≈40% faithfulness and ≈60% completeness on provided gold tables (closed-domain)

Adding more retrieved tables helps up to a point, then hurts

Numbers: Generation quality improves as k (the number of retrieved tables) grows, then plateaus or declines; faithfulness drops as the ratio of irrelevant tables rises
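One way to track this effect in your own pipeline is to log the share of irrelevant tables in each prompt next to the faithfulness score. The helper below is a small sketch; `irrelevant_ratio` is a hypothetical name, and "irrelevant" is assumed here to mean "not in the example's gold set".

```python
from typing import Iterable, Sequence


def irrelevant_ratio(retrieved_table_ids: Sequence[str],
                     gold_table_ids: Iterable[str]) -> float:
    """Fraction of retrieved tables that are not gold tables for the example."""
    if not retrieved_table_ids:
        return 0.0
    gold = set(gold_table_ids)
    noisy = sum(1 for table_id in retrieved_table_ids if table_id not in gold)
    return noisy / len(retrieved_table_ids)


# Two of the four retrieved tables are outside the gold set.
ratio = irrelevant_ratio(["t1", "t2", "t8", "t9"], ["t1", "t2", "t3"])
```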

Results

MT-RAIG EVAL Pearson correlation with human preference

Value: Faithfulness 64.94; Completeness 67.67

Baseline: G-Eval (Faithfulness 47.82)

Multi-table retrieval recall (top-10)

Value: DPR R@10 = 80.83%

Baseline: DTR R@10 = 74.50%

Model generation performance (closed-domain, gold tables)

Value: Faithfulness ≈ 40%; Completeness ≈ 60%

Benchmark size and complexity

Value: 18,532 test examples; avg words/insight 189.87; avg gold tables/example 2.88

Baseline: Existing table benchmarks are smaller or single-table

Who Should Care

What To Try In 7 Days

Run DPR-style dense retrieval with top-10 tables and evaluate generation with MT-RAIG EVAL to compare systems quickly

Tune the number of retrieved tables (k) and measure faithfulness vs completeness to find the sweet spot for your model

Add an automatic claim-level verifier (table-aware decomposition) to catch ungrounded facts before user exposure
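The k-tuning step can be organized as a small sweep. The sketch below is one possible setup, not a prescribed procedure: `evaluate` is a hypothetical callback that runs your retrieval and generation pipeline at a given k and returns (faithfulness, completeness), and the selection rule (highest faithfulness among settings that meet a completeness floor, preferring smaller k on ties) is an assumption you may want to change. The numbers in `fake_results` are placeholders for illustration, not measurements.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Scores = Tuple[float, float]  # (faithfulness, completeness)


def sweep_k(k_values: Iterable[int],
            evaluate: Callable[[int], Scores]) -> Dict[int, Scores]:
    """Run the pipeline once per k and collect the scores."""
    return {k: evaluate(k) for k in k_values}


def best_k(results: Dict[int, Scores], min_completeness: float) -> Optional[int]:
    """Pick the k with the highest faithfulness that meets the completeness floor.

    Ties on faithfulness go to the smaller k (a shorter, cheaper prompt).
    Returns None when no setting meets the floor.
    """
    eligible = {k: s for k, s in results.items() if s[1] >= min_completeness}
    if not eligible:
        return None
    return max(eligible, key=lambda k: (eligible[k][0], -k))


# Placeholder scores standing in for real evaluation runs.
fake_results = {1: (0.30, 0.40), 5: (0.42, 0.58), 10: (0.42, 0.61), 20: (0.35, 0.62)}
results = sweep_k(fake_results.keys(), lambda k: fake_results[k])
chosen = best_k(results, min_completeness=0.55)
```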

Agent Features

Tool Use

  • LLM-based annotator and verifier used for annotation and evaluation

Optimization Features

Training Optimization

  • LoRA

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • MT-RAIG BENCH relies partly on machine-generated questions and insights; this can reduce linguistic diversity or create alignment with generator artifacts.
  • Dataset covers relational DB and Wikipedia tables only; performance in specialized domains (finance, medical) is untested.
  • MT-RAIG EVAL uses LLM-based decomposer/verifier and can inherit backbone biases despite reproducibility checks.

When Not To Use

  • When you only need single-table fact extraction (single-table TQA)
  • For high-stakes production without an added human or programmatic verifier for claims
  • When you require domain-specific tables not covered by the benchmark (e.g., medical tables) without validation

Failure Modes

  • Models hallucinate facts when irrelevant tables are present or retrieval misses key tables
  • Retrieval noise reduces faithfulness even if completeness stays acceptable
  • Automatic evaluation can be biased by the backbone LLM used for MT-RAIG EVAL

Core Entities

Models

  • DPR
  • BM25
  • Contriever
  • DTR
  • TableLlama
  • o3-mini
  • GPT-4o
  • Claude 3.5 Sonnet
  • DeepSeek-R1
  • Qwen2-7B
  • Gemma-7B
  • Llama 3.1-8B
  • Mistral-7B
  • Chain-of-Table
  • TaPERA
  • Dater

Metrics

  • MT-RAIG EVAL
  • SacreBLEU
  • ROUGE-L
  • METEOR
  • BERTScore
  • A3CU
  • TAPAS-Acc
  • G-Eval
  • Recall@k

Datasets

  • MT-RAIG BENCH (this work)
  • SPIDER
  • Open-WikiTable

Benchmarks

  • MT-RAIG BENCH