LIT-RAGBench: a 114-item benchmark testing LLM generators' integration, reasoning, table understanding, logic, and abstention in RAG

March 6, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki

Links

Abstract / PDF

Why It Matters For Business

LIT-RAGBench pinpoints generator weaknesses you will hit in production—table parsing, multi-document steps, numeric slips, and over‑abstention—so you can choose models and fixes before deployment.

Summary TLDR

LIT-RAGBench is a compact, human-curated benchmark that tests the Generator component of RAG systems across five practical skills: Integration (multi-source use), Reasoning (multi-hop and numeric), Logic (semantic/constraint interpretation), Table (structured data), and Abstention (withholding an answer when evidence is missing). The dataset contains 114 questions (54 Japanese main + 54 English-translated plus abstention variants). Evaluation uses randomized chunk order, fictional entities to avoid memorization, and an LLM-as-a-Judge (GPT-4.1). Across the tested API and open models, no system exceeded 0.90 overall accuracy; GPT-5 scored 0.872. The benchmark highlights concrete failure modes (unit mismatch, merged‑

Problem Statement

Generators in Retrieval-Augmented Generation must read multiple retrieved documents, reason across them, interpret tables, and avoid hallucinating. Existing benchmarks test pieces of this behavior in isolation. LIT-RAGBench fills the gap by combining practical failure patterns into a single, controllable dataset that isolates the Generator from Retriever errors.

Main Contribution

Definition of five practical Generator categories: Integration, Reasoning, Logic, Table, Abstention.

A human-curated dataset of 114 QA items (Japanese + English translation) using fictional entities to prevent memorized answers.

An evaluation protocol that randomizes chunk order and isolates Generator performance using LLM-as-a-Judge (GPT-4.1).

Open release of dataset, prompts, and code to reproduce experiments.
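The randomized chunk order in the evaluation protocol can be illustrated with a short sketch. This is not the benchmark's released code: `build_context`, the `[Document i]` label format, and the sample chunks are assumptions made for illustration. The idea is only that a fixed seed gives a reproducible order, while varying the seed changes where the evidence appears in the prompt.

```python
import random

def build_context(chunks, seed):
    """Join retrieved chunks in a reproducible, seed-dependent random order."""
    rng = random.Random(seed)      # local RNG: same seed, same order
    shuffled = list(chunks)        # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return "\n\n".join(
        f"[Document {i}]\n{text}" for i, text in enumerate(shuffled, start=1)
    )

# Made-up chunks about a fictional company, for illustration only.
chunks = [
    "Norvale Corp was founded in 2011.",
    "Norvale Corp moved its headquarters to Port Elra in 2019.",
    "The Port Elra office has 240 employees.",
]

# One context per seed; a position-robust generator should answer
# the same question consistently across all of them.
for seed in range(3):
    print(build_context(chunks, seed), end="\n\n---\n\n")
```

Using a local `random.Random(seed)` rather than the module-level functions keeps the shuffle independent of any other randomness in the evaluation harness.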

Key Findings

Benchmark size and composition.

Numbers: 114 questions (54 main, 54 abstention variants)

No evaluated model exceeded 90% overall accuracy.

Numbers: best overall = 0.872 (GPT-5)

Top-performing model (API) scored 0.872 overall.

Numbers: GPT-5 overall accuracy = 0.872

Table understanding is a notable weak point when tables are split or merged.

Numbers: Gemini-2.5-Flash table score = 0.903 (en)

Some models over-abstain, which exposes a tradeoff between safety and usefulness.

Numbers: Claude-Sonnet-4 over-abstention avg = 0.259

Numeric reasoning varies by model; some models make arithmetic slip-ups.

Numbers: o3 solved all numerical-calculation instances in experiments

Automated scoring used an LLM judge and is feasible for closed-ended checks.

Numbers: Judge = GPT-4.1 (binary correctness labels)
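A binary judging step of this kind could be wired up roughly as below. Everything here is hypothetical: the prompt wording, the `judge_answer` function, and the `ask_judge` callable (a stand-in for whatever client sends a prompt to the judge model) are assumptions for illustration, not the benchmark's released prompts or code.

```python
from typing import Callable

# Hypothetical judge prompt; the benchmark's actual prompt may differ.
JUDGE_TEMPLATE = """You are grading an answer to a closed-ended question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(question: str, reference: str, candidate: str,
                 ask_judge: Callable[[str], str]) -> bool:
    """Return True if the judge labels the candidate answer as correct."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )
    verdict = ask_judge(prompt).strip().upper()
    if verdict not in {"CORRECT", "INCORRECT"}:
        # Refuse to guess when the judge ignores the output format.
        raise ValueError(f"Unparseable judge verdict: {verdict!r}")
    return verdict == "CORRECT"

# Stub judge so the sketch runs without any API access.
def always_correct(prompt: str) -> str:
    return "correct\n"

print(judge_answer("What year was Norvale Corp founded?", "2011", "2011",
                   always_correct))
```

Passing the judge in as a callable keeps the parsing logic testable offline and makes it easy to swap judge models when checking for evaluator bias.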

Results

Accuracy (best overall, GPT-5): 0.872

Top open-weight model overall: 0.859

Accuracy (Table, English, Gemini-2.5-Flash): 0.903

Over-Abstention Rate (highest avg, Claude-Sonnet-4): 0.259
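Both metrics can be computed from per-item binary labels. The sketch below assumes that the over-abstention rate is the share of answerable items on which the model abstained; that definition, the `JudgedItem` record, and the sample data are assumptions for illustration and may not match the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class JudgedItem:
    answerable: bool  # False for abstention variants (evidence missing)
    abstained: bool   # the model declined to answer
    correct: bool     # binary label from the judge

def accuracy(items):
    """Fraction of all items the judge marked correct."""
    if not items:
        raise ValueError("no items to score")
    return sum(item.correct for item in items) / len(items)

def over_abstention_rate(items):
    """Fraction of answerable items on which the model abstained (assumed definition)."""
    answerable = [item for item in items if item.answerable]
    if not answerable:
        raise ValueError("no answerable items to score")
    return sum(item.abstained for item in answerable) / len(answerable)

# Made-up labels: two answerable items, two abstention variants.
items = [
    JudgedItem(answerable=True, abstained=False, correct=True),
    JudgedItem(answerable=True, abstained=True, correct=False),   # over-abstention
    JudgedItem(answerable=False, abstained=True, correct=True),   # proper abstention
    JudgedItem(answerable=False, abstained=False, correct=False), # hallucinated answer
]

print(accuracy(items))              # 0.5
print(over_abstention_rate(items))  # 0.5
```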

What To Try In 7 Days

Run LIT-RAGBench on your generator to profile category-specific errors.

Add preprocessing: normalize units, merge table chunks, and add headers before retrieval.

Adjust prompts to enforce unit formats and calibrate abstention behavior, then retest accuracy against over-abstention.
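The table preprocessing step can be sketched as follows, assuming pipe-delimited tables where only the first chunk carries the two header lines (column names and separator). The `merge_table_chunks` helper and the sample table are hypothetical; real chunkers may split tables differently.

```python
def merge_table_chunks(chunks):
    """Merge pipe-table chunks into one table with a single header.

    Assumes the first chunk starts with a header line and a separator line,
    and that later chunks are either bare rows or repeat that same header.
    """
    first = chunks[0].strip().splitlines()
    header, rows = first[:2], first[2:]
    for chunk in chunks[1:]:
        lines = chunk.strip().splitlines()
        if lines[:2] == header:   # drop a repeated header, keep the rows
            lines = lines[2:]
        rows.extend(lines)
    return "\n".join(header + rows)

# Made-up table split across two chunks; the second chunk lost its header.
chunks = [
    "| Plan | Storage |\n|---|---|\n| Basic | 500 MB |",
    "| Pro | 2 GB |",
]

print(merge_table_chunks(chunks))
```

If merging whole tables makes chunks too large, the same header lines can instead be prepended to every row chunk so each one remains interpretable on its own.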

Agent Features

Memory

  • Retrieval Memory

Tool Use

  • LLM-as-a-Judge

Frameworks

  • LIT-RAGBench

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Small dataset (114 items) and imbalanced aspect coverage.
  • Relies on LLM-as-a-Judge (GPT-4.1), which can introduce automatic-evaluator bias.
  • English set produced via machine translation and curation; possible translation artifacts.
  • Focuses on Generator isolation; does not measure retriever performance end-to-end.

When Not To Use

  • As the only benchmark for large-scale model selection—sample size is small.
  • To evaluate retriever ranking or retrieval latency.
  • For domains requiring real-world factual grounding rather than fictional test cases.

Failure Modes

  • Over-abstention when models are uncertain despite answerable context.
  • Unit and scale conversion errors (e.g., MB vs GB, million vs billion).
  • Failure to merge or interpret split/merged table cells across chunks.
  • Numeric arithmetic slip-ups in multi-step calculations.
  • Relying on lexical cues only and missing semantically equivalent evidence.
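Unit and scale errors of the kind listed above can be caught with a deterministic check before or alongside the LLM judge. The sketch below is an illustration, not part of the benchmark: `to_base`, `same_quantity`, and the unit tables are assumptions, and byte units are treated as decimal (1 GB = 1000 MB), which may not fit every dataset.

```python
import math
import re

# Assumed decimal (SI) byte units and English scale words.
BYTE_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
SCALE_WORDS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

# A number (optional thousands separators and decimals) followed by a unit word.
_QUANTITY = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)")

def to_base(text):
    """Convert the first '<number> <unit>' in text to a base-unit float."""
    match = _QUANTITY.search(text)
    if match is None:
        raise ValueError(f"no quantity found in {text!r}")
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2)
    factor = BYTE_UNITS.get(unit.upper()) or SCALE_WORDS.get(unit.lower())
    if factor is None:
        raise ValueError(f"unknown unit {unit!r}")
    return number * factor

def same_quantity(a, b):
    """True if two quantity strings denote the same value after normalization."""
    return math.isclose(to_base(a), to_base(b), rel_tol=1e-9)

print(same_quantity("2 GB", "2000 MB"))              # True
print(same_quantity("2 GB", "2 MB"))                 # False
print(same_quantity("1.5 billion", "1,500 million")) # True
```

A check like this only covers the units it knows about, so unrecognized units raise an error instead of being silently compared.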

Core Entities

Models

  • GPT-5
  • o3
  • o4-mini
  • GPT-4.1
  • Gemini-2.5-Flash
  • Claude-Sonnet-4
  • Qwen3-235B-A22B-Instruct
  • Qwen3-235B-A22B-Thinking
  • Llama-3.1-8B-Instruct
  • Llama-3.3-70B-Instruct
  • Gemma-3-27B-Instruct

Metrics

  • Accuracy
  • Over-Abstention Rate

Datasets

  • LIT-RAGBench (114 QA; Japanese + English translation)

Benchmarks

  • FRAMES
  • RAGBench
  • RAGTruth
  • RGB
  • HotpotQA
  • JEMHopQA
  • MMQA