LIT-RAGBench: a 114-item benchmark testing LLM generators' integration, reasoning, table understanding, logic, and abstention in RAG

March 6, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper provides clear, reproducible evaluation recipes and measurable category scores, but the dataset is small and judge bias from automatic scoring remains a concern.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 60%

Authors

Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki

Why It Matters For Business

LIT-RAGBench pinpoints generator weaknesses you will hit in production—table parsing, multi-document steps, numeric slips, and over‑abstention—so you can choose models and fixes before deployment.

Who Should Care

Summary TLDR

LIT-RAGBench is a compact, human-curated benchmark that tests the Generator component of RAG systems across five practical skills: Integration (using multiple sources), Reasoning (multi-hop and numeric), Logic (semantic and constraint interpretation), Table (structured data), and Abstention (withholding an answer when evidence is missing). The dataset contains 114 questions (54 Japanese main + 54 English-translated plus abstention variants). Evaluation uses randomized chunk order, fictional entities to avoid memorization, and an LLM-as-a-Judge (GPT-4.1). Across the tested API and open-weight models, no system exceeded 0.90 overall accuracy; GPT-5 scored highest at 0.872. The benchmark highlights concrete failure modes (unit mismatch, merged‑

Problem Statement

Generators in Retrieval-Augmented Generation must read multiple retrieved documents, reason across them, interpret tables, and avoid hallucinating. Existing benchmarks test pieces of this behavior in isolation. LIT-RAGBench fills the gap by combining practical failure patterns into a single, controllable dataset that isolates the Generator from Retriever errors.
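To illustrate how a fixed chunk set with randomized order keeps the Retriever out of the picture, here is a minimal sketch. The function name `build_prompt`, the `[Document N]` labels, and the prompt layout are assumptions made for illustration; they are not taken from the paper's released code.

```python
import random


def build_prompt(question, gold_chunks, distractor_chunks, seed):
    """Assemble a generator prompt from a fixed chunk set in shuffled order.

    Hypothetical helper: the chunk set is supplied directly, so no retriever
    runs and any error in the answer is attributable to the generator.
    """
    chunks = list(gold_chunks) + list(distractor_chunks)
    rng = random.Random(seed)  # seeded so a given item is reproducible
    rng.shuffle(chunks)        # the generator cannot rely on chunk position
    context = "\n\n".join(
        f"[Document {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Using a separate seed per item keeps runs repeatable while still varying where the supporting chunk appears.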

Main Contribution

Definition of five practical Generator categories: Integration, Reasoning, Logic, Table, Abstention.

A human-curated dataset of 114 QA items (Japanese + English translation) using fictional entities to prevent memorized answers.

Key Findings

Benchmark size and composition.

Numbers: 114 questions (54 main, 54 abstention variants)

Practical Use: Run this compact benchmark quickly to surface practical generator errors before larger-scale testing.

Evidence Ref: Section 4.2 Statistics

No evaluated model exceeded 90% overall accuracy.

Numbers: best overall = 0.872 (GPT-5)

Practical Use: Expect gaps in generator reliability; compare models by category rather than a single overall number.

Evidence Ref: Section 5.3; Figure 3; Table 2
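Comparing models by category can be sketched as a simple grouped accuracy. The `(category, is_correct)` record layout and the function name are assumptions for illustration, and any sample values passed in are placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return accuracy per category from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        if is_correct:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}
```

Two models with the same overall score can differ sharply once the judged results are split this way, for example one failing mostly on Table items and the other on Abstention.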

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.872 | - | - | LIT-RAGBench (ja/en) | GPT-5 achieved 0.872 overall on both languages | Section 5.3; Figure 3; Table 2
Top open-weight model overall | 0.859 | - | - | LIT-RAGBench (ja/en) | Qwen3-235B-A22B-Instruct = 0.859 | Section 5.3; Table 2

What To Try In 7 Days

Run LIT-RAGBench on your generator to profile category-specific errors.

Add preprocessing: normalize units, merge table chunks, and add headers before retrieval.

Adjust prompts to enforce unit formats and calibrate abstention thresholds, then retest accuracy against the over-abstention rate.
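The unit-normalization step could look like the sketch below, which rewrites storage sizes into a single unit before chunks are indexed. It assumes decimal SI multiples (1 GB = 1000 MB) and handles only the units listed; the function name and the choice of MB as the target unit are illustrative assumptions.

```python
import re

# Assumption: decimal SI multiples. Use powers of 1024 if your data means GiB/MiB.
_UNIT_TO_BYTES = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\b")


def normalize_storage_units(text):
    """Rewrite every storage size in `text` as a value in MB."""

    def to_mb(match):
        value = float(match.group(1))
        unit = match.group(2)
        mb = value * _UNIT_TO_BYTES[unit] / _UNIT_TO_BYTES["MB"]
        # Fixed-point formatting, then trim trailing zeros ("2000.000000" -> "2000").
        number = f"{mb:.6f}".rstrip("0").rstrip(".")
        return f"{number} MB"

    return _SIZE_PATTERN.sub(to_mb, text)
```

Normalizing at indexing time means the generator compares like with like, instead of having to convert between MB and GB inside its answer.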

Agent Features

Memory: Retrieval Memory
Tool Use: LLM-as-a-Judge
Frameworks: LIT-RAGBench

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Small dataset (114 items) and imbalanced aspect coverage.

Relies on an LLM-as-a-Judge (GPT-4.1), which can introduce automatic-evaluator bias.
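One way to gauge evaluator bias is to spot-check a sample of judge verdicts against human labels. The sketch below computes raw agreement and Cohen's kappa for binary correct/incorrect verdicts. This audit is a suggestion rather than something the paper reports, and the function name and input layout are assumptions.

```python
def judge_agreement(judge_labels, human_labels):
    """Return (raw_agreement, cohens_kappa) for two equal-length 0/1 label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_pos = sum(judge_labels) / n
    human_pos = sum(human_labels) / n
    # Agreement expected by chance if the two raters labeled independently.
    expected = judge_pos * human_pos + (1 - judge_pos) * (1 - human_pos)
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label
    return observed, (observed - expected) / (1 - expected)
```

Low kappa on a given category suggests that the category's scores reflect the judge's quirks as much as the generator's behavior.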

When Not To Use

As the only benchmark for large-scale model selection—sample size is small.

To evaluate retriever ranking or retrieval latency.
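The sample-size caveat can be made concrete with a confidence interval. The sketch below computes a 95% Wilson score interval for the best reported accuracy, assuming for illustration that the 0.872 score is over all 114 items; the summary does not state the exact denominator.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: n = 114 items behind the 0.872 overall score.
low, high = wilson_interval(0.872, 114)
```

Under that assumption the interval spans roughly 0.80 to 0.92, wide enough that small gaps between models, such as 0.872 versus 0.859, should not be treated as decisive.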

Failure Modes

Over-abstention when models are uncertain despite answerable context.

Unit and scale conversion errors (e.g., MB vs GB, million vs billion).

Core Entities

Models

GPT-5, o3, o4-mini, GPT-4.1, Gemini-2.5-Flash, Claude-Sonnet-4, Qwen3-235B-A22B-Instruct, Qwen3-235B-A22B-Thinking, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct, Gemma-3-27B-Instruct

Metrics

Accuracy, Over-Abstention Rate
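A minimal sketch of how the two metrics might be computed from judged records follows. It assumes over-abstention rate means the share of answerable questions on which the model abstained; the paper's exact definition may differ, and the record fields (`answerable`, `abstained`, `correct`) are hypothetical.

```python
def score(records):
    """Compute accuracy and over-abstention rate from judged records.

    Each record is a dict with boolean fields:
      answerable - the context contains the evidence needed to answer
      abstained  - the model declined to answer
      correct    - the judge marked the response as correct
    """
    total = len(records)
    answerable = [r for r in records if r["answerable"]]
    accuracy = sum(r["correct"] for r in records) / total if total else 0.0
    over_abstention = (
        sum(r["abstained"] for r in answerable) / len(answerable)
        if answerable
        else 0.0
    )
    return {"accuracy": accuracy, "over_abstention_rate": over_abstention}
```

Tracking both numbers together shows whether a prompt change improved abstention on unanswerable items at the cost of refusing answerable ones.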

Datasets

LIT-RAGBench (114 QA; Japanese + English translation)

Benchmarks

FRAMES, RAGBench, RAGTruth, RGB, HotpotQA, JEMHopQA, MMQA