LIT-RAGBench: a 114-item benchmark testing LLM generators' integration, reasoning, table understanding, logic, and abstention in RAG

Overview

Decision Snapshot: Needs Validation

The paper provides clear, reproducible evaluation recipes and measurable category scores, but the dataset is small and judge bias from automatic scoring remains a concern.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 0/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 70%

Novelty: 60%

Authors

Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki

Why It Matters For Business

LIT-RAGBench pinpoints generator weaknesses you will hit in production (table parsing, multi-document steps, numeric slips, and over-abstention), so you can choose models and fixes before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

LIT-RAGBench is a compact, human-curated benchmark that tests the Generator part of RAG systems across five practical skills: Integration (multi-source use), Reasoning (multi-hop and numeric), Logic (semantic/constraint interpretation), Table (structured data), and Abstention (withhold when evidence is missing). The dataset contains 114 questions (54 Japanese main + 54 English-translated plus abstention variants). Evaluation uses randomized chunk order, fictional entities to avoid memorization, and an LLM-as-a-Judge (GPT-4.1). Across tested API and open models no system exceeded 0.90 overall accuracy; GPT-5 scored 0.872. The benchmark highlights concrete failure modes (unit mismatch, merged‑

Problem Statement

Generators in Retrieval-Augmented Generation must read multiple retrieved documents, reason across them, interpret tables, and avoid hallucinating. Existing benchmarks test pieces of this behavior in isolation. LIT-RAGBench fills the gap by combining practical failure patterns into a single, controllable dataset that isolates the Generator from Retriever errors.
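One way to picture the isolation described above: the generator receives a fixed set of context chunks for each question, and only their order is randomized. The sketch below is illustrative and not the authors' harness; the function name `build_prompt`, the `[Document i]` layout, and the seed handling are all assumptions.

```python
import random

def build_prompt(question, chunks, seed):
    """Shuffle a fixed set of context chunks and format a generator prompt.

    Hypothetical helper: the prompt layout is an assumption, not the
    benchmark's actual template.
    """
    rng = random.Random(seed)   # seeded so the same order can be reproduced
    shuffled = list(chunks)     # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    context = "\n\n".join(
        f"[Document {i}]\n{text}" for i, text in enumerate(shuffled, start=1)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Because no retriever is involved, any error in the answer can be attributed to the generator rather than to missing or misranked documents.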

Main Contribution

Definition of five practical Generator categories: Integration, Reasoning, Logic, Table, Abstention.

A human-curated dataset of 114 QA items (Japanese + English translation) using fictional entities to prevent memorized answers.

Key Findings

Benchmark size and composition.

Numbers: 114 questions (54 main, 54 abstention variants)

Practical Use: Run this compact benchmark quickly to surface practical generator errors before larger-scale testing.

Evidence Ref: Section 4.2 Statistics

No evaluated model exceeded 90% overall accuracy.

Numbers: best overall = 0.872 (GPT-5)

Practical Use: Expect gaps in generator reliability; compare models by category rather than a single overall number.

Evidence Ref: Section 5.3; Figure 3; Table 2
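Comparing models by category comes down to grouping judged answers before averaging. A minimal sketch, assuming each judged item is a `(category, is_correct)` pair; the record shape and the sample rows are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """Return {category: accuracy} from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}

# Made-up judge verdicts, for illustration only.
judged = [
    ("Table", True),
    ("Table", False),
    ("Reasoning", True),
    ("Abstention", False),
]
scores = accuracy_by_category(judged)
```

A model with a strong overall score can still show a weak Table or Abstention row, which is the gap a single number hides.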

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.872	—	—	LIT-RAGBench (ja/en)	GPT-5 achieved 0.872 overall on both languages	Section 5.3; Figure 3; Table 2
Top open-weight model overall	0.859	—	—	LIT-RAGBench (ja/en)	Qwen3-235B-A22B-Instruct = 0.859	Section 5.3; Table 2

What To Try In 7 Days

Run LIT-RAGBench on your generator to profile category-specific errors.

Add preprocessing: normalize units, merge table chunks, and add headers before retrieval.

Adjust prompts to enforce unit formats and calibrate abstention thresholds, then retest accuracy against the over-abstention rate.
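The preprocessing step above could start as small as the sketch below: rewrite storage sizes into one unit and re-attach a table header to a chunk that was split away from it. Everything here is an assumption for illustration: the helper names, the decimal convention (1 GB = 1000 MB), and the uppercase-only unit matching.

```python
import re

# Assumption: decimal units (1 GB = 1000 MB). Use 1024-based factors if
# your documents use binary units.
_TO_MB = {"KB": 0.001, "MB": 1.0, "GB": 1000.0}
_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB)\b")

def normalize_sizes(text):
    """Rewrite every KB/MB/GB mention in `text` as megabytes."""
    def to_mb(match):
        value = float(match.group(1)) * _TO_MB[match.group(2)]
        # Fixed-point formatting, then trim trailing zeros and the dot.
        number = f"{value:.3f}".rstrip("0").rstrip(".")
        return f"{number} MB"
    return _SIZE.sub(to_mb, text)

def add_header(header_row, table_chunk):
    """Prepend the header row to a table chunk that lacks it."""
    if table_chunk.startswith(header_row):
        return table_chunk
    return f"{header_row}\n{table_chunk}"

print(normalize_sizes("Plan A: 2 GB, Plan B: 512 MB"))
# prints "Plan A: 2000 MB, Plan B: 512 MB"
```

Running the benchmark before and after such a step shows whether the unit and table errors come from the documents or from the model.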

Agent Features

Memory

Retrieval Memory

Tool Use

LLM-as-a-Judge

Frameworks

LIT-RAGBench

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Koki-Itai/LIT-RAGBench

Data URLs

https://github.com/Koki-Itai/LIT-RAGBench

Risks & Boundaries

Limitations

Small dataset (114 items) and imbalanced aspect coverage.

Relies on an LLM-as-a-Judge (GPT-4.1), which can introduce evaluator bias.

When Not To Use

As the only benchmark for large-scale model selection; the sample size is small.

To evaluate retriever ranking or retrieval latency.

Failure Modes

Over-abstention when models are uncertain despite answerable context.

Unit and scale conversion errors (e.g., MB vs GB, million vs billion).

Core Entities

Models

GPT-5, o3, o4-mini, GPT-4.1, Gemini-2.5-Flash, Claude-Sonnet-4, Qwen3-235B-A22B-Instruct, Qwen3-235B-A22B-Thinking, Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct, Gemma-3-27B-Instruct

Metrics

Accuracy, Over-Abstention Rate
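This summary names the Over-Abstention Rate without defining it. A plausible reading, stated here as an assumption rather than the paper's definition, is the share of answerable questions on which the model declined to answer. The function name and record shape below are hypothetical.

```python
def over_abstention_rate(records):
    """Fraction of answerable items on which the model abstained.

    records: iterable of (is_answerable, model_abstained) pairs.
    Assumed definition; check the paper before comparing numbers.
    """
    abstained_flags = [
        abstained for is_answerable, abstained in records if is_answerable
    ]
    if not abstained_flags:
        return 0.0  # no answerable items, so nothing to over-abstain on
    return sum(abstained_flags) / len(abstained_flags)
```

Tracking this next to accuracy makes it visible when a prompt change reduces hallucination only by making the model refuse more often.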

Datasets

LIT-RAGBench (114 QA; Japanese + English translation)

Benchmarks

FRAMES, RAGBench, RAGTruth, RGB, HotpotQA, JEMHopQA, MMQA
