Overview
The paper provides clear, reproducible evaluation recipes and measurable category scores, but the dataset is small and judge bias from automatic scoring remains a concern.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 0/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 40%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
LIT-RAGBench pinpoints generator weaknesses you will hit in production—table parsing, multi-document steps, numeric slips, and over‑abstention—so you can choose models and fixes before deployment.
Summary TLDR
LIT-RAGBench is a compact, human-curated benchmark that tests the Generator part of RAG systems across five practical skills: Integration (multi-source use), Reasoning (multi-hop and numeric), Logic (semantic/constraint interpretation), Table (structured data), and Abstention (withholding an answer when evidence is missing). The dataset contains 114 questions (54 Japanese main + 54 English-translated plus abstention variants). Evaluation uses randomized chunk order, fictional entities to avoid memorization, and an LLM-as-a-Judge (GPT-4.1). Across the tested API and open models, no system exceeded 0.90 overall accuracy; GPT-5 scored 0.872. The benchmark highlights concrete failure modes (unit mismatch, merged‑
Problem Statement
Generators in Retrieval-Augmented Generation must read multiple retrieved documents, reason across them, interpret tables, and avoid hallucinating. Existing benchmarks test pieces of this behavior in isolation. LIT-RAGBench fills the gap by combining practical failure patterns into a single, controllable dataset that isolates the Generator from Retriever errors.
Main Contribution
Definition of five practical Generator categories: Integration, Reasoning, Logic, Table, Abstention.
A human-curated dataset of 114 QA items (Japanese + English translation) using fictional entities to prevent memorized answers.
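To make the setup concrete, here is a minimal Python sketch of how one benchmark item might be represented and shown to a generator with randomized chunk order. The `BenchItem` fields, the `build_prompt` helper, and the prompt layout are assumptions for illustration only; they are not the benchmark's actual schema or prompt.

```python
import random
from dataclasses import dataclass

# The five Generator categories named in the summary.
CATEGORIES = ("Integration", "Reasoning", "Logic", "Table", "Abstention")


@dataclass
class BenchItem:
    """Hypothetical container for one QA item (field names are assumed)."""
    question: str
    chunks: list      # retrieved passages supplied to the generator
    answer: str       # reference answer, or an abstention marker
    category: str

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def build_prompt(item: BenchItem, seed: int) -> str:
    """Return a prompt whose context chunks are shuffled reproducibly."""
    rng = random.Random(seed)      # local RNG, so global state is untouched
    shuffled = list(item.chunks)   # copy, so the item itself is not mutated
    rng.shuffle(shuffled)
    context = "\n\n".join(
        f"[Document {i + 1}]\n{chunk}" for i, chunk in enumerate(shuffled)
    )
    return f"{context}\n\nQuestion: {item.question}\nAnswer:"
```

Using a seeded local generator keeps runs reproducible while still removing any position cue from the chunk order.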
Key Findings
Benchmark size and composition: 114 human-curated questions in Japanese and English, spanning the five Generator categories.
No evaluated model exceeded 90% overall accuracy.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.872 | — | — | LIT-RAGBench (ja/en) | GPT-5 achieved 0.872 overall on both languages | Section 5.3; Figure 3; Table 2 |
| Top open-weight model overall | 0.859 | — | — | LIT-RAGBench (ja/en) | Qwen3-235B-A22B-Instruct = 0.859 | Section 5.3; Table 2 |
What To Try In 7 Days
Run LIT-RAGBench on your generator to profile category-specific errors.
Add preprocessing: normalize units, merge table chunks, and add headers before retrieval.
Adjust prompts to enforce unit formats and to calibrate when the model abstains, then retest to compare accuracy against over-abstention.
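The profiling step in the list above could be sketched as follows. This is a minimal example that assumes each judged result is a dict with the hypothetical keys `category`, `correct`, `abstained`, and `answerable`; the benchmark's own output format may differ.

```python
from collections import defaultdict


def profile(results):
    """Return (per-category accuracy, over-abstention rate).

    Over-abstention is counted here as abstaining on an item whose
    context does contain the answer (an assumed definition).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    answerable = 0
    over_abstained = 0
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
        if r["answerable"]:
            answerable += 1
            over_abstained += int(r["abstained"])
    accuracy = {cat: hits[cat] / totals[cat] for cat in totals}
    rate = over_abstained / answerable if answerable else 0.0
    return accuracy, rate
```

Running this before and after a prompt change shows whether a gain in one category was bought with more abstentions on answerable questions.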
Risks & Boundaries
Limitations
Small dataset (114 items) and imbalanced aspect coverage.
Relies on an LLM-as-a-Judge (GPT-4.1), which can introduce automatic-evaluator bias.
When Not To Use
As the only benchmark for large-scale model selection—sample size is small.
To evaluate retriever ranking or retrieval latency.
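To see why the small sample size matters for model selection, a confidence interval on the reported accuracy is a useful check. The sketch below computes a Wilson score interval. Treating the 0.872 score as a proportion over all 114 items is an assumption made for illustration; the summary does not state the exact denominator behind that figure.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed inputs: reported overall accuracy and total item count.
low, high = wilson_interval(0.872, 114)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 0.80 to 0.92, which is wide enough to cover both the 0.872 and 0.859 scores in the results table, so small gaps between models should be read with caution.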
Failure Modes
Over-abstention when models are uncertain despite answerable context.
Unit and scale conversion errors (e.g., MB vs GB, million vs billion).
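A unit or scale slip of the kind listed above can be caught by converting both the reference and the model answer to a base unit before comparing them. The following is a minimal sketch; the `FACTORS` table, the helper names, and the choice of decimal prefixes (1 GB = 1000 MB) are all assumptions, and real answers will need a broader parser.

```python
import math
import re

# Assumed multipliers; decimal (SI) prefixes are used for storage units.
FACTORS = {
    "kb": 1e3, "mb": 1e6, "gb": 1e9, "tb": 1e12,
    "thousand": 1e3, "million": 1e6, "billion": 1e9,
}
_QUANTITY = re.compile(r"([-+]?\d[\d,]*\.?\d*)\s*([A-Za-z]+)")


def to_base(text):
    """Return the first recognized quantity in text as a base-unit float, or None."""
    for match in _QUANTITY.finditer(text):
        unit = match.group(2).lower()
        if unit in FACTORS:
            value = float(match.group(1).replace(",", ""))
            return value * FACTORS[unit]
    return None


def same_quantity(a, b, rel_tol=1e-9):
    """True if both strings contain a recognized quantity of equal magnitude."""
    x, y = to_base(a), to_base(b)
    if x is None or y is None:
        return False
    return math.isclose(x, y, rel_tol=rel_tol)
```

For example, `same_quantity("1.5 GB", "1,500 MB")` treats the two answers as equal, while `same_quantity("2 million", "2 billion")` flags the scale error.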

