Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
LIT-RAGBench pinpoints generator weaknesses you will hit in production (table parsing, multi-document reasoning, numeric slips, and over-abstention) so you can choose models and fixes before deployment.
Summary TLDR
LIT-RAGBench is a compact, human-curated benchmark that tests the Generator part of RAG systems across five practical skills: Integration (multi-source use), Reasoning (multi-hop and numeric), Logic (semantic/constraint interpretation), Table (structured data), and Abstention (withhold when evidence is missing). The dataset contains 114 questions (54 Japanese main + 54 English-translated plus abstention variants). Evaluation uses randomized chunk order, fictional entities to avoid memorization, and an LLM-as-a-Judge (GPT-4.1). Across tested API and open models no system exceeded 0.90 overall accuracy; GPT-5 scored 0.872. The benchmark highlights concrete failure modes (unit mismatch, merged‑
Problem Statement
Generators in Retrieval-Augmented Generation must read multiple retrieved documents, reason across them, interpret tables, and avoid hallucinating. Existing benchmarks test pieces of this behavior in isolation. LIT-RAGBench fills the gap by combining practical failure patterns into a single, controllable dataset that isolates the Generator from Retriever errors.
Main Contribution
Definition of five practical Generator categories: Integration, Reasoning, Logic, Table, Abstention.
A human-curated dataset of 114 QA items (Japanese + English translation) using fictional entities to prevent memorized answers.
An evaluation protocol that randomizes chunk order and isolates Generator performance using LLM-as-a-Judge (GPT-4.1).
Open release of dataset, prompts, and code to reproduce experiments.
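The randomized chunk order in the evaluation protocol can be illustrated with a minimal sketch. The function name `build_prompt`, the `[Document N]` labels, and the prompt wording are assumptions made for illustration; they are not the benchmark's released prompts or code.

```python
import random


def build_prompt(question, chunks, seed):
    """Shuffle retrieved chunks with a fixed seed, then format a prompt.

    Hypothetical helper: the prompt layout is an assumption, not the
    layout used by LIT-RAGBench.
    """
    rng = random.Random(seed)   # local RNG so each run is reproducible
    shuffled = list(chunks)     # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    context = "\n\n".join(
        f"[Document {i + 1}]\n{chunk}" for i, chunk in enumerate(shuffled)
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Seeding per question keeps a run repeatable while still removing any positional hint about which chunk holds the answer.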
Key Findings
Benchmark size and composition.
No evaluated model exceeded 90% overall accuracy.
Top-performing model (API) scored 0.872 overall.
Table understanding is a notable weak point when tables are split or merged.
Some models over-abstain, which exposes a tradeoff between safety and usefulness.
Numeric reasoning varies by model; some models make arithmetic slip-ups.
Automated scoring used an LLM judge and is feasible for closed-ended checks.
Results
Accuracy
Top open-weight model overall
Accuracy
Over-Abstention Rate (highest avg)
Who Should Care
What To Try In 7 Days
Run LIT-RAGBench on your generator to profile category-specific errors.
Add preprocessing: normalize units, merge table chunks, and add headers before retrieval.
Adjust prompts to enforce unit formats and calibrate abstention thresholds, then retest accuracy against over-abstention.
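One way to approach the unit-normalization step is sketched below. It rewrites storage sizes to a single unit before the text is indexed. The helper `normalize_sizes`, the decimal multipliers, and the choice of megabytes as the target unit are all assumptions for illustration.

```python
import re

# Assumed decimal multipliers; use 1024-based values if your data is binary.
_TO_MB = {"KB": 0.001, "MB": 1.0, "GB": 1000.0, "TB": 1_000_000.0}
_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\b")


def normalize_sizes(text):
    """Rewrite every storage size found in `text` as megabytes."""
    def to_mb(match):
        value = float(match.group(1)) * _TO_MB[match.group(2)]
        # .10g avoids scientific notation for values such as 1000000
        return f"{value:.10g} MB"

    return _SIZE.sub(to_mb, text)
```

Normalizing once at ingestion means the generator compares like with like, rather than having to convert between MB and GB while answering.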
Agent Features
Memory
- Retrieval Memory
Tool Use
- LLM-as-a-Judge
Frameworks
- LIT-RAGBench
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Small dataset (114 items) and imbalanced aspect coverage.
- Relies on LLM-as-a-Judge (GPT-4.1), which can introduce automatic-evaluator bias.
- English set produced via machine translation and curation; possible translation artifacts.
- Focuses on Generator isolation; does not measure retriever performance end-to-end.
When Not To Use
- As the only benchmark for large-scale model selection, because the sample size is small.
- To evaluate retriever ranking or retrieval latency.
- For domains requiring real-world factual grounding rather than fictional test cases.
Failure Modes
- Over-abstention when models are uncertain despite answerable context.
- Unit and scale conversion errors (e.g., MB vs GB, million vs billion).
- Failure to merge or interpret split/merged table cells across chunks.
- Numeric arithmetic slip-ups in multi-step calculations.
- Relying on lexical cues only and missing semantically equivalent evidence.
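The split-table failure mode is often mitigated by repeating the header row in every chunk. The sketch below assumes tables are stored as pipe-delimited text rows and that the header row is known in advance. `add_headers` is a hypothetical helper, not part of the benchmark.

```python
def add_headers(chunks, header):
    """Prefix each table chunk with the header row if it is missing.

    Assumes each chunk is a block of pipe-delimited rows, one per line.
    """
    header = header.strip()
    fixed = []
    for chunk in chunks:
        lines = chunk.strip().splitlines()
        # Continuation chunks start with a data row, so add the header.
        if not lines or lines[0].strip() != header:
            lines.insert(0, header)
        fixed.append("\n".join(lines))
    return fixed
```

With the header present in every chunk, a row retrieved on its own still carries its column names.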
Core Entities
Models
- GPT-5
- o3
- o4-mini
- GPT-4.1
- Gemini-2.5-Flash
- Claude-Sonnet-4
- Qwen3-235B-A22B-Instruct
- Qwen3-235B-A22B-Thinking
- Llama-3.1-8B-Instruct
- Llama-3.3-70B-Instruct
- Gemma-3-27B-Instruct
Metrics
- Accuracy
- Over-Abstention Rate
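A minimal sketch of how the two metrics might be computed from judged outputs follows. The summary does not give an exact definition of Over-Abstention Rate, so this assumes it is the share of answerable questions on which the model abstained. The record keys `correct`, `answerable`, and `abstained` are hypothetical.

```python
def score(records):
    """Return (accuracy, over_abstention_rate) for judged records.

    Each record is a dict with hypothetical keys:
      correct    - bool, the judge's verdict
      answerable - bool, whether the context contains the answer
      abstained  - bool, whether the model declined to answer
    """
    if not records:
        raise ValueError("no records to score")
    accuracy = sum(r["correct"] for r in records) / len(records)
    # Assumed definition: abstentions counted over answerable items only.
    answerable = [r for r in records if r["answerable"]]
    if answerable:
        over_abstention = sum(r["abstained"] for r in answerable) / len(answerable)
    else:
        over_abstention = 0.0
    return accuracy, over_abstention
```

Reporting both numbers together shows whether a gain in abstention accuracy was bought by refusing questions the context could have answered.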
Datasets
- LIT-RAGBench (114 QA; Japanese + English translation)
Benchmarks
- FRAMES
- RAGBench
- RAGTruth
- RGB
- HotpotQA
- JEMHopQA
- MMQA

