Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.6
Citation Count
8
Why It Matters For Business
RAGBench and TRACe together give a unified, explainable way to audit retriever and generator components. This reduces costly trial-and-error by showing whether errors originate in the retriever, the generator, or both.
Summary TLDR
RAGBench is a 100k-example, multi-domain dataset and evaluation suite for Retrieval-Augmented Generation (RAG). The authors introduce TRACe, a set of four actionable metrics (Utilization, Relevance, Adherence, Completeness), and release the labeled data and code. Labels are created with a GPT-4 annotator and validated against human judgments. A 400M-parameter DeBERTa model fine-tuned on RAGBench outperforms few-shot LLM judges on several RAG evaluation tasks on the provided test splits. The benchmark targets industry-style documents (manuals, contracts, papers) and aims to make RAG evaluation more granular and reproducible.
Problem Statement
There is no unified, large-scale, cross-domain benchmark or set of explainable metrics for evaluating RAG systems. Existing datasets are small, label sets are inconsistent, and many evaluation pipelines use LLMs to label data, which hinders reproducibility and practical system tuning.
Main Contribution
A large standardized RAG dataset (RAGBench) of ~100k examples from 12 component datasets across five industry-relevant domains.
TRACe: a concise, explainable RAG evaluation framework built on four metrics (Utilization, Relevance, Adherence, Completeness).
An automated GPT-4-based annotation pipeline (LLM-annotator) with validation against human labels.
Baselines and experiments showing that a fine-tuned 400M-parameter DeBERTa model outperforms few-shot LLM judges on the benchmark.
Public release of dataset and inference/eval code on Hugging Face and GitHub.
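As a rough illustration of how the four TRACe scores could be derived from span-level labels, the Python sketch below assumes each example carries index sets marking which context units (tokens or sentences) are relevant and which were utilized, plus a per-sentence support flag for the response. The `TraceLabels` class, its field names, and the exact formulas are assumptions made for illustration; they are not the released RAGBench schema or code.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class TraceLabels:
    """Hypothetical span-level labels for a single RAG example."""
    context_len: int        # number of context units (tokens or sentences)
    relevant: Set[int]      # context units relevant to the query
    utilized: Set[int]      # context units the response actually drew on
    supported: List[bool]   # per response sentence: grounded in the context?


def trace_scores(labels: TraceLabels) -> Dict[str, float]:
    """Compute illustrative TRACe scores from span-level labels."""
    n = labels.context_len
    # Relevance: share of the retrieved context that is relevant (retriever side).
    relevance = len(labels.relevant) / n if n else 0.0
    # Utilization: share of the retrieved context the response used (generator side).
    utilization = len(labels.utilized) / n if n else 0.0
    # Completeness: share of the relevant context that the response used.
    if labels.relevant:
        completeness = len(labels.relevant & labels.utilized) / len(labels.relevant)
    else:
        completeness = 0.0
    # Adherence: 1.0 only if every response sentence is supported by the context.
    adherence = float(all(labels.supported))
    return {
        "relevance": relevance,
        "utilization": utilization,
        "completeness": completeness,
        "adherence": adherence,
    }


example = TraceLabels(
    context_len=10,
    relevant={0, 1, 2, 3},
    utilized={2, 3, 4},
    supported=[True, True, False],
)
scores = trace_scores(example)
print(scores)
```

Under these assumed definitions, a low relevance score points at the retriever, while low completeness or adherence alongside adequate relevance points at the generator.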
Key Findings
RAGBench totals approximately 100k labeled RAG examples.
TRACe formalizes four RAG metrics that separate retriever vs generator behavior.
The GPT-4 annotator aligns strongly with human annotators on DelucionQA for adherence and span labels.
A fine-tuned DeBERTa judge outperforms zero-/few-shot LLM judges on hallucination detection across domains.
Context relevance is harder to predict than utilization.
Component datasets show varied hallucination rates.
Results
Dataset size
GPT-4 annotator alignment (adherence)
Hallucination detection (AUROC)
Relevance prediction error
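For readers unfamiliar with the two scoring functions named in these results, here is a minimal pure-Python sketch of AUROC (for a binary task such as hallucination detection) and RMSE (for a continuous target such as a relevance score). The label convention (1 = hallucinated), the function names, and the toy numbers are assumptions for illustration and do not reproduce any figure from the paper.

```python
import math
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of examples, which is
    fine for small audits but slow for very large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def rmse(targets: Sequence[float], preds: Sequence[float]) -> float:
    """Root mean squared error between target and predicted scores."""
    if len(targets) != len(preds) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)
    )


# Toy data: 1 = hallucinated response, score = judge's hallucination probability.
toy_auroc = auroc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
# Toy data: annotated vs predicted relevance for two examples.
toy_rmse = rmse([0.5, 0.2], [0.3, 0.4])
print(toy_auroc, toy_rmse)
```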
Who Should Care
What To Try In 7 Days
Run TRACe metrics on a small sample of your RAG queries to separate retriever vs generator issues.
Use the RAGBench Hugging Face dataset to fine-tune a small NLI-style judge (e.g., DeBERTa) and compare it against your LLM-prompt judges.
Validate a GPT-4 labeling pipeline on a human-annotated subset of 200–500 examples before scaling auto-labeling for your domain.
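One way to carry out the validation step is to compare auto-generated labels with human labels on the same examples, reporting raw agreement and Cohen's kappa (agreement corrected for chance). The sketch below is a minimal version under the assumption that both label sets are binary adherence flags aligned by position; the function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def agreement_report(human: Sequence[Hashable], auto: Sequence[Hashable]) -> Dict[str, float]:
    """Raw agreement and Cohen's kappa between human and auto labels."""
    if len(human) != len(auto) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of examples where both labelers agree.
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Expected agreement by chance, from each labeler's marginal label frequencies.
    h_counts, a_counts = Counter(human), Counter(auto)
    classes = set(h_counts) | set(a_counts)
    expected = sum(h_counts[c] * a_counts[c] for c in classes) / (n * n)
    # If both labelers always emit the same single class, kappa is defined as 1 here.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {"n": float(n), "agreement": observed, "kappa": kappa}


# Toy adherence labels: 1 = response adheres to the context, 0 = it does not.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1]
auto_labels = [1, 1, 0, 1, 1, 0, 0, 1]
report = agreement_report(human_labels, auto_labels)
print(report)
```

Kappa is worth reporting alongside raw agreement because adherence labels are often imbalanced, and a labeler that always predicts the majority class can score high agreement while carrying no signal.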
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Labels are generated primarily by a GPT-4 annotator; despite high alignment with humans, automatic labels can mishandle partially supported sentences.
- Domain coverage is limited to five verticals; other domains or languages may behave differently.
- Relevance is intrinsically harder to predict than utilization, so expect higher prediction error for it.
- Some responses were generated with proprietary LLMs (GPT-3.5, Claude), which may limit exact reproducibility.
When Not To Use
- As undisputed ground truth without human validation: auto-labeler mistakes remain.
- For non-English RAG systems: the benchmark focuses on English sources.
- For tasks far from QA (e.g., creative generation), where TRACe metrics are not applicable.
Failure Modes
- Partially supported sentences are labeled inconsistently, causing adherence misclassification.
- Relevance labels can mislead when retrieved documents are semantically related to the query but lack the required facts.
- Long legal documents (CUAD) can exceed the limits of some generation and retrieval components.
Core Entities
Models
- gpt-3.5-turbo
- gpt-4
- gpt-4o
- Claude 3 Haiku
- DeBERTa-v3-Large (400M)
- TF-IDF retriever
Metrics
- Relevance
- Utilization
- Adherence
- Completeness
- AUROC
- RMSE
Datasets
- RAGBench
- PubMedQA
- CovidQA-RAG
- HotpotQA
- MS MARCO
- HAGRID
- ExpertQA
- CUAD
- DelucionQA
- EManual
- TechQA
- FinQA
- TAT-QA
Benchmarks
- DelucionQA
- RAGTruth
- RGB
- AttributionBench

