RAGBench: 100k explainable RAG examples plus TRACe — practical metrics to audit retriever+generator systems

June 25, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.6

Citation Count

8

Authors

Robert Friel, Masha Belyi, Atindriyo Sanyal

Links

Abstract / PDF

Why It Matters For Business

RAGBench + TRACe gives a unified, explainable way to audit retriever and generator components, reducing costly trial-and-error and surfacing whether errors come from the retriever, the generator, or both.

Summary TLDR

RAGBench is a 100k-example, multi-domain dataset and evaluation suite for Retrieval-Augmented Generation (RAG). The authors introduce TRACe — four actionable metrics (Utilization, Relevance, Adherence, Completeness) — and release the labeled data and code. Labels are created with a GPT-4 annotator and validated against human judgments. A 400M-parameter DeBERTa model fine-tuned on RAGBench outperforms few-shot LLM judges on several RAG evaluation tasks on the provided test splits. The benchmark targets industry-style docs (manuals, contracts, papers) and aims to make RAG evaluation more granular and reproducible.

Problem Statement

There is no unified, large-scale, cross-domain benchmark or set of explainable metrics for evaluating RAG systems. Existing datasets are small, label sets are inconsistent, and many evaluation pipelines use LLMs to label data, which hinders reproducibility and practical system tuning.

Main Contribution

A large standardized RAG dataset (RAGBench) of ~100k examples from 12 component datasets across five industry-relevant domains.

TRACe — a concise, explainable RAG evaluation framework: Utilization, Relevance, Adherence, Completeness.

An automated GPT-4-based annotation pipeline (LLM-annotator) with validation against human labels.

Baselines and experiments showing a fine-tuned 400M DeBERTa model outperforms few-shot LLM judges on the benchmark.

Public release of dataset and inference/eval code on Hugging Face and GitHub.

Key Findings

RAGBench totals approximately 100k labeled RAG examples.

Numbers: 100k total; Train 78k / Val 12k / Test 11k

TRACe formalizes four RAG metrics that separate retriever vs generator behavior.

Numbers: Metrics: Utilization, Relevance, Adherence, Completeness

GPT-4 annotator aligns strongly with humans on DelucionQA for adherence and span labels.

Numbers: Example-level adherence Acc 0.93, span-level Acc 0.95; Utilization F1 0.92

A fine-tuned DeBERTa judge outperforms zero-/few-shot LLM judges on hallucination detection across domains.

Numbers: DeBERTa AUROC 0.64–0.86 vs GPT-3.5 ~0.51–0.65 on test splits

Context relevance is harder to predict than utilization.

Numbers: Relevance RMSE range ~0.08–0.27; Utilization RMSE range ~0.04–0.23

Component datasets show varied hallucination rates.

Numbers: Hallucination fractions span 1%–20%; e.g., CovidQA 16%, ExpertQA 12%, CUAD ~1%
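
The four TRACe metrics named in the findings above can be illustrated with a small sketch. It assumes each context token carries two boolean labels (relevant to the query, utilized by the response) and each response sentence carries a "supported" flag. The function name, the input layout, and the exact ratios are illustrative assumptions, not necessarily the paper's formulas or the released code.

```python
from dataclasses import dataclass


@dataclass
class TraceScores:
    relevance: float     # share of context tokens relevant to the query
    utilization: float   # share of context tokens the response actually used
    completeness: float  # share of relevant tokens that were also used
    adherence: bool      # True only if every response sentence is supported


def trace_scores(relevant, utilized, sentence_supported):
    """Compute illustrative TRACe-style scores from token and sentence labels.

    relevant, utilized: per-context-token booleans of equal length.
    sentence_supported: per-response-sentence booleans.
    """
    if len(relevant) != len(utilized):
        raise ValueError("token label lists must have equal length")
    n = len(relevant)
    n_relevant = sum(relevant)
    n_utilized = sum(utilized)
    n_both = sum(r and u for r, u in zip(relevant, utilized))
    return TraceScores(
        relevance=n_relevant / n if n else 0.0,
        utilization=n_utilized / n if n else 0.0,
        completeness=n_both / n_relevant if n_relevant else 0.0,
        adherence=all(sentence_supported),
    )


# Hypothetical labels for a 10-token context and a 2-sentence response.
relevant = [True] * 4 + [False] * 6
utilized = [True, True] + [False] * 7 + [True]
scores = trace_scores(relevant, utilized, sentence_supported=[True, True])
print(scores)
```

In this toy case relevance is 0.4, utilization is 0.3, and completeness is 0.5: the retriever returned mostly irrelevant tokens, and the generator used only half of the relevant ones.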

Results

Dataset size

Value: ≈100k examples (Train 78k / Val 12k / Test 11k)

GPT-4 annotator alignment (adherence)

Value: Example-level Acc 0.93; Span-level Acc 0.95

Baseline: human labels

Hallucination detection (AUROC)

Value: DeBERTa AUROC 0.64–0.86 across domains

Baseline: GPT-3.5 judges ~0.51–0.65

Relevance prediction error

Value: RMSE ≈0.08–0.27 depending on domain
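
For readers less familiar with the two headline statistics in these results, here is a minimal pure-Python sketch of AUROC (used for hallucination detection) and RMSE (used for relevance and utilization prediction). The toy labels and scores are made up for illustration; in practice a library implementation such as scikit-learn's would be the usual choice.

```python
import math


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    labels: 1 for positive (e.g., hallucinated), 0 for negative.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(targets, preds):
    """Root mean squared error between target and predicted values."""
    if len(targets) != len(preds) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, preds)]
    return math.sqrt(sum(squared) / len(squared))


# Toy judge scores: higher means "more likely hallucinated".
labels = [1, 0, 1, 0]
judge_scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, judge_scores))  # 0.75: three of four pos/neg pairs ranked correctly

# Toy relevance targets vs. predictions.
print(rmse([0.0, 0.0], [0.3, 0.4]))
```

An AUROC of 0.5 corresponds to random ranking, which is why the ~0.51–0.65 baseline range sits close to chance on some domains.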

What To Try In 7 Days

Run TRACe metrics on a small sample of your RAG queries to separate retriever vs generator issues.

Use the RAGBench Hugging Face dataset to fine-tune a small NLI-style judge (e.g., DeBERTa) and compare it against your LLM-prompted judges.

Validate a GPT-4 labeling pipeline on a 200–500 human-annotated subset before scaling auto-labeling for your domain.
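
The first step above, separating retriever issues from generator issues, could look like the following sketch once TRACe-style scores exist for each query. The `triage` function, its input fields, and both thresholds are hypothetical placeholders chosen for illustration; suitable cut-offs depend on your documents and chunk sizes.

```python
from collections import Counter


def triage(relevance, completeness, adherence,
           min_relevance=0.1, min_completeness=0.5):
    """Attribute a failure to the retriever, the generator, both, or neither.

    Thresholds are assumed placeholders, not values from the paper.
    """
    # Retriever fault: almost nothing in the retrieved context is relevant.
    retriever_fault = relevance < min_relevance
    # Generator fault: unsupported claims, or relevant context left unused.
    generator_fault = (not adherence) or (
        not retriever_fault and completeness < min_completeness
    )
    if retriever_fault and generator_fault:
        return "both"
    if retriever_fault:
        return "retriever"
    if generator_fault:
        return "generator"
    return "ok"


# Hypothetical per-query scores from a small evaluation sample.
sample = [
    {"relevance": 0.02, "completeness": 0.0, "adherence": True},
    {"relevance": 0.40, "completeness": 0.2, "adherence": True},
    {"relevance": 0.35, "completeness": 0.9, "adherence": False},
    {"relevance": 0.40, "completeness": 0.9, "adherence": True},
]
summary = Counter(triage(**row) for row in sample)
print(summary)
```

Counting the categories over a few hundred queries gives a rough picture of where tuning effort is likely to pay off first.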

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Labels are generated primarily by a GPT-4 annotator; despite high alignment with humans, automatic labels can mishandle partially supported sentences.
  • Domain coverage is focused on 5 verticals; other domains or languages may behave differently.
  • Relevance is intrinsically harder to predict; expect higher prediction error than for utilization.
  • Some responses were generated with proprietary LLMs (GPT-3.5, Claude), which may limit exact reproducibility for others.
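
Because the labels come mainly from an automatic annotator, one way to quantify the first limitation on your own data is to compare auto labels with human labels on a small subset. The sketch below reports raw accuracy plus Cohen's kappa, which corrects for chance agreement. The function name and the toy labels are illustrative assumptions; the paper's own validation procedure may differ.

```python
from collections import Counter


def agreement(human, auto):
    """Return (accuracy, Cohen's kappa) between two label sequences."""
    if len(human) != len(auto) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    h_counts, a_counts = Counter(human), Counter(auto)
    expected = sum(h_counts[c] * a_counts[c] for c in h_counts) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


# Hypothetical adherence labels (1 = supported, 0 = hallucinated).
human = [1, 1, 0, 0, 1, 0, 1, 0]
auto = [1, 1, 0, 0, 1, 0, 0, 0]
accuracy, kappa = agreement(human, auto)
print(accuracy, kappa)  # 0.875 0.75
```

Kappa matters most on skewed splits: where only about 1% of responses are hallucinated, an annotator that always answers "supported" would still reach roughly 99% accuracy.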

When Not To Use

  • As undisputed ground truth without human validation — auto-labeler mistakes remain.
  • For non-English RAG systems — benchmark focuses on English sources.
  • If you need judgment on tasks far from QA (e.g., creative generation) where TRACe metrics are not applicable.

Failure Modes

  • Partially supported sentences are labeled inconsistently, causing adherence misclassification.
  • Relevance labels mislead when retrieved docs are semantically related but lack required facts.
  • Long legal documents (CUAD) can exceed the practical limits of some retrieval and generation setups.

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4
  • gpt-4o
  • Claude 3 Haiku
  • DeBERTa-v3-Large (400M)
  • TF-IDF retriever

Metrics

  • Relevance
  • Utilization
  • Adherence
  • Completeness
  • AUROC
  • RMSE

Datasets

  • RAGBench
  • PubMedQA
  • CovidQA-RAG
  • HotpotQA
  • MS MARCO
  • HAGRID
  • ExpertQA
  • CUAD
  • DelucionQA
  • EManual
  • TechQA
  • FinQA
  • TAT-QA

Benchmarks

  • DelucionQA
  • RAGTruth
  • RGB
  • AttributionBench