RAGBench: 100k explainable RAG examples plus TRACe — practical metrics to audit retriever+generator systems

June 25, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and metrics are practical and well-validated against humans, but labels rely on a GPT-4 annotator and domain coverage is limited to the included verticals.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 55%

Authors

Robert Friel, Masha Belyi, Atindriyo Sanyal

Why It Matters For Business

RAGBench + TRACe gives a unified, explainable way to audit retriever and generator components, reducing costly trial-and-error and surfacing whether errors come from the retriever, the generator, or both.

Summary TLDR

RAGBench is a 100k-example, multi-domain dataset and evaluation suite for Retrieval-Augmented Generation (RAG). The authors introduce TRACe — four actionable metrics (Utilization, Relevance, Adherence, Completeness) — and release the labeled data and code. Labels are created with a GPT-4 annotator and validated against human judgments. A 400M-parameter DeBERTa model fine-tuned on RAGBench outperforms few-shot LLM judges on several RAG evaluation tasks on the provided test splits. The benchmark targets industry-style docs (manuals, contracts, papers) and aims to make RAG evaluation more granular and reproducible.

Problem Statement

There is no unified, large-scale, cross-domain benchmark or set of explainable metrics for evaluating RAG systems. Existing datasets are small, label sets are inconsistent, and many evaluation pipelines use LLMs to label data, which hinders reproducibility and practical system tuning.

Main Contribution

A large standardized RAG dataset (RAGBench) of ~100k examples from 12 component datasets across five industry-relevant domains.

TRACe — a concise, explainable RAG evaluation framework: Utilization, Relevance, Adherence, Completeness.

Key Findings

RAGBench totals approximately 100k labeled RAG examples.

Numbers: 100k total; Train 78k / Val 12k / Test 11k

Practical Use: You can train and evaluate RAG evaluators on a single, large cross-domain dataset instead of many small task-specific sets.

Evidence Ref: Table 1, §3.1

TRACe formalizes four RAG metrics that separate retriever vs generator behavior.

Numbers: Metrics: Utilization, Relevance, Adherence, Completeness

Practical Use: Use TRACe to get actionable, component-level signals (e.g., low Relevance → retriever tuning; low Utilization → generator/prompt work).

Evidence Ref: §3.2
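To make the four metrics concrete, here is a minimal sketch of how TRACe-style scores could be computed from token-level labels. The definitions used are assumptions, not taken from this summary: Relevance and Utilization are treated as the fraction of context tokens marked relevant or used, Completeness as the share of relevant tokens that were also used, and Adherence as a boolean over per-sentence support flags. The names `TraceScores` and `trace_scores` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceScores:
    relevance: float
    utilization: float
    completeness: float
    adherence: bool


def trace_scores(
    context_len: int,
    relevant: set[int],
    utilized: set[int],
    response_supported: list[bool],
) -> TraceScores:
    """Compute TRACe-style scores from token-level labels (assumed definitions)."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    # Relevance: fraction of context tokens labeled relevant to the query.
    relevance = len(relevant) / context_len
    # Utilization: fraction of context tokens the response drew on.
    utilization = len(utilized) / context_len
    # Completeness: share of relevant tokens that the response actually used.
    completeness = len(relevant & utilized) / len(relevant) if relevant else 0.0
    # Adherence: every response sentence must be supported by the context.
    adherence = all(response_supported)
    return TraceScores(relevance, utilization, completeness, adherence)


# Toy example: 200 context tokens, 50 relevant, 35 used, one unsupported sentence.
scores = trace_scores(
    context_len=200,
    relevant=set(range(0, 50)),
    utilized=set(range(25, 60)),
    response_supported=[True, True, False],
)
print(scores)
```

In this toy case half of the relevant tokens are used, so Completeness is 0.5, and the single unsupported sentence makes Adherence false.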

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | ≈100k examples (Train 78k / Val 12k / Test 11k) | - | - | RAGBench | Table 1, §3.1 | Table 1
GPT-4 annotator alignment (adherence) | Example-level Acc 0.93; Span-level Acc 0.95 | human labels | - | DelucionQA test | Table 2, §3.4 | Table 2
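The alignment row reports accuracy at two granularities. The sketch below shows one plausible way to compute both, assuming each response is a list of per-sentence support flags and that an example counts as adherent only when every sentence is supported. The function names and toy labels are hypothetical and are not the paper's data.

```python
def example_accuracy(auto: list[list[bool]], human: list[list[bool]]) -> float:
    """Agreement on the whole-response label (adherent = all sentences supported)."""
    if len(auto) != len(human) or not auto:
        raise ValueError("need two non-empty label lists of equal length")
    matches = [all(a) == all(h) for a, h in zip(auto, human)]
    return sum(matches) / len(matches)


def span_accuracy(auto: list[list[bool]], human: list[list[bool]]) -> float:
    """Agreement on individual sentence-level support labels."""
    pairs = []
    for a, h in zip(auto, human):
        if len(a) != len(h):
            raise ValueError("sentence counts differ within an example")
        pairs.extend(zip(a, h))
    return sum(x == y for x, y in pairs) / len(pairs)


# Toy labels: True means the sentence is supported by the retrieved context.
auto_labels = [[True, True], [True, False], [True, True, True]]
human_labels = [[True, True], [False, False], [True, False, True]]

ex_acc = example_accuracy(auto_labels, human_labels)
sp_acc = span_accuracy(auto_labels, human_labels)
print(f"example-level: {ex_acc:.2f}, span-level: {sp_acc:.2f}")
```

The two numbers can diverge: a single disputed sentence flips an example-level label but costs only one of many span-level comparisons.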

What To Try In 7 Days

Run TRACe metrics on a small sample of your RAG queries to separate retriever vs generator issues.

Use the RAGBench Hugging Face dataset to fine-tune a small NLI-style judge (e.g., DeBERTa) and compare to your LLM prompts.

Validate a GPT-4 labeling pipeline on a 200–500 human-annotated subset before scaling auto-labeling for your domain.
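For the first step, a small triage pass over scored queries might look like the sketch below. It follows the mapping given earlier (low Relevance points to the retriever, low Utilization to the generator or prompt) and additionally treats failed Adherence as a generator issue, which is an assumption. The 0.2 cutoff and the sample scores are hypothetical and would need tuning on your own data.

```python
from collections import Counter

LOW = 0.2  # hypothetical cutoff; tune on your own data


def triage(relevance: float, utilization: float, adherence: bool,
           low: float = LOW) -> list[str]:
    """Return the components that look suspect for one scored query."""
    suspects = []
    if relevance < low:
        # Little of the retrieved context is relevant: look at the retriever.
        suspects.append("retriever")
    if utilization < low or not adherence:
        # Context is barely used, or the answer is ungrounded: look at the generator.
        suspects.append("generator")
    return suspects


# Made-up scores for four queries.
sample = [
    {"relevance": 0.05, "utilization": 0.04, "adherence": True},
    {"relevance": 0.60, "utilization": 0.10, "adherence": True},
    {"relevance": 0.55, "utilization": 0.40, "adherence": False},
    {"relevance": 0.70, "utilization": 0.50, "adherence": True},
]

counts = Counter(s for row in sample for s in triage(**row))
print(counts)
```

Tallying suspects across a sample gives a rough picture of where tuning effort is likely to pay off before any larger evaluation run.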

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Labels are generated primarily by a GPT-4 annotator; despite high alignment with human labels, automatic labels can mishandle partially supported sentences.

Domain coverage is focused on five verticals; other domains or languages may behave differently.

When Not To Use

As undisputed ground truth without human validation — auto-labeler mistakes remain.

For non-English RAG systems — benchmark focuses on English sources.

Failure Modes

Partially supported sentences are labeled inconsistently, causing adherence misclassification.

Relevance labels mislead when retrieved docs are semantically related but lack required facts.

Core Entities

Models

gpt-3.5-turbo, gpt-4, gpt-4o, Claude 3 Haiku, DeBERTa-v3-Large (400M), TF-IDF retriever

Metrics

Relevance, Utilization, Adherence, Completeness, AUROC, RMSE

Datasets

RAGBench, PubMedQA, CovidQA-RAG, HotpotQA, MS Marco, HAGRID, ExpertQA, CUAD, DelucionQA, EManual, TechQA, FinQA, TAT-QA

Benchmarks

DelucionQA, RAGTruth, RGB, AttributionBench